Note: The left side frame may take a few seconds (< 30) to come up. Please be patient.

Overview

This web page presents a simple tool for comparing the interprocess communication overhead of a variety of distributed computing technologies. The tests are conducted as follows. Every test involves two processes (in the Unix sense); roughly speaking, one is a client and the other a server. The client initiates a request to transmit a number of bytes to the server. The call is synchronous, and the client waits for the request to complete. The server performs no processing; it merely returns, and the request completes as the reply winds back to the client process. The duration of this round-trip invocation is measured using gettimeofday(). For Java applications gettimeofday() is called via JNI, since the standard Java clock routine does not have the resolution of gettimeofday() (at least through Java 1.3). The overhead of calling gettimeofday() is approximately 0.5 usec for C/C++ programs and approximately 0.66 usec for Java programs using JNI. The round-trip times are recorded in a histogram; this record operation adds a further overhead of approximately 0.25 usec per sample. These overheads are small relative to the intervals we are measuring. (Possible exceptions are some shared memory tests, where round-trip latencies fall in the 5-6 usec range; there an overhead of ~1 usec is large enough to be of some concern. In most cases, however, we are measuring intervals in the tens, if not hundreds, of usec, and the measurement and recording overheads are negligible.)
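In outline, the client-side timing loop looks like the sketch below. This is a minimal illustration, not the actual benchmark code; send_request() and record_sample() are hypothetical stand-ins for the transport under test and the histogram recorder.

  #include <stdio.h>
  #include <sys/time.h>

  /* Hypothetical placeholders: the real harness substitutes the
     transport under test (TCP, an ORB, RMI, ...) and a histogram
     recorder here. */
  static void send_request(int nbytes) { (void)nbytes; }
  static void record_sample(long long usec) { printf("%lld usec\n", usec); }

  /* Current time in microseconds, via gettimeofday(). */
  static long long now_usec(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
  }

  int main(void)
  {
      for (int i = 0; i < 1000; i++) {
          long long t0 = now_usec();
          send_request(64);                /* synchronous; blocks until the reply returns */
          record_sample(now_usec() - t0);  /* the recording itself costs ~0.25 usec */
      }
      return 0;
  }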

To the extent possible we have tried to stick to our central notion: measuring average and worst case latencies for transporting n bytes from process A to process B. This holds even where we have tested database technologies. In such tests "n" bytes are first sent from process A to a database. Process B is then notified (often via a byte sent down a socket to wake it up) that it should go read "n" bytes from the database. Upon completing this read, process B signals back to process A that the read is done, which completes one cycle of sending information from process A to process B. Naturally, the 1 byte socket write from process A to wake up process B adds some extra overhead to this measurement. However, measurements show that this overhead is less than 5% of the overall time measured. Thus, while not perfect, this technique lets us quantify the latency of exchanging "n" bytes between processes via a datastore.
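The wake-up and acknowledgement halves of that cycle might look like the following sketch. It assumes notify_fd and ack_fd are already-connected socket descriptors; the database read and write calls are elided since they depend on the datastore under test.

  #include <unistd.h>

  /* Process A: after writing "n" bytes to the database, send a
     single byte to wake process B.  This write accounts for less
     than 5% of the measured cycle. */
  void notify_reader(int notify_fd)
  {
      char wake = 1;
      write(notify_fd, &wake, 1);
  }

  /* Process B: block until woken, read the "n" bytes from the
     database, then signal completion back to process A. */
  void reader_cycle(int notify_fd, int ack_fd)
  {
      char byte;
      read(notify_fd, &byte, 1);   /* wait for A's wake-up byte */
      /* ... read "n" bytes from the database here ... */
      write(ack_fd, &byte, 1);     /* tell A the read is done */
  }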

We use three categories of platforms for our tests.

The two test configurations (a single SMP host, and two hosts connected over a network) are shown in the following figures.

SMP Configuration


Two Hosts Configuration

Under each hardware setup there are multiple categories of tests. These include various network transports (only TCP for now, but SCTP and others to follow), a variety of CORBA ORBs, some CORBA Components implementation measurements, RMI, and some EJB tests.

Most of the ORBs are tested with three different types of "in" argument in the method call: "octet", "long", and "struct" (the exact signatures appear in the idl files linked below).

By doing this we hope to get a measure of ORB overhead for marshaling and unmarshaling. However, since we use only a single, relatively simple structure, this is not a comprehensive study of marshaling behavior across ORBs. The idl files used in our tests can be found here. If a result is not specifically labeled "octet", "long" or "struct", it is an "octet" result. Some results are labeled "octet_orig", which signifies that a slightly older version of the "octet" program was used. The two results should be nearly identical; where they are not, the discrepancy is being (or needs to be) investigated.

Some results have "_opt" or "_default" tagged onto their names. This signifies whether a svc.conf file providing various optimizations was applied, which is especially relevant for many of our TAO tests. "_default" means no svc.conf file was used.

How To Generate Plots

To obtain a plot, click the checkbox for each result you want to see on your graph and hit the "Plot" button. For many of the entries the name of the checkbox option is itself an HTML link; following it takes you to detailed results for that particular test. In the future we hope to add a powerful SQL-like capability for generating graphs.

As the number of available results has grown, we have added a "filter" capability that can be used to narrow them to a subset of interest. This can often be of considerable use. Please follow the "filter" link near the top of the left frame for an explanation of this feature.

The above procedure generates a graph showing curves of "mean" round-trip latencies. It is often useful to see the range of observations as well, by which we mean the (min, mean, max) values shown as an error bar. (We use gnuplot to generate these plots.) To see the ranges, select the "Range (< 5)" check box near the top. The "< 5" means that range values can be shown for at most 4 curves at once; beyond that the graph becomes very cluttered. (This restriction may be relaxed in the future.)

Besides the "mean" value it is sometimes useful to plot just either the minimum or the maximum values for a test. Checkboxes are provided near the top of the left hand panel to make these selections.

By default the CGI script permits gnuplot to auto-select the Y-axis range. However, the ymin and ymax values can be set manually at the bottom of the left frame (the frame showing all the check boxes for the possible selections). To specify only one of ymin or ymax, enter a "*" for the other, which permits gnuplot to autoselect that value. For example, ymin=* and ymax=3000 caps the Y axis at 3000 but lets gnuplot autoselect ymin based on the values being graphed.

After you hit the "Plot" button you should see a graphic in the right hand panel. This is a .png file generated by gnuplot. Plots are ordered from low latency values to high. Each individual curve is labeled. To keep the label strings as compact as possible, a bit of processing factors the common elements out of the graph labels and shows them in the overall title at the top of the graphic. For example, if you plot the graphs:

  smp/orb/mico/2.3.6           and
  smp/orb/tao/1.2 

the common text in the title will be

  smp/orb/./.

where each "." marks an element that differs between the plotted tests. The titles of the two curves will be "mico/2.3.6" and "tao/1.2". A non-leading "." element in a curve's legend stands for a common element, the one shown at the corresponding position in the title.

Besides making the legend text more compact, this processing has the added benefit of highlighting what is common to all the data shown.
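The factoring itself is simple. The sketch below illustrates the idea for two labels; it is not the actual CGI code, which handles an arbitrary number of labels and the legend dots described above.

  #include <stdio.h>
  #include <string.h>

  #define MAX_ELEMS 16

  /* Split a '/'-separated label in place; returns the element count. */
  static int split(char *buf, char *elems[])
  {
      int n = 0;
      for (char *p = strtok(buf, "/"); p != NULL && n < MAX_ELEMS;
           p = strtok(NULL, "/"))
          elems[n++] = p;
      return n;
  }

  int main(void)
  {
      char a[] = "smp/orb/mico/2.3.6", b[] = "smp/orb/tao/1.2";
      char *ea[MAX_ELEMS], *eb[MAX_ELEMS];
      int na = split(a, ea), nb = split(b, eb);
      int n = na < nb ? na : nb;

      /* Title: the common element where the labels agree, "." where
         they differ. */
      printf("title: ");
      for (int i = 0; i < n; i++)
          printf("%s%s", i ? "/" : "",
                 strcmp(ea[i], eb[i]) == 0 ? ea[i] : ".");

      /* Legend for the first curve: only the differing elements. */
      printf("\nlegend: ");
      int first = 1;
      for (int i = 0; i < n; i++)
          if (strcmp(ea[i], eb[i]) != 0) {
              printf("%s%s", first ? "" : "/", ea[i]);
              first = 0;
          }
      printf("\n");   /* prints: title: smp/orb/./.  legend: mico/2.3.6 */
      return 0;
  }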

IPC Latency Measurements under Loaded Conditions

Some of the interprocess latency tests are conducted with varying types of load on the computers involved in the test. There are four test conditions:

  1. Unloaded
    If a test name carries no specific modifier such as "_diskload", then it is an unloaded test. Thus, a test label such as "tcp_C" is the same as "tcp_C_unloaded".
  2. Disk Loaded
    In this test the computers involved in the IPC tests are also subject to a (lower priority) set of processes that are essentially disk I/O bound. (A sketch of such a load process appears after this list.)
  3. Network Loaded
    In this test the computers involved in the IPC tests are also subject to a (lower priority) set of processes that are network I/O bound. In general the network I/O load runs over different interfaces than those carrying the traffic for the IPC latency measurements. The purpose of this test is to measure how much the IPC flow that we care about is impacted by network traffic over the other interface(s) of the computers involved.
  4. Combined Disk I/O Load and Network I/O Load
    In this test both disk load and network load are in force at the same time.
Additional details on the loaded test procedures are available at: Loaded Test Procedures
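For illustration, a disk-bound load process of the kind used in conditions 2 and 4 might look roughly like this sketch. The file name, write size, and nice value are made-up examples; the actual load scripts used on the test machines may differ.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <fcntl.h>

  /* Loop forever doing disk I/O at lowered priority, so that the
     IPC processes under test are preferred by the scheduler. */
  int main(void)
  {
      static char buf[64 * 1024];
      memset(buf, 0xAB, sizeof buf);
      nice(10);                              /* lower our priority */

      for (;;) {
          int fd = open("/tmp/diskload.dat",
                        O_WRONLY | O_CREAT | O_TRUNC, 0644);
          if (fd < 0) { perror("open"); exit(1); }
          for (int i = 0; i < 256; i++)      /* write ~16 MB */
              write(fd, buf, sizeof buf);
          fsync(fd);                         /* force the data to disk */
          close(fd);
      }
  }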

Below we show some pre-generated graphs with brief discussion. Our hope is that this tool will let readers generate graphs making whatever comparisons interest them.


  • Four ORBs (TAO, MICO, JacORB, and OpenORB) are compared with Java RMI.


  • Seven ORBs (ORBExpress, OmniORB, TAO, MICO, JacORB, OpenORB, and the ORB built into JDK 1.4) are compared with TCP/IP.

  • Various Java based interprocess communication technologies are compared. Results include Java TCP without serialization (using OutputStream.write() and InputStream.read()), Java TCP with serialization (using ObjectOutputStream.writeObject() and ObjectInputStream.readObject()), Java RMI, JacORB, OpenORB (both of these are Java ORBs), and the JBOSS EJB server.

  • In Java applications, sending "Byte" (the object type) versus "byte" (the primitive) makes, as expected, a difference. We show results for Java RMI and for the JBOSS EJB server.

  • In this graph we compare various results of using the JBOSS EJB application server/container. The main comparison is between JBOSS intra-container communication latencies and client-to-bean latencies. In the intra-container case everything runs within a single process. In the client-to-bean case there are two separate processes; the bean, of course, is hosted by a JBOSS container process.

    As it turns out, intra-container invocations in JBOSS (at least in the 2.4.x series) merely transfer the byte array by reference. Thus the round-trip latency cost remains flat across all sizes of the byte array (plotted along the x axis). These "by reference" semantics are obviously different from the "copy" semantics used for parameter passing between a client and a bean in two separate processes. To approximate the "copy" semantics, we also show results for intra-container invocations where the receiving bean makes a copy of the byte array sent to it. We see that this curve, shown in blue, essentially tracks the JBOSS client-to-bean (two process) curve, shown in brown. We also show the shared memory communication curve, which forms the bound that intra-container invocation with copying approaches.


  • This graph shows how "range" values can be plotted. Here we show two curves for round-trip latencies experienced by two processes on a Timesys/Linux uniprocessor. The difference between the two runs is that in one case the processes ran at normal, time-shared priority, while in the second case they ran at real-time priority. It can be seen that real-time priorities significantly lower the "max" latencies observed.

  • This graph compares a variety of technologies on two machines connected via 100 Mbps ethernet. This is also our initial look at the EMAA agent toolkit. More information on ATL's EMAA agent toolkit can be found (on LM's internal net) here.

  • This graph shows a comparison of the most recent results for various ORBs we have tested as of June 9, 2006. All tests were performed on our SMP test machine.

  • A graph showing a comparison of RTT for raw TCP sockets with current versions of JBoss EJB3, JacORB, and TAO. These tests were run on a two-host setup on Emulab PC3000 machines with a uniprocessor 2.6.19 kernel over gigabit ethernet on Jan 11-17, 2007.

  • This graph shows the results of running our standard TCP RTT benchmark program on various builds of the 2.6.14 kernel. The intent is to show the overhead of SMP versus non-SMP kernels, with and without the -rt patches. Only these variables were changed; the hardware remained constant. Additionally, we removed network variability by placing both the sender and receiver on a single host. Experiments were run Dec 7-8, 2005.

  • This image shows the trade-off made by the 2.6.20-rt8 kernel: a very slight increase in mean round-trip latency in exchange for significantly reduced maximum latency. For example, at a message size of 4 bytes, the stock 2.6.20 kernel (compiled with the "low latency desktop" level of preemption, in uniprocessor configuration) has a round-trip latency of 59 usec, compared to 63 usec for the 2.6.20-rt8 uniprocessor kernel (compiled with the Complete Preemption option). The corresponding maximum latency values measured over these 1E7 (10,000,000) sample runs were 1491 usec and 686 usec. Thus, in these test runs, a slightly higher mean latency buys us a (much) reduced maximum latency. These tests were conducted in February 2007 between two Emulab PC3000 nodes connected via a gigabit network.