The Discipline of Applied Coherence

Visualizing the Coherence Datagram Test

Figure 1

The graph above was generated from the output of the Coherence Datagram Test utility. The Datagram Test is a tool that sends and receives UDP packets between two or more machines to evaluate the health and performance of the network between them. The test above ran for 100 seconds on two server-class machines, each with a 1 Gb Ethernet connection to the same switch. I think it’s pretty clear from the graph that there is significant packet loss between the two machines. Here’s what the graph looks like on a healthy network:

Figure 2

The difference between the two graphs is very clear. The Coherence Production Checklist suggests running the test before deploying a Coherence application into a real environment, because the Datagram Test can help identify some types of network problems, such as packet loss, that could adversely affect a Coherence application. I have found that most users have a hard time interpreting the raw output of the test, but I think most users who have run the Datagram Test would agree that the above graphs are much easier to understand. Here, I will describe how I generated these graphs, which can come in handy when analyzing a large number of test results.

The first step is to actually run the Datagram Test to generate report data:

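server1$ java -server -cp coherence.jar com.tangosol.net.DatagramTest -local <server1-address> -log <report-file> -txDurationMs 100000 -polite <server2-address>
server2$ java -server -cp coherence.jar com.tangosol.net.DatagramTest -local <server2-address> -log <report-file> -txDurationMs 100000 <server1-address>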

The above pair of commands will run a bi-directional test for 100 seconds, generating a tab-delimited report in the file specified by -log. As of Coherence 3.6, the tab-delimited report spits out aggregated lifetime metrics (accumulated since the test began) every 100,000 received packets by default. For analyzing packet loss, it makes more sense to look at the metrics accumulated between reporting intervals rather than since the beginning of the test, since lifetime metrics could mask spikes that occur later in the test. Luckily, the per-interval metrics we need can be derived from the lifetime metrics by subtracting each report line from the one before it. The following awk script will calculate the additional columns of interest (as well as fix a bug in the test where the data columns don’t align with the header columns due to two missing delimiters):

#!/usr/bin/awk -f

BEGIN {
    FS = "[\t\r\n]";

    # Output row format: 18 tab-separated %s fields, one per printf argument.
    # Non-integer values are written with awk's default CONVFMT precision.
    sRowFormat = "%s";
    for (i = 1; i < 18; i++) {
        sRowFormat = sRowFormat "\t%s";
    }
    sRowFormat = sRowFormat "\n";
}

# Header line
/^publisher/ {
    if (FILENAME == "") {
        FILENAME = "stdin";
    }
    else {
        print("Processing " FILENAME);
    }
    gsub(/[\r\n]/, "", $0);
    header = sprintf("%s\tinterval duration secs\tinterval missing packets\tinterval drop rate\tinterval success rate\tinterval throughput mb/sec", $0);

    # Close the output files of the previous input and reset per-publisher state
    for (outfile in aOutfile) {
        close(aOutfile[outfile]);
    }
    delete aPrevSent;
    delete aPrevReceived;
    delete aPrevMissing;
    delete aPrevDurationMillis;
    delete aDurationOffset;
    delete aOutfile;
    next;
}

# Initialize prev values (first line seen for this publisher)
aPrevSent[$1] == ""  {
    aPrevSent[$1] = 0;
    aPrevReceived[$1] = 0;
    aPrevMissing[$1] = 0;
    aPrevDurationMillis[$1] = 0;
    aDurationOffset[$1] = 0;
    aOutfile[$1] = FILENAME "." substr($1, 2, length($1))  ".csv";
    if (aOutfile[$1] ~ /^stdin/) {
        print(header);
    }
    else {
        print(header) > aOutfile[$1];
    }
}

# Account for packet sequence restart
$2 < aPrevDurationMillis[$1] {
    aPrevSent[$1] = 0;
    aPrevReceived[$1] = 0;
    aPrevMissing[$1] = 0;
    aDurationOffset[$1] += aPrevDurationMillis[$1];
    # Measure the first interval after the restart from zero
    aPrevDurationMillis[$1] = 0;
}

# Skip duplicate lines
$2 == aPrevDurationMillis[$1] {
    next;
}

# Data line: derive the per-interval metrics from the lifetime metrics
{
    # Fields 11 and 13 each hold two values with no delimiter between them;
    # re-insert the missing tab after the first digit
    split($11, aOoo, /^[0-9]/);
    sOoo = sprintf("%s\t%s", substr($11, 1, 1), aOoo[2]);

    split($13, aGapMillis, /^[0-9]/);
    sGapMillis = sprintf("%s\t%s", substr($13, 1, 1), aGapMillis[2]);

    cIntervalDurationMillis = $2 - aPrevDurationMillis[$1];
    cIntervalSent = $6 - aPrevSent[$1];
    cIntervalReceived = $7 - aPrevReceived[$1];
    cIntervalMissing = $8 - aPrevMissing[$1];

    # Guard against division by zero when an interval reports no sent packets
    dflIntervalDropRate = (cIntervalSent > 0) ? cIntervalMissing / cIntervalSent : 0;
    dflIntervalSuccessRate = 1 - dflIntervalDropRate;
    dflIntervalThroughput = (cIntervalDurationMillis > 0) ? (($3 * cIntervalReceived) / (cIntervalDurationMillis / 1000)) / (1024 * 1024) : 0;

    aPrevDurationMillis[$1] = $2;
    aPrevSent[$1] = $6;
    aPrevReceived[$1] = $7;
    aPrevMissing[$1] = $8;

    if (aOutfile[$1] ~ /^stdin/) {
        printf(sRowFormat,
                $1, $2 + aDurationOffset[$1], $3, $4, $5, $6, $7, $8, $9, $10, sOoo, $12, sGapMillis,
                cIntervalDurationMillis / 1000, cIntervalMissing, dflIntervalDropRate, dflIntervalSuccessRate, dflIntervalThroughput);
    }
    else {
        printf(sRowFormat,
                $1, $2 + aDurationOffset[$1], $3, $4, $5, $6, $7, $8, $9, $10, sOoo, $12, sGapMillis,
                cIntervalDurationMillis / 1000, cIntervalMissing, dflIntervalDropRate, dflIntervalSuccessRate, dflIntervalThroughput) > aOutfile[$1];
    }
}

This script will take the output of the -log option and produce a new file. Assuming you save the contents of the above script to augment-datagram-test.awk and set the execute bit, you can use the script as follows:

server1$ ./augment-datagram-test.awk <report-file>

The above command will generate a new file, named after the input file and the publisher with a .csv extension (<report-file>.<publisher>.csv), which contains the additional columns “interval duration secs”, “interval missing packets”, “interval drop rate”, “interval success rate” and “interval throughput mb/sec”. Despite the extension, the output is tab-delimited. The script produces one such file for each publisher present in the tab-delimited report. It also accepts multiple tab-delimited files as input, processing each one independently, and can accept input piped through stdin (with output going to stdout).
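
For readers who find the arithmetic easier to follow outside of awk, here is a minimal Python sketch of the same idea: subtract each cumulative report line from the previous one to get per-interval numbers. It is an illustration rather than a replacement for the script above. The function name per_interval, the Interval record, the sample rows and the 1468-byte packet size are all assumptions made up for this example and are not part of the Datagram Test.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    duration_secs: float
    missing: int
    drop_rate: float
    throughput_mb_sec: float


def per_interval(rows, packet_size):
    """Convert cumulative (duration_ms, sent, received, missing) rows
    into per-interval metrics by differencing consecutive rows."""
    prev = (0, 0, 0, 0)
    out = []
    for row in rows:
        d_ms, sent, received, missing = (cur - old for cur, old in zip(row, prev))
        if d_ms <= 0:
            continue  # duplicate report line, nothing new to measure
        prev = row
        # Guard against an interval in which no packets were sent
        drop_rate = missing / sent if sent > 0 else 0.0
        throughput = packet_size * received / (d_ms / 1000) / (1024 * 1024)
        out.append(Interval(d_ms / 1000, missing, drop_rate, throughput))
    return out


# Hypothetical lifetime metrics: all of the loss happens in the second interval
rows = [
    (1000, 100_000, 100_000, 0),
    (2000, 200_000, 190_000, 10_000),
    (3000, 300_000, 290_000, 10_000),
]
intervals = per_interval(rows, packet_size=1468)  # packet size is an assumed value
for i in intervals:
    print(f"{i.duration_secs:.1f}s missing={i.missing} drop={i.drop_rate:.2%}")
```

In this made-up data the lifetime drop rate after the third line is only about 3.3% (10,000 of 300,000), while the per-interval view shows that the second interval alone lost 10% of its packets. That is the kind of spike the lifetime numbers can hide.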

To actually generate the graphs, I use R. I encountered R earlier this year working with a customer, but didn’t have the chance to play around with it myself. Before I decided to use R, I was taking the output from my awk script and importing into a spreadsheet application and then generating graphs. This proved to be quite tedious and involved too many mouse clicks for my taste, so I turned to R to let me script the process and eliminate the need for a spreadsheet application altogether. R is also much more flexible when it comes to producing graphs, as you have complete control over the plot area. After a few days of playing around with R, I was able to come up with the following script to generate the graphs seen at the beginning of this post:

args <- commandArgs(TRUE)
for (file in args) {
    outfile <- paste(file, ".png", sep = "")
    cat("Plotting ", file, " as ", outfile, "\n", sep = "")

    # Read and process input file
    dgt     <- read.table(file, header = TRUE, sep = "\t")
    x       <- dgt$duration.ms / 1000
    y       <- dgt$interval.drop.rate * 100
    x.range <- c(0, max(x))
    y.range <- c(0, max(y, 20))
    nonzero <- which(y > 0)
    loss.intervals   <- (length(nonzero) / length(y)) * 100
    throughput.range <- c(0, max(dgt$interval.throughput.mb.sec, 120))
    title <- sub("\\.log\\.", " <- ", file)
    title <- sub("\\.csv", "", title)

    # Create plot as PNG
    png(filename = outfile, height = 400, width = 600, bg = "white")

    # Set margins to make room for right-side axis labels
    par(mar = c(7,5,4,5) + 0.1)

    # Plot packet loss line
    plot(x, y, type = "l", main = title, xlab = "Time (secs)", ylab = "Loss (%)",
            col = "blue", xlim = x.range, ylim = y.range, lwd = 2)

    # Circle points where packet loss > 0
    points(x[nonzero], y[nonzero], cex=1.5)

    # Plot throughput line, rescaled to share the loss axis
    lines(x, dgt$interval.throughput.mb.sec * (y.range[2] / throughput.range[2]),
            col = "green", lwd = 2)

    # Create right-side axis labels and tick marks
    axis(4, at = y.range[2] * c(0:4) / 4,
            labels = (throughput.range[2] / 4) * c(0:4))
    mtext("Throughput (MB/s)", side = 4, line = 3)

    # Draw the background grid lines
    grid()

    # Report the number of intervals that experienced loss (as a %)
    mtext(sprintf("Intervals w/ Loss: %.2f%%", loss.intervals), side = 1,
            line = 3, adj = 1)

    # Create the legend at the bottom
    legend("bottom", inset = -0.4, c("loss", "throughput"),
            col = c("blue", "green"), lty = 1, lwd = 2, bty = "n", horiz = TRUE,
            xpd = TRUE)

    # Close the PNG
    dev.off()
}

Assuming you save the contents of the above script as plot-datagram.r, you can invoke the script as follows:

server1$ R -q --slave -f plot-datagram.r --args <csv-file>

The output from the above command will be a new PNG file, named after the input file with .png appended, which contains a graph of both packet loss and throughput over the duration of the test. The circles indicate intervals where packet loss occurred. This script can also accept multiple files as input, generating a graph for each in a separate file.
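
Two pieces of arithmetic in the R script are easy to miss: the “Intervals w/ Loss” figure, and the rescaling that lets throughput share the loss axis. The Python sketch below reproduces both on a handful of made-up numbers. The function names and sample values are hypothetical and exist only for this illustration. The floors of 20 (%) and 120 (MB/s) mirror the minimum axis ranges used in the R script.

```python
def loss_interval_pct(loss_pct):
    """Percentage of reporting intervals in which any packet loss occurred."""
    if not loss_pct:
        return 0.0
    lossy = sum(1 for value in loss_pct if value > 0)
    return 100.0 * lossy / len(loss_pct)


def to_loss_axis(throughput, loss_axis_max, throughput_axis_max):
    """Rescale throughput values so they can be drawn against the loss axis."""
    scale = loss_axis_max / throughput_axis_max
    return [value * scale for value in throughput]


loss_pct = [0.0, 10.0, 0.0, 2.5]         # hypothetical per-interval loss (%)
throughput = [110.0, 60.0, 120.0, 90.0]  # hypothetical per-interval MB/s

# Same floors as the R script: at least 0..20 % and 0..120 MB/s
loss_axis_max = max(max(loss_pct), 20)
throughput_axis_max = max(max(throughput), 120)

print(loss_interval_pct(loss_pct))
print(to_loss_axis(throughput, loss_axis_max, throughput_axis_max))
```

Because the throughput line is drawn in loss-axis units, the right-hand axis labels have to be computed separately from the throughput range, which is what the axis(4, ...) call in the R script does.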

With both scripts in hand, generating graphs to visualize packet loss from the output of the Datagram Test can be done in a few seconds:

server1$ ./augment-datagram-test.awk *.log
server1$ R -q --slave -f plot-datagram.r --args *.csv

December 13, 2010 - Posted by | General


  1. The log file on each node looks like this:
    publisher duration ms packet size throughput mb/sec throughput packets/sec sent packets received packets missing packets success rate out of order avg out of order offset gaps avg gap size avg gap time ms avg ack ms

    Looks like the datagram test fails….how can I proceed with this?

    Comment by NID | September 8, 2011 | Reply

  2. I am getting the following errors:
    awk: /home/coreserv/augment-datagram-test.awk:68: (FILENAME=uat4_150_binding_p2.log FNR=1672) fatal: division by zero attempted

    If i remove the offending line, the script progresses but fails at another line. is this a known issues?

    Comment by N | September 29, 2011 | Reply
