A-Team Chronicles
New location for Coherence blogs: http://www.ateam-oracle.com/?cat=53
Visualizing the Coherence Datagram Test
The graph above was generated from the output of the Coherence Datagram Test utility. The Coherence Datagram Test is a tool that sends and receives UDP packets between two or more machines to evaluate the health and performance of the network between them. The above test was run for 100 seconds on two server-class machines connected to the same switch over 1 Gb Ethernet. The graph makes it clear that there is significant packet loss between the two machines. Here’s what the graph looks like on a healthy network:
The difference between the two graphs is very clear. The Coherence Production Checklist suggests running the test before deploying a Coherence application into a real environment. I have found that most users have a hard time interpreting the raw output of the test, and I think most users who have run the Datagram Test would agree that the above graphs are much easier to understand. The Datagram Test can help identify some types of network problems, such as packet loss, that could adversely affect a Coherence application. Here, I will describe how I generated these graphs, which can come in handy when analyzing a large number of test results.
The first step is to actually run the Datagram Test to generate report data:
server1$ java -server -cp coherence.jar com.tangosol.net.DatagramTest -local 192.168.1.100 -log 192.168.1.100.log -txDurationMs 100000 -polite 192.168.1.101
server2$ java -server -cp coherence.jar com.tangosol.net.DatagramTest -local 192.168.1.101 -log 192.168.1.101.log -txDurationMs 100000 192.168.1.100
The above pair of commands will run a bi-directional test for 100 seconds, generating a tab-delimited report in the file specified by -log. As of Coherence 3.6, the tab-delimited report contains lifetime metrics (aggregated since the test began), written every 100,000 received packets by default. For analyzing packet loss, it makes more sense to look at the metrics accumulated between reporting intervals rather than since the beginning of the test, because lifetime metrics can mask spikes that occur later in the test. Luckily, the per-interval metrics we need can be derived from the lifetime metrics. The following awk script will calculate the additional columns of interest (as well as fix a bug in the test where the data columns don’t align with the header columns due to two missing delimiters):
#!/usr/bin/awk -f

BEGIN {
    FS = "[\t\r\n]";
}

# Header line
/^publisher/ {
    if (FILENAME == "") {
        FILENAME = "stdin";
    } else {
        print("Processing " FILENAME);
    }
    gsub(/[\r\n]/, "", $0);
    header = sprintf("%s\tinterval duration secs\tinterval missing packets\tinterval drop rate\tinterval success rate\tinterval throughput mb/sec", $0);
    for (outfile in aOutfile) {
        close(aOutfile[outfile]);
    }
    delete aPrevSent;
    delete aPrevReceived;
    delete aPrevMissing;
    delete aPrevDurationMillis;
    delete aDurationOffset;
    delete aOutfile;
    next;
}

# Initialize prev values
aPrevSent[$1] == "" {
    aPrevSent[$1] = 0;
    aPrevReceived[$1] = 0;
    aPrevMissing[$1] = 0;
    aPrevDurationMillis[$1] = 0;
    aDurationOffset[$1] = 0;
    aOutfile[$1] = FILENAME "." substr($1, 2, length($1)) ".csv";
    if (aOutfile[$1] ~ /^stdin/) {
        print(header);
    } else {
        print(header) > aOutfile[$1];
    }
}

# Account for packet sequence restart
$2 < aPrevDurationMillis[$1] {
    aPrevSent[$1] = 0;
    aPrevReceived[$1] = 0;
    aPrevMissing[$1] = 0;
    aDurationOffset[$1] += aPrevDurationMillis[$1];
}

# Skip duplicate lines
$2 == aPrevDurationMillis[$1] {
    next;
}

{
    # Insert the missing tab after the first character of columns 11 and 13
    split($11, aOoo, /^[0-9]/);
    sOoo = sprintf("%s\t%s", substr($11, 1, 1), aOoo[2]);
    split($13, aGapMillis, /^[0-9]/);
    sGapMillis = sprintf("%s\t%s", substr($13, 1, 1), aGapMillis[2]);

    # Derive the per-interval values from the lifetime counters
    cIntervalDurationMillis = $2 - aPrevDurationMillis[$1];
    cIntervalSent = $6 - aPrevSent[$1];
    cIntervalReceived = $7 - aPrevReceived[$1];
    cIntervalMissing = $8 - aPrevMissing[$1];
    dflIntervalDropRate = cIntervalMissing / cIntervalSent;
    dflIntervalSuccessRate = 1 - dflIntervalDropRate;
    dflIntervalThroughput = (($3 * cIntervalReceived) / (cIntervalDurationMillis / 1000)) / (1024 * 1024);

    aPrevDurationMillis[$1] = $2;
    aPrevSent[$1] = $6;
    aPrevReceived[$1] = $7;
    aPrevMissing[$1] = $8;

    if (aOutfile[$1] ~ /^stdin/) {
        printf("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%.3f\t%d\t%f\t%f\t%d\n",
               $1, $2 + aDurationOffset[$1], $3, $4, $5, $6, $7, $8, $9, $10,
               sOoo, $12, sGapMillis, cIntervalDurationMillis / 1000,
               cIntervalMissing, dflIntervalDropRate, dflIntervalSuccessRate,
               dflIntervalThroughput);
    } else {
        printf("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%.3f\t%d\t%f\t%f\t%d\n",
               $1, $2 + aDurationOffset[$1], $3, $4, $5, $6, $7, $8, $9, $10,
               sOoo, $12, sGapMillis, cIntervalDurationMillis / 1000,
               cIntervalMissing, dflIntervalDropRate, dflIntervalSuccessRate,
               dflIntervalThroughput) > aOutfile[$1];
    }
}
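To see why the per-interval numbers matter, here is a minimal Python sketch of the same subtraction the awk script performs. It is not part of the original tooling. The function name `interval_metrics`, the row layout and the counter values are all hypothetical, chosen only to show how a late spike is diluted in the lifetime drop rate.

```python
def interval_metrics(rows):
    """Turn cumulative (lifetime) counters into per-interval metrics.

    Each row is (duration_ms, packet_size_bytes, sent, received, missing).
    duration_ms, sent, received and missing are totals since the test began.
    This row layout is an assumption for illustration, not the report format.
    """
    results = []
    prev_ms = prev_sent = prev_received = prev_missing = 0
    for ms, size, sent, received, missing in rows:
        d_ms = ms - prev_ms
        d_sent = sent - prev_sent
        d_received = received - prev_received
        d_missing = missing - prev_missing
        # Guard the divisions; the awk script assumes both divisors are non-zero.
        drop_rate = d_missing / d_sent if d_sent else 0.0
        mb_per_sec = (size * d_received) / (d_ms / 1000) / (1024 * 1024) if d_ms else 0.0
        results.append({
            "interval_secs": d_ms / 1000,
            "interval_missing": d_missing,
            "interval_drop_rate": drop_rate,
            "lifetime_drop_rate": missing / sent if sent else 0.0,
            "interval_mb_per_sec": mb_per_sec,
        })
        prev_ms, prev_sent, prev_received, prev_missing = ms, sent, received, missing
    return results


# Hypothetical lifetime counters: loss appears only in the third interval.
rows = [
    (1250, 1468, 100_000, 100_000, 0),
    (2500, 1468, 200_000, 200_000, 0),
    (3750, 1468, 300_000, 290_000, 10_000),
]
results = interval_metrics(rows)
for r in results:
    print(f"{r['interval_drop_rate']:.3f} (interval) vs {r['lifetime_drop_rate']:.3f} (lifetime)")
```

With these made-up counters the third interval loses 10% of its packets, while the lifetime figure for the same row is only about 3.3%. The longer the test has been running, the more a spike like this fades in the lifetime column.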
This script will take the output of the -log option and produce a new file. Assuming you save the contents of the above script to augment-datagram-test.awk and set the execute bit, you can use the script as follows:
server1$ ./augment-datagram-test.awk 192.168.1.101.log
The above command will generate a new file called 192.168.1.101.log.192.168.1.100:10000.csv which contains the additional columns “interval duration secs”, “interval missing packets”, “interval drop rate”, “interval success rate” and “interval throughput mb/sec”. The script will produce one csv file for each publisher present in the tab-delimited report. The script will also accept multiple tab-delimited files as input, processing each one independently, and can also accept input piped through stdin (with output going to stdout).
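The one-file-per-publisher behavior comes from how the awk script builds each output name: the input file name, a dot, the publisher column minus its first character, and ".csv". Here is a small Python sketch of that rule. The helper `split_by_publisher` is hypothetical, and the publisher values (with a leading "/") are an assumption inferred from the file name shown above, not copied from a real report.

```python
from collections import defaultdict


def split_by_publisher(log_name, rows):
    """Group report rows by the output file the awk script would write them to.

    Mirrors: FILENAME "." substr($1, 2, length($1)) ".csv"
    The publisher is taken from the first column of each row.
    """
    outputs = defaultdict(list)
    for row in rows:
        publisher = row[0]
        outputs[f"{log_name}.{publisher[1:]}.csv"].append(row)
    return dict(outputs)


# Assumed publisher format: a leading "/" followed by address:port.
rows = [
    ("/192.168.1.100:10000", 1250),
    ("/192.168.1.102:10000", 1300),
    ("/192.168.1.100:10000", 2500),
]
outputs = split_by_publisher("192.168.1.101.log", rows)
for name in sorted(outputs):
    print(name, len(outputs[name]))
```

A log that contains rows from two publishers therefore ends up as two csv files, each holding only the rows for its own publisher.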
To actually generate the graphs, I use R. I encountered R earlier this year working with a customer, but didn’t have the chance to play around with it myself. Before I decided to use R, I was taking the output from my awk script and importing into a spreadsheet application and then generating graphs. This proved to be quite tedious and involved too many mouse clicks for my taste, so I turned to R to let me script the process and eliminate the need for a spreadsheet application altogether. R is also much more flexible when it comes to producing graphs, as you have complete control over the plot area. After a few days of playing around with R, I was able to come up with the following script to generate the graphs seen at the beginning of this post:
args <- commandArgs(TRUE)

for (file in args) {
  outfile <- paste(file, ".png", sep = "")
  cat("Plotting ", file, " as ", outfile, "\n", sep = "")

  # Read and process input file
  dgt <- read.table(file, header = TRUE, sep = "\t")
  x <- dgt$duration.ms / 1000
  y <- dgt$interval.drop.rate * 100
  x.range <- c(0, max(x))
  y.range <- c(0, max(y, 20))
  nonzero <- which(y > 0)
  loss.intervals <- (length(nonzero) / length(y)) * 100
  throughput.range <- c(0, max(dgt$interval.throughput.mb.sec, 120))
  title <- sub("\\.log\\.", " <- ", file)
  title <- sub("\\.csv", "", title)

  # Create plot as PNG
  png(filename = outfile, height = 400, width = 600, bg = "white")

  # Set margins to make room for right-side axis labels
  par(mar = c(7,5,4,5) + 0.1)

  # Plot packet loss line
  plot(x, y, type = "l", main = title, xlab = "Time (secs)", ylab = "Loss (%)",
       col = "blue", xlim = x.range, ylim = y.range, lwd = 2)

  # Circle points where packet loss > 0
  points(x[nonzero], y[nonzero], cex=1.5)

  # Plot throughput line
  lines(x, dgt$interval.throughput.mb.sec * (y.range[2] / throughput.range[2]),
        col = "green", lwd = 2)

  # Create right-side axis labels and tick marks
  axis(4, at = y.range[2] * c(0:4) / 4, labels = (throughput.range[2] / 4) * c(0:4))
  mtext("Throughput (MB/s)", side = 4, line = 3)

  # Draw the background grid lines
  grid()

  # Report the number of intervals that experienced loss (as a %)
  mtext(sprintf("Intervals w/ Loss: %.2f%%", loss.intervals), side = 1, line = 3, adj = 1)

  # Create the legend at the bottom
  legend("bottom", inset = -0.4, c("loss", "throughput"), col = c("blue", "green"),
         lty = 1, lwd = 2, bty = "n", horiz = TRUE, xpd = TRUE)

  # Close the PNG
  dev.off()
}
Assuming you save the contents of the above script as plot-datagram.r, you can invoke the script as follows:
server1$ R -q --slave -f plot-datagram.r --args 192.168.1.101.log.192.168.1.100:10000.csv
The output from the above command will be a new file called 192.168.1.101.log.192.168.1.100:10000.csv.png which represents a graph of both packet loss and throughput over the duration of the test. The circles indicate intervals where packet loss occurred. This script can also accept multiple files as input, generating a graph for each in a separate file.
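Two calculations in the R script are easy to miss when reading it: the "Intervals w/ Loss" percentage, and the rescaling that lets the throughput line share the loss axis. The Python sketch below reproduces just that arithmetic, without any plotting. The function name `loss_summary` and the sample values are hypothetical; the floors of 20 and 120 mirror the constants in the R script.

```python
def loss_summary(drop_rates, throughputs, min_loss_axis=20, min_throughput_axis=120):
    """Reproduce the arithmetic behind the plot, without drawing anything.

    drop_rates are fractions (0.1 == 10%); throughputs are in MB/s.
    """
    loss_pct = [rate * 100 for rate in drop_rates]
    # Each axis extends to at least its floor, like max(y, 20) in the R script.
    loss_axis_max = max(max(loss_pct), min_loss_axis)
    throughput_axis_max = max(max(throughputs), min_throughput_axis)
    # Share of reporting intervals that saw any loss at all.
    intervals_with_loss = sum(1 for p in loss_pct if p > 0) / len(loss_pct) * 100
    # Throughput is drawn on the loss axis, so scale it into that range.
    scale = loss_axis_max / throughput_axis_max
    scaled = [t * scale for t in throughputs]
    # Right-side axis: five ticks, positioned in loss units, labelled in MB/s.
    tick_positions = [loss_axis_max * i / 4 for i in range(5)]
    tick_labels = [throughput_axis_max / 4 * i for i in range(5)]
    return intervals_with_loss, scaled, tick_positions, tick_labels


# Hypothetical data: one of four intervals dropped 10% of its packets.
pct, scaled, ticks, labels = loss_summary([0, 0, 0.1, 0], [112, 112, 100, 112])
print(f"Intervals w/ Loss: {pct:.2f}%")
```

Because both axes have a floor, a healthy run and a lossy run are drawn at comparable scales. A clean network shows a flat line at zero rather than a magnified view of noise.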
With both scripts in hand, generating graphs to visualize packet loss from the output of the Datagram Test can be done in a few seconds:
server1$ ./augment-datagram-test.awk *.log
server1$ R -q --slave -f plot-datagram.r --args *.csv
Hello world!
I’ve been meaning to start this blog for a while now, and the stars have finally aligned to let that happen. My intentions for this blog are to document findings, observations and generally interesting things about Oracle Coherence (and related technologies) discovered through practice.