Monitoring distributed systems with Google Big Query and R

In this blog post I’m going to explain how we use Google’s Big Query for storing and mining log data, and then use the open source statistical programming language, R, to run statistical tests to determine if new code is behaving correctly.

At Nomanini we believe that Continuous Delivery is a key business differentiator that, over the past three years, has allowed us to catch up and overtake some of our more established competitors.

This year we have increased our delivery rate from 10 production releases per month in January to over 50 releases in September, without increasing our engineering head count. We’ve also scaled from a few hundred devices to almost a thousand. Manually upgrading and monitoring each device is now impossible.

Even with thousands of unit and automated acceptance tests running in CI there was always some apprehension when it came to rolling out new code to the point of sale devices in the field. Broken code is going to prevent sales or worse, turn a device in to an expensive brick.

Rolling out firmware to devices in the field

After a new version of firmware runs the gauntlet of automated tests, it is promoted from CI to Alpha then to Beta status. Our users and their devices are by default subscribed to our Stable upgrade channel, but we’ve convinced a few of them to be Beta testers and their terminals are subscribed to the Beta upgrade channel. (Thanks, Chromium Project for this great idea.) As soon as code is promoted to Beta, devices on the Beta channel begin upgrading to the new firmware.

To limit any potential impact we rate-limit our roll-out’s by only upgrading one device every two minutes. Once a device has been told to upgrade, it logs in to the server, downloads the new binaries, and when the merchant isn’t using the terminal, seamlessly boots up the new version of firmware.

Here we have two versions of Master code (our internal name for firmware), one in Beta and the older version in Stable.

How can we be sure that the new Beta version of firmware is not broken in some subtle way that wasn’t detected during automated testing? And how do we know when it’s okay to promote a Beta version to Stable so that all terminals receive the upgrade?

The process that we use is:
1. Monitor metrics on both the Stable and Beta embedded firmware while running in the field.
2. Upload the metrics from the devices to our server (Google App Engine) and save to a database. (Google Big Query)
3. Extract the metrics in to a statistical analysis tool. (R or RStudio)
4. Run statistical tests to determine if there is any difference between the Stable and Beta releases.
5. If there is no difference then there is no reason not to promote Beta code to Stable.

In-field monitoring

As with most things embedded, in-the-field monitoring is surprisingly challenging. Terminals are battery operated and use the often unreliable GSM/GPRS network to connect back to our servers. We only have a few hundred kB of flash available for logs. And in some countries data over GSM costs more than $1 per MB, sometimes as much as$10 per MB. So we can’t just upload raw log files or fire off UDP packets like you might do in a data centre.

Apple’s iPad and iPhone usage statistics (Settings - General - About - Diagnostics & Usage - Diagnostics & Usage Data) aggregate key:value metrics over a full day. It looks like they have histogram type counts (like backlight brightness, with 10% buckets); and counters for number of events, seconds, power in mA.

An excerpt from one of my daily iPad log files:
 com.apple.power.backlight.values.000_010   5387   com.apple.power.backlight.values.010_020   162   com.apple.power.backlight.values.020_030   11898   com.apple.power.backlight.values.030_040   52714   com.apple.power.backlight.values.040_050   0   com.apple.power.backlight.values.050_060   0   com.apple.power.backlight.values.060_070   0   com.apple.power.backlight.values.070_080   0   com.apple.power.backlight.values.080_090   0   com.apple.power.backlight.values.090_100   0      com.apple.power.state.charging.count   1   com.apple.power.state.charging.duration   18898   com.apple.power.state.charging.energy   -20573   com.apple.power.state.overall.average_length   33698   com.apple.power.state.overall.average_power   49  
Taking that as inspiration, we wrote a small C++ library that can be called by application code that wants to be instrumented. Counts are saved in a C++ key:value map. At midnight UTC the map is serialised to JSON and uploaded to our server.

The JSON data packet looks like this:
{  "counts": {    "ERROR": 7,    "WARNING": 1475,    "INFO": 19622,    "DEBUG": 362754,    "[E].EventsManager.423": 2,    "[E].GPSManager.259": 1,    "[E].SlaveCommsDispatcher.158": 2,    "[E].SlaveCommsDispatcher.311": 1,    "[E].SlaveCommsDispatcher.395": 1,    "CSQ.BitErrors.0": 42,    "CSQ.BitErrors.1": 1,    "CSQ.BitErrors.3": 2,    "CSQ.BitErrors.5": 1,    "CSQ.SignalStrength.6-11": 18,    "CSQ.SignalStrength.12-17": 12,    "CSQ.SignalStrength.18-23": 15,    "CSQ.SignalStrength.24-29": 1,    "GPRS.TimeToConnect.0-20": 2  },  "firmwareVersion": "4264-548923b591c6",  "startTime": "2014-09-22 00:00:01.152",  "endTime": "2014-09-23 00:00:06.574"}
There are several things going on here:

The ERROR, WARNING, INFO and DEBUG counts are the total number of debug lines that have been logged by the terminal while the code is running.

For each ERROR or CRITICAL line that is logged the library makes a key in the format [loglevel].filename.linenumber, such as [E].GPSManager.259 and increments the count for that key. We also increment the global ERROR counter.

Logging the filename and line number tells us the region of code that is causing errors. Even without a stack trace we have a good idea what caused the problem. Also, this is not meant to detect detailed errors on particular devices, but rather detect similar errors across many devices so that we can detect buggy versions.

We also use the logging library to build up interesting histograms: An example is the CSQ, or signal strength, a value that ranges from 0 (0%) to 32 (100%). Each time we read the CSQ we increment the correct bucket. This is used for testing whether changes in antenna placement on different hardware revisions has improvement signal quality across those devices.

Histograms are also used for timing events: GPRS.TimeToConnect.0-20: 2 means that it took between 0 and 20 seconds both times that the modem connected to the GPRS network. Since there are no other buckets it is implied that all GPRS connections were shorter than 20s.

By default, at midnight (UTC) devices send up their counts in a JSON packet and the counters are reset back to 0. The server streams this data to a table in Google Big Query along with the unique ID of the device that it came from.

To detect problems earlier, Beta devices upload their statistics more often (every 2, 4, 8 or 12 hours) so that we get diagnostics sooner after a new Beta release starts rolling out.

Streaming diagnostic data to Google Big Query

The terminal uploads the JSON diagnostic packet to our application running on Google App Engine during the course of its normal connections to our server. Once the data arrives on the server, it is packaged in to a push task that connects to Google Big Query’s streaming insert API, where it is inserted in to our diagnostics event_log table.

Creating a View in Big Query

Because all diagnostics events, not just the counters are streamed in to this table, I created a View to pull out only the diagnostic counter data, simplifying downstream processing.

To create a View through the Big Query web UI, compose a query as usual and then click Save View. It will ask for the Project, Dataset to save the view in, and the Table ID, which is the table name of the view. Once saved, the view can be queried as any other table query.

I used the following SQL to create my view:
 SELECT     JSON_EXTRACT_SCALAR(event_data, '$.firmwareVersion') AS firmware_version, device_id, SUM(INTEGER(JSON_EXTRACT_SCALAR(event_data, '$.counts.CRITICAL'))) AS critcals,     SUM(INTEGER(JSON_EXTRACT_SCALAR(event_data, '$.counts.ERROR'))) AS errors, SUM(INTEGER(JSON_EXTRACT_SCALAR(event_data, '$.counts.DEBUG'))) AS debugs,     1e6 * SUM(INTEGER(JSON_EXTRACT_SCALAR(event_data,'$.counts.ERROR'))) / SUM(INTEGER(JSON_EXTRACT_SCALAR(event_data, '$.counts.DEBUG'))) AS error_rate   FROM [nomanini.event_log]   WHERE event = 'Counters'   GROUP BY firmware_version, device_id   ORDER BY firmware_version DESC, device_id DESC;  
In the event_log table, the counters were stored as a JSON object in the event_data column. Another great feature of Big Query SQL is JSON_EXTRACT_SCALAR that allows you to extract data out of a JSON object by specifying the key path.

From the event_log table I've got the device_id, a unique, per terminal identifier, and from the JSON in the event_data column I've pulled out the firmware_version, critical, error and debug counts.

I summed the counts using GROUP BY firmware_version, device_id so I get the total number of errors that each device produced on each version of firmware. I've also calculate the error_rate, which is defined as the number of errors per million debug lines, or error/debug * 1e6. (More on why, later.)

Querying this view:
 SELECT * FROM [nomanini.firmware_error_rates] LIMIT 10;  
Gives:

Getting the data from Google Big Query in to R

There’s a fantastic bigrquery package in R's extensive CRAN repository which allows you to run SQL queries from within R on Big Query and get the results in an R data frame.

R will start the OAuth process and open your browser for authentication. The query runs and the results are saved in to a result data frame.

To query the firmware_error_rates view in Big Query from within R, I run the following code:
 library(bigrquery)      # The two versions that we want to compare: Beta and Stable    firmware_version_beta = "4264-548923b591c6"   firmware_version_stable = "4252-c2b7961f0a5b"      # The base SQL statement   sql_query_a = "SELECT firmware_version, device_id, critcals, errors, debugs, error_rate FROM [nomanini-dashboard:nomanini.firmware_error_rates] WHERE firmware_version = '"   sql_query_b = "' ORDER BY device_id LIMIT 100;"      # Create the SQL strings, concatenate using paste   sql_beta <- paste(sql_query_a, firmware_version_beta, sql_query_b, sep="")   sql_stable <- paste(sql_query_a, firmware_version_stable, sql_query_b, sep="")      # Run the queries for both the old and new versions   v_beta <- query_exec("nomanini-dashboard", "nomanini", sql_beta)   v_stable <- query_exec("nomanini-dashboard", "nomanini", sql_stable)      # Fill NA's (null's) with 1 so can log transform the data   v_beta[is.na(v_beta)] <- 1   v_stable[is.na(v_stable)] <- 1      # Join two data frames by the common key variable device_id   # only device_id’s that are in both v_stable and v_beta will be in the merged data frame   merged <- merge(v_beta, v_stable, by="device_id", suffixes = c(".beta",".stable"))  
The code runs two queries on the view, one for the Beta and one for the Stable version; replaces any null's with 1's (so we can log transform the data, more on that later); and then joins the two result sets so that only devices that were on both the the Stable and the Beta version are included in the merged data frame.

Detecting a bad versions of code using the Student t-test and the Wilcoxon Signed Rank test

We want to detect if a Beta release is 'bad' so that we can stop the roll-out of this new, broken code as early as possible. At the moment we have defined a release as 'bad' if, on average, it logs more errors than the current Stable release.

In the image below, two versions (the blue and green version) are presumed to be of similar quality because the devices on those versions have a similar number of errors. (Each square represents a device, and the x-axis is the number of errors a device had while on that version.)

The devices on the red version have a lot more errors and is assumed to mean that there is a problem with that version.

To test this statistically we use Student's paired sample t-test because we are comparing two versions of code, version_old and version_new, and the same devices are in both version_old and version_new groups because each device was on version_old and then a few days later was upgraded to version_new. This builds two datasets but with the same population (devices) in both samples.

The hypothesis that we are testing is:
• H0: The null hypothesis: "There is no difference between the distributions of the error rates of the two versions."
• Ha: The alternative hypothesis: "The new version has a different error rate distribution from the old version." (Note, the error rates could be better or worse, thus the two sided test.)

1. That more ERROR's are logged on poor releases than on good releases.
2. That the ratio of ERROR to DEBUG log lines (the E/D ratio) is higher for a poor release (a release that has an problem) than for good release.
3. That the E/D ratio for all devices on a version is similar. Or at least are normally distributed.
4. That the only reason the E/D ratio changes is due to code changes. Nothing external influences the number of ERROR's. (We know this is false because things like poor connectivity, flat battery, etc. cause errors to be logged, but hopefully there are relatively few of these outliers.)
5. That the mean of the E/D ratio changes significantly with a poor release and does not change significantly between releases of similar quality.
6. That running a paired t-test between two versions will detect changes in the sample mean of the E/D ratios.

But first we need to make some data transformations to make the data ‘better behaved’ so that our tests stand a better chance of finding a signal in the noise.

Change from absolute number of errors to an error rate to account for different usage patterns

If you’ve been paying attention you would’ve noticed that I changed from talking about errors to talking about error rate. This is because we need to normalise the absolute number of errors to account for the amount of usage a terminal gets. Terminals that are used a lot tend to have more errors caused by the environment, such as poor GSM and GPRS connectivity, running out of paper, flat batteries causing subsystems to shut down, and many other problems that happen in the field that aren’t related to firmware but should be logged, anyway.

By dividing the number of ERROR lines logged by the number of DEBUG lines logged and multiplying by 1 million, we get the number of errors per million debug lines, a number that allows comparison between devices whether they’ve been running on a version for one hour or one week.

Log-Normalise to pull in the outliers for the t-test

Since we have relatively few Beta testers I’ve decided not to remove outliers and throw away data. But this means that there’s a long tail with some distant outliers, and so the data is no longer normally distributed which violates one of the assumptions for the t-test.

To pull in this tail, a common transformation is to log transform the data. [1, 2] In the image below you can clearly see how the log transform made the distribution more normal.

I've chosen log base 10 (as opposed to the natural log, or base 2) so that each unit on the x-axis is an order of magnitude increase in the error rate.

You’ll see in the code snippet above that I set devices that had 0 errors (NA’s or nulls) to have 1 error so that taking the log still works. (And 0 errors per million debugs vs. 1 error per million debugs is equal as far as I'm concerned.)

Running the t-test in R

 > t.test(log10(merged$error_rate.beta), log10(merged$error_rate.stable), paired=TRUE)    Paired t-testdata:  log10(merged$error_rate.beta) and log10(merged$error_rate.stable)t = -1.7194, df = 13, p-value = 0.1092alternative hypothesis: true difference in means is not equal to 095 percent confidence interval: -0.9404635  0.1068907sample estimates:mean of the differences              -0.4167864 Geometric mean of the differences (base 10): 0.3830131 
(Note that because of the log transform the mean as quoted in the test is actually the geometric mean and not the arithmetic mean. [1])

Using the Wilcoxon Signed-Rank test

Probably more correctly, I run the Wilcoxon Signed-Rank test because, unlike the t-test it does not assume a normal distribution of error rates, which makes it more robust against outliers. For this test, then, I don’t need to log normalise the data. This also makes interpretation of the results a little easier.
 > wilcox.test(merged$error_rate.beta, merged$error_rate.stable, paired=TRUE, conf.int=TRUE)      Wilcoxon signed rank testdata:  merged$error_rate.beta and merged$error_rate.stableV = 30, p-value = 0.1726alternative hypothesis: true location shift is not equal to 095 percent confidence interval: -505.36438   55.72701sample estimates:(pseudo)median      -121.2072
The median is the change in the error rate between the two versions. Here, -121 means that the new, Beta version has 121 fewer errors per million debug lines than the current Stable version. And we can be 95% confident that the number of errors per million debug lines in the Beta version is somewhere between 505 fewer and 56 more than in the Stable version.

Because the p-value in both our tests is larger than the 0.05 significance level we cannot reject the null hypothesis, and have to accept that "There is no difference between the distributions of the error rates of the two versions” and therefore the code quality between versions is similar.

Based on this we'd promote the Beta code to Stable.

Note that we cannot say whether the code is truly good or bad, only that if we were happy with the old Stable version then we should be happy with the new Beta version.

Displaying the data

Perhaps easier to interpret than the statistical tests is the boxplot and stripchart. Each coloured square in the strip chart represents the error rate for a single device. The box plot shows the min and max, the first and third quartiles and the dark strip is the median. It also shows any outliers as circles. (The blue version has two outliers on the right.) I’ve vertically separated the plots for readability.

Here you can see that the strip chart as well as the box plots largely overlap each other in both versions, so both versions have similar error rates, visually backing up the statistical tests above.

Ok, so does this actually catch problems?

A few months ago we completed a major code refactor in one of our subsystems. We started rolling out the new Beta firmware, and while most of the devices behaved normally after the upgrade, on some devices the code broke because it didn't correctly deal with a corner case that only showed itself in the field, and then only on a few devices.

The errors were logged by the devices, uploaded to the server and detected when we ran these tests. The devices that exhibited the bug are clearly visible in the chart below clustered on the right as the red outliers.

Both the t-test and Wilcoxon test had p-value's below the 0.05 significance level, giving strong evidence that there was a difference between the versions.

 > t.test(log10(merged$error_rate.beta), log10(merged$error_rate.stable), paired=TRUE)        Paired t-testdata:  log10(merged$error_rate.beta) and log10(merged$error_rate.stable)t = 2.8624, df = 28, p-value = 0.007872alternative hypothesis: true difference in means is not equal to 095 percent confidence interval:0.1131325 0.6825117sample estimates:mean of the differences              0.3978221Geometric mean of the differences (base 10): 2.499321

 > wilcox.test(merged$error_rate.beta, merged$error_rate.stable, paired=TRUE, conf.int=TRUE)          Wilcoxon signed rank testdata:  merged$error_rate.beta and merged$error_rate.stableV = 318, p-value = 0.02906alternative hypothesis: true location shift is not equal to 095 percent confidence interval:     7.075123 14546.761849sample estimates:(pseudo)median      183.0312 

Based on this evidence we stopped the code roll-out. With the help of the [E].filename.linenumber couters we were able to determine what caused the problem, wrote a unit test that recreated the bug, fixed the code and rolled out the new version.

The future

There are many things that I'd like to do, mostly around automating the detection process more and presenting the devops team with a nice overview dashboard of how a roll-out of new firmware is going.

For this I'll probably run OpenCPU, a RESTful API for R, on an instance of Compute Engine so that R functions can be called by our production app running on AppEngine.

Some other ideas:
1. Run tests automatically every hour and flag outliers for investigation by support staff.
2. Automatically pause a roll-out if the statistical tests detect a significant change.
3. Monitor [E].filename.linenumber's that are strong indications of specific types of problems, not just the global ERROR count.
4. As we accelerate the number of releases to more than one firmware release per day we will need to increase the number of beta testers so that we can get test results sooner after beginning a roll-out.

References

The internet is awash with information on learning R and statistics in general. One of the best statistical textbooks I've ever read is Learning Statistics with R: A tutorial for psychology students and other beginners which the author has made available as a free pdf download. It's my go-to textbook every time I get myself confused.