Saturday, October 4, 2014

Monitoring distributed systems with Google BigQuery and R



In this blog post I’m going to explain how we use Google’s BigQuery for storing and mining log data, and then use R, the open source statistical programming language, to run statistical tests that determine whether new code is behaving correctly.

At Nomanini we believe that Continuous Delivery is a key business differentiator that, over the past three years, has allowed us to catch up and overtake some of our more established competitors.

This year we have increased our delivery rate from 10 production releases per month in January to over 50 releases in September, without increasing our engineering head count. We’ve also scaled from a few hundred devices to almost a thousand. Manually upgrading and monitoring each device is now impossible.

Even with thousands of unit and automated acceptance tests running in CI, there was always some apprehension when it came to rolling out new code to the point-of-sale devices in the field. Broken code prevents sales or, worse, turns a device into an expensive brick.


Rolling out firmware to devices in the field

After a new version of firmware runs the gauntlet of automated tests, it is promoted from CI to Alpha then to Beta status. Our users and their devices are by default subscribed to our Stable upgrade channel, but we’ve convinced a few of them to be Beta testers and their terminals are subscribed to the Beta upgrade channel. (Thanks, Chromium Project for this great idea.) As soon as code is promoted to Beta, devices on the Beta channel begin upgrading to the new firmware.

To limit any potential impact we rate-limit our roll-outs by only upgrading one device every two minutes. Once a device has been told to upgrade, it logs in to the server, downloads the new binaries, and when the merchant isn’t using the terminal, seamlessly boots up the new version of firmware.

Here we have two versions of Master code (our internal name for firmware), one in Beta and the older version in Stable.




How can we be sure that the new Beta version of firmware is not broken in some subtle way that wasn’t detected during automated testing? And how do we know when it’s okay to promote a Beta version to Stable so that all terminals receive the upgrade?

The process that we use is:
  1. Monitor metrics on both the Stable and Beta embedded firmware while running in the field.
  2. Upload the metrics from the devices to our server (Google App Engine) and save them to a database. (Google BigQuery)
  3. Extract the metrics into a statistical analysis tool. (R or RStudio)
  4. Run statistical tests to determine if there is any difference between the Stable and Beta releases.
  5. If there is no difference then there is no reason not to promote Beta code to Stable.


In-field monitoring

As with most things embedded, in-the-field monitoring is surprisingly challenging. Terminals are battery operated and use the often unreliable GSM/GPRS network to connect back to our servers. We only have a few hundred kB of flash available for logs. And in some countries data over GSM costs more than $1 per MB, sometimes as much as $10 per MB. So we can’t just upload raw log files or fire off UDP packets like you might do in a data centre.

Apple’s iPad and iPhone usage statistics (Settings - General - About - Diagnostics & Usage - Diagnostics & Usage Data) aggregate key:value metrics over a full day. It looks like they have histogram type counts (like backlight brightness, with 10% buckets); and counters for number of events, seconds, power in mA.

An excerpt from one of my daily iPad log files:
 com.apple.power.backlight.values.000_010      5387
 com.apple.power.backlight.values.010_020      162
 com.apple.power.backlight.values.020_030      11898
 com.apple.power.backlight.values.030_040      52714
 com.apple.power.backlight.values.040_050      0
 com.apple.power.backlight.values.050_060      0
 com.apple.power.backlight.values.060_070      0
 com.apple.power.backlight.values.070_080      0
 com.apple.power.backlight.values.080_090      0
 com.apple.power.backlight.values.090_100      0

 com.apple.power.state.charging.count          1
 com.apple.power.state.charging.duration       18898
 com.apple.power.state.charging.energy         -20573
 com.apple.power.state.overall.average_length  33698
 com.apple.power.state.overall.average_power   49
Taking that as inspiration, we wrote a small C++ library that can be called by application code that wants to be instrumented. Counts are saved in a C++ key:value map. At midnight UTC the map is serialised to JSON and uploaded to our server.

The JSON data packet looks like this:
{
  "counts": {
    "ERROR": 7,
    "WARNING": 1475,
    "INFO": 19622,
    "DEBUG": 362754,
    "[E].EventsManager.423": 2,
    "[E].GPSManager.259": 1,
    "[E].SlaveCommsDispatcher.158": 2,
    "[E].SlaveCommsDispatcher.311": 1,
    "[E].SlaveCommsDispatcher.395": 1,
    "CSQ.BitErrors.0": 42,
    "CSQ.BitErrors.1": 1,
    "CSQ.BitErrors.3": 2,
    "CSQ.BitErrors.5": 1,
    "CSQ.SignalStrength.6-11": 18,
    "CSQ.SignalStrength.12-17": 12,
    "CSQ.SignalStrength.18-23": 15,
    "CSQ.SignalStrength.24-29": 1,
    "GPRS.TimeToConnect.0-20": 2
  },
  "firmwareVersion": "4264-548923b591c6",
  "startTime": "2014-09-22 00:00:01.152",
  "endTime": "2014-09-23 00:00:06.574"
}
There are several things going on here:

The ERROR, WARNING, INFO and DEBUG counts are the total number of debug lines that have been logged by the terminal while the code is running.

For each ERROR or CRITICAL line that is logged, the library creates a key in the format [loglevel].filename.linenumber, such as [E].GPSManager.259, and increments the count for that key. We also increment the global ERROR counter.

Logging the filename and line number tells us the region of code that is causing errors; even without a stack trace we have a good idea what caused the problem. This is not meant to detect detailed errors on particular devices, but rather to detect similar errors across many devices, so that we can spot buggy versions.

We also use the logging library to build up interesting histograms. An example is the CSQ, or signal strength, a value that ranges from 0 (0%) to 32 (100%). Each time we read the CSQ we increment the correct bucket. This is used to test whether changes in antenna placement on different hardware revisions have improved signal quality across those devices.

Histograms are also used for timing events: GPRS.TimeToConnect.0-20: 2 means that it took between 0 and 20 seconds both times that the modem connected to the GPRS network. Since there are no other buckets it is implied that all GPRS connections were shorter than 20s.
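The library itself is C++, but the counting scheme is small enough to sketch. Here is a rough Python equivalent, not Nomanini's actual code; the class and method names are mine:

```python
import json
from collections import defaultdict

class DiagnosticCounters:
    """Sketch of the in-field metrics library: plain counters plus
    bucketed histograms, serialised to JSON for the daily upload."""

    def __init__(self):
        self.counts = defaultdict(int)

    def increment(self, key):
        self.counts[key] += 1

    def log_error(self, level, filename, lineno):
        # Per-call-site key, e.g. "[E].GPSManager.259", plus the global total
        self.increment("[%s].%s.%d" % (level, filename, lineno))
        self.increment("ERROR")

    def histogram(self, name, value, bucket_size):
        # e.g. a 13 s GPRS connect with 20 s buckets -> "GPRS.TimeToConnect.0-20"
        low = (value // bucket_size) * bucket_size
        self.increment("%s.%d-%d" % (name, low, low + bucket_size))

    def to_json(self):
        # At upload time: serialise the counts, then reset them to zero
        packet = {"counts": dict(self.counts)}
        self.counts.clear()
        return json.dumps(packet)

c = DiagnosticCounters()
c.log_error("E", "GPSManager", 259)
c.histogram("GPRS.TimeToConnect", 13, 20)
```

In the real firmware the map lives in C++ and is serialised and reset at midnight UTC.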

By default, at midnight (UTC) devices send up their counts in a JSON packet and the counters are reset back to 0. The server streams this data to a table in Google BigQuery along with the unique ID of the device that it came from.

To detect problems earlier, Beta devices upload their statistics more often (every 2, 4, 8 or 12 hours) so that we get diagnostics sooner after a new Beta release starts rolling out.


Streaming diagnostic data to Google BigQuery

The terminal uploads the JSON diagnostic packet to our application running on Google App Engine during the course of its normal connections to our server. Once the data arrives on the server, it is packaged into a push task that connects to Google BigQuery’s streaming insert API, where it is inserted into our diagnostics event_log table.
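The App Engine task-queue and streaming-insert plumbing aside, the shape of an inserted row is straightforward. A sketch of just the row construction (the builder function and device ID are mine; the column names device_id, event and event_data match the view SQL below):

```python
import json

def make_event_log_row(device_id, packet):
    """Package a device's diagnostic packet as one event_log row:
    the raw JSON goes into event_data, tagged with the source device."""
    return {
        "device_id": device_id,
        "event": "Counters",  # the view filters on this event type
        "event_data": json.dumps(packet),
    }

row = make_event_log_row(
    "dev-001",  # hypothetical device ID
    {"counts": {"ERROR": 7, "DEBUG": 362754},
     "firmwareVersion": "4264-548923b591c6"})
```

In production the push task hands this row to BigQuery's streaming insert API.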




Creating a View in BigQuery

Because all diagnostic events, not just the counters, are streamed into this table, I created a View to pull out only the diagnostic counter data, simplifying downstream processing.

To create a View through the BigQuery web UI, compose a query as usual and then click Save View. It will ask for the Project and Dataset to save the view in, and the Table ID, which is the view’s name. Once saved, the view can be queried like any other table.

I used the following SQL to create my view:
 SELECT  
   JSON_EXTRACT_SCALAR(event_data, '$.firmwareVersion') AS firmware_version,
   device_id,  
   SUM(INTEGER(JSON_EXTRACT_SCALAR(event_data, '$.counts.CRITICAL'))) AS critcals,  
   SUM(INTEGER(JSON_EXTRACT_SCALAR(event_data, '$.counts.ERROR'))) AS errors,  
   SUM(INTEGER(JSON_EXTRACT_SCALAR(event_data, '$.counts.DEBUG'))) AS debugs,  
   1e6 * SUM(INTEGER(JSON_EXTRACT_SCALAR(event_data,'$.counts.ERROR'))) / SUM(INTEGER(JSON_EXTRACT_SCALAR(event_data, '$.counts.DEBUG'))) AS error_rate  
 FROM [nomanini.event_log]  
 WHERE event = 'Counters'  
 GROUP BY firmware_version, device_id  
 ORDER BY firmware_version DESC, device_id DESC;  
In the event_log table, the counters are stored as a JSON object in the event_data column. Another great feature of BigQuery SQL is JSON_EXTRACT_SCALAR, which lets you extract data out of a JSON object by specifying the key path.

From the event_log table I've got the device_id, a unique per-terminal identifier, and from the JSON in the event_data column I've pulled out the firmware_version and the critical, error and debug counts.

I summed the counts using GROUP BY firmware_version, device_id so I get the total number of errors that each device produced on each version of firmware. I've also calculated the error_rate, which is defined as the number of errors per million debug lines, or error/debug * 1e6. (More on why, later.)

Querying this view:
 SELECT * FROM [nomanini.firmware_error_rates] LIMIT 10;  
Gives:


Getting the data from Google BigQuery into R

There’s a fantastic bigrquery package in R's extensive CRAN repository which allows you to run SQL queries on BigQuery from within R and get the results back as an R data frame.

R will start the OAuth process and open your browser for authentication. The query runs and the results are saved into a result data frame.

To query the firmware_error_rates view in BigQuery from within R, I run the following code:
 library(bigrquery)

 # The two versions that we want to compare: Beta and Stable
 firmware_version_beta <- "4264-548923b591c6"
 firmware_version_stable <- "4252-c2b7961f0a5b"

 # The base SQL statement
 sql_query_a <- "SELECT firmware_version, device_id, critcals, errors, debugs, error_rate FROM [nomanini-dashboard:nomanini.firmware_error_rates] WHERE firmware_version = '"
 sql_query_b <- "' ORDER BY device_id LIMIT 100;"

 # Create the SQL strings, concatenating with paste
 sql_beta <- paste(sql_query_a, firmware_version_beta, sql_query_b, sep="")
 sql_stable <- paste(sql_query_a, firmware_version_stable, sql_query_b, sep="")

 # Run the queries for both the old and new versions
 v_beta <- query_exec("nomanini-dashboard", "nomanini", sql_beta)
 v_stable <- query_exec("nomanini-dashboard", "nomanini", sql_stable)

 # Fill NAs (nulls) with 1 so that we can log transform the data
 v_beta[is.na(v_beta)] <- 1
 v_stable[is.na(v_stable)] <- 1

 # Join the two data frames on the common key variable device_id;
 # only device_ids present in both v_stable and v_beta will be in the merged data frame
 merged <- merge(v_beta, v_stable, by="device_id", suffixes = c(".beta",".stable"))
The code runs two queries on the view, one for the Beta and one for the Stable version; replaces any nulls with 1's (so we can log transform the data, more on that later); and then joins the two result sets so that only devices that were on both the Stable and the Beta version are included in the merged data frame.
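For readers who don't speak R, the same NA-fill-and-inner-join logic sketched in plain Python (the function name and row shape are mine):

```python
def merge_by_device(beta_rows, stable_rows):
    """Inner join on device_id, mirroring R's merge() with
    suffixes .beta/.stable; missing error rates are filled with 1."""
    fill = lambda rate: 1 if rate is None else rate
    stable_by_id = {row["device_id"]: row for row in stable_rows}
    merged = []
    for row in beta_rows:
        match = stable_by_id.get(row["device_id"])
        if match is not None:  # devices seen on only one version are dropped
            merged.append({
                "device_id": row["device_id"],
                "error_rate.beta": fill(row["error_rate"]),
                "error_rate.stable": fill(match["error_rate"]),
            })
    return merged
```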

Detecting a bad version of code using the Student t-test and the Wilcoxon Signed Rank test

We want to detect if a Beta release is 'bad' so that we can stop the roll-out of this new, broken code as early as possible. At the moment we have defined a release as 'bad' if, on average, it logs more errors than the current Stable release.

In the image below, two versions (the blue and green version) are presumed to be of similar quality because the devices on those versions have a similar number of errors. (Each square represents a device, and the x-axis is the number of errors a device had while on that version.)

The devices on the red version have far more errors, which we take to mean that there is a problem with that version.


To test this statistically we use Student's paired-sample t-test, because we are comparing two versions of code, version_old and version_new, on the same devices: each device ran version_old and then, a few days later, was upgraded to version_new. This gives two samples drawn from the same population of devices.

The hypothesis that we are testing is:
  • H0: The null hypothesis: "There is no difference between the distributions of the error rates of the two versions."
  • Ha: The alternative hypothesis: "The new version has a different error rate distribution from the old version." (Note, the error rates could be better or worse, thus the two sided test.) 

I’ve made some assumptions here:

  1. That more ERRORs are logged on poor releases than on good releases.
  2. That the ratio of ERROR to DEBUG log lines (the E/D ratio) is higher for a poor release (one that has a problem) than for a good release.
  3. That the E/D ratios for all devices on a version are similar, or at least normally distributed.
  4. That the only reason the E/D ratio changes is code changes, and nothing external influences the number of ERRORs. (We know this is false because things like poor connectivity, flat batteries, etc. cause errors to be logged, but hopefully there are relatively few of these outliers.)
  5. That the mean of the E/D ratio changes significantly with a poor release and does not change significantly between releases of similar quality.
  6. That running a paired t-test between two versions will detect changes in the sample mean of the E/D ratios.
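Assumption 6 is the heart of the method. To make the mechanics concrete, here is the paired t statistic computed by hand in Python with made-up per-device values; the p-value, which R reports, comes from the t distribution with n-1 degrees of freedom and is omitted here:

```python
import math
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic: t = mean(d) / (sd(d) / sqrt(n)),
    where d holds the per-device differences."""
    d = [x - y for x, y in zip(xs, ys)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical log10 error rates for five devices on Beta and Stable:
beta = [1.2, 2.1, 1.7, 2.4, 1.9]
stable = [1.5, 2.0, 1.8, 2.6, 2.2]
t = paired_t(beta, stable)  # negative: Beta logs fewer errors in this sample
```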


But first we need to make some data transformations to make the data ‘better behaved’ so that our tests stand a better chance of finding a signal in the noise.


Change from absolute number of errors to an error rate to account for different usage patterns

If you’ve been paying attention you would’ve noticed that I changed from talking about errors to talking about error rate. This is because we need to normalise the absolute number of errors to account for the amount of usage a terminal gets. Terminals that are used a lot tend to have more errors caused by the environment, such as poor GSM and GPRS connectivity, running out of paper, flat batteries causing subsystems to shut down, and many other problems that happen in the field that aren’t related to firmware but should be logged, anyway.

By dividing the number of ERROR lines logged by the number of DEBUG lines logged and multiplying by 1 million, we get the number of errors per million debug lines, a number that allows comparison between devices whether they’ve been running on a version for one hour or one week.
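As a quick sanity check of the arithmetic (the helper function is mine, not from the post's codebase):

```python
def error_rate(errors, debugs):
    """Errors per million debug lines: normalises for how much a device ran."""
    return 1e6 * errors / debugs

# The packet above logged 7 ERRORs against 362,754 DEBUG lines (~19 errors
# per million); a device that ran ten times longer with the same code
# quality gets the same rate:
assert error_rate(7, 362754) == error_rate(70, 3627540)
```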


Log-Normalise to pull in the outliers for the t-test

Since we have relatively few Beta testers I’ve decided not to remove outliers and throw away data. But this means that there’s a long tail with some distant outliers, and so the data is no longer normally distributed which violates one of the assumptions for the t-test.




To pull in this tail, a common transformation is to log transform the data. [1, 2] In the image below you can clearly see how the log transform made the distribution more normal.




I've chosen log base 10 (as opposed to the natural log, or base 2) so that each unit on the x-axis is an order of magnitude increase in the error rate.

You’ll see in the code snippet above that I set devices that had 0 errors (NA’s or nulls) to have 1 error so that taking the log still works. (And 0 errors per million debugs vs. 1 error per million debugs is equal as far as I'm concerned.)



Running the t-test in R

 > t.test(log10(merged$error_rate.beta), log10(merged$error_rate.stable), paired=TRUE)
   
 Paired t-test

data:  log10(merged$error_rate.beta) and log10(merged$error_rate.stable)
t = -1.7194, df = 13, p-value = 0.1092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.9404635  0.1068907
sample estimates:
mean of the differences 
             -0.4167864 

Geometric mean of the differences (base 10): 0.3830131 
(Note that because of the log transform the mean as quoted in the test is actually the geometric mean and not the arithmetic mean. [1])
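Concretely, the back-transform is just raising 10 to the reported mean (numbers from the R output above):

```python
# Mean of the log10 differences, as reported by the paired t-test:
mean_log_diff = -0.4167864

# Because the test ran on log10 data, 10**mean is the geometric mean of
# the per-device ratios (Beta rate / Stable rate), not an arithmetic mean.
geometric_mean = 10 ** mean_log_diff  # ~0.383, matching the output above
```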


Using the Wilcoxon Signed-Rank test

Probably more correctly, I run the Wilcoxon Signed-Rank test because, unlike the t-test, it does not assume a normal distribution of error rates, which makes it more robust against outliers. For this test, then, I don’t need to log-normalise the data, which also makes the results a little easier to interpret.
 > wilcox.test(merged$error_rate.beta, merged$error_rate.stable, paired=TRUE, conf.int=TRUE)  
   
 Wilcoxon signed rank test

data:  merged$error_rate.beta and merged$error_rate.stable
V = 30, p-value = 0.1726
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -505.36438   55.72701
sample estimates:
(pseudo)median 
     -121.2072
The median is the change in the error rate between the two versions. Here, -121 means that the new, Beta version has 121 fewer errors per million debug lines than the current Stable version. And we can be 95% confident that the number of errors per million debug lines in the Beta version is somewhere between 505 fewer and 56 more than in the Stable version.
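The statistic itself is easy to compute by hand. Here is a rough Python sketch of the signed-rank statistic V (ties get integer ranks here rather than averaged ranks, and the p-value, which R derives from the exact distribution of V, is omitted):

```python
def signed_rank_V(xs, ys):
    """Wilcoxon signed-rank statistic: drop zero differences, rank the
    absolute differences, and sum the ranks of the positive ones."""
    d = [x - y for x, y in zip(xs, ys) if x != y]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if d[i] > 0)
```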

Because the p-value in both our tests is larger than the 0.05 significance level we cannot reject the null hypothesis, and have to accept that "There is no difference between the distributions of the error rates of the two versions” and therefore the code quality between versions is similar.

Based on this we'd promote the Beta code to Stable.

Note that we cannot say whether the code is truly good or bad, only that if we were happy with the old Stable version then we should be happy with the new Beta version.


Displaying the data

Perhaps easier to interpret than the statistical tests is the boxplot and stripchart. Each coloured square in the strip chart represents the error rate for a single device. The box plot shows the min and max, the first and third quartiles and the dark strip is the median. It also shows any outliers as circles. (The blue version has two outliers on the right.) I’ve vertically separated the plots for readability.




Here you can see that the strip chart as well as the box plots largely overlap each other in both versions, so both versions have similar error rates, visually backing up the statistical tests above.


Ok, so does this actually catch problems?

A few months ago we completed a major code refactor in one of our subsystems. We started rolling out the new Beta firmware, and while most of the devices behaved normally after the upgrade, on some devices the code broke because it didn't correctly deal with a corner case that only showed itself in the field, and then only on a few devices.

The errors were logged by the devices, uploaded to the server and detected when we ran these tests. The devices that exhibited the bug are clearly visible in the chart below clustered on the right as the red outliers.




Both the t-test and the Wilcoxon test had p-values below the 0.05 significance level, giving strong evidence that there was a difference between the versions.

 > t.test(log10(merged$error_rate.beta), log10(merged$error_rate.stable), paired=TRUE)
   
     Paired t-test

data:  log10(merged$error_rate.beta) and log10(merged$error_rate.stable)
t = 2.8624, df = 28, p-value = 0.007872
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1131325 0.6825117
sample estimates:
mean of the differences
              0.3978221

Geometric mean of the differences (base 10): 2.499321

 > wilcox.test(merged$error_rate.beta, merged$error_rate.stable, paired=TRUE, conf.int=TRUE)  
   
     Wilcoxon signed rank test

data:  merged$error_rate.beta and merged$error_rate.stable
V = 318, p-value = 0.02906
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
     7.075123 14546.761849
sample estimates:
(pseudo)median
      183.0312 

Based on this evidence we stopped the code roll-out. With the help of the [E].filename.linenumber counters we were able to determine what caused the problem, wrote a unit test that recreated the bug, fixed the code and rolled out the new version.

The future

There are many things that I'd like to do, mostly around automating the detection process more and presenting the devops team with a nice overview dashboard of how a roll-out of new firmware is going. 

For this I'll probably run OpenCPU, a RESTful API for R, on a Compute Engine instance so that R functions can be called by our production app running on App Engine.

Some other ideas:
  1. Run tests automatically every hour and flag outliers for investigation by support staff.
  2. Automatically pause a roll-out if the statistical tests detect a significant change.
  3. Monitor [E].filename.linenumber's that are strong indications of specific types of problems, not just the global ERROR count.
  4. As we accelerate the number of releases to more than one firmware release per day we will need to increase the number of beta testers so that we can get test results sooner after beginning a roll-out.

References

The internet is awash with information on learning R and statistics in general. One of the best statistical textbooks I've ever read is Learning Statistics with R: A tutorial for psychology students and other beginners which the author has made available as a free pdf download. It's my go-to textbook every time I get myself confused.

Sunday, May 11, 2014

Parking at Cape Town International Airport: Cost Calculator

Parking at airports is expensive and Cape Town International is no different. Every time I return from a trip I'm surprised by how much my parking cost. Sometimes it's more expensive than the flight.

I made a quick chart of parking price vs days parked for various options, including getting an Uber Taxi or MyCiti bus from Cape Town CBD to the airport.

Cape Town International Airport parking cost chart

While the MyCiti bus is the cheapest, I want a door-to-door service (either my own car or a taxi). But if you're on a budget then MyCiti is the clear winner (about R50.00 each way).

Results


  • Shaded parking in P3, P4 is cheapest up to 5 days.
  • Long Stay parking in P5 is cheapest up to 12 days.
  • And then Uber to the airport and back is the best option.
Interestingly, the Katanga Valet parking (which includes a wash, vacuum and polish) is not much more expensive than the Garage parking (in P1, P2). Having used them before, I'd recommend it, especially if you are late and just want to dump your car and run to check-in.
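The comparison behind the chart is simple arithmetic: model each option as a fixed cost plus a per-day rate and take the minimum. A sketch with made-up tariffs (the real 2014 prices differ; only the structure matters):

```python
def cheapest_option(days, options):
    """options maps name -> (fixed_cost, per_day_rate); returns the
    name of the cheapest option for a stay of the given length."""
    return min(options, key=lambda name: options[name][0] + options[name][1] * days)

# Hypothetical tariffs in Rand, for illustration only:
options = {
    "Shaded P3/P4": (0, 100),    # pay per day, no fixed cost
    "Long Stay P5": (150, 40),   # cheaper per day once you're in
    "Uber return":  (500, 0),    # flat fare, your car stays at home
}
```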


Wednesday, May 7, 2014

Interns: How to improve your CV

tl;dr: Make your CV easy to parse and ensure it stands out from all the other CVs that I receive. I am looking for passion and a history of self-directed learning.

Last Saturday I attended #breaktherules hosted by the University of Cape Town Developer Society to help place Comp. Sci. student interns at local companies.

Even though Nomanini is a small company (compared to the likes of Amazon and Oracle who were also there) we received well over 100 CVs and Gerrit and I spoke to more than 50 hopeful students.

Most of the CVs that we received were useless as a recruiting tool.

In my follow-up emails to everyone who gave me their CV I included some pointers, which I thought I would expand here.

My biggest piece of advice: Make your CV easy to parse and ensure it stands out from all the other CVs that I receive. I am looking for passion and a history of self-directed learning.

Things which help are:

  • Clearly show your degree and major and your current year of study. (Don't make me work it out based on when you started, or when you hope to graduate.)
  • Make it easy to find your name, email address and cell number. Add a photo to help me recall who you are from all the conversations that I had.
  • Write a paragraph on personal projects that you have worked on outside of your course work. Include links to your code and applications.
  • Describe the sort of work that you'd like to do during your internship as well as what you hope to gain from an internship. 


Less important (to me) is

  • listing every course/module that you have done or your high school results. (Highlighting subjects where you got 90%+ is fine, but I really don't care that you got 63% for "Word Power" two years ago.)
  • your references. If I need it I'll ask.


Really useless info: your age, your health status (always "Excellent"), marital status and driver's license.

I also hand out my business card at these events. If you really want to join my company email me afterwards, attach your CV, write a cover letter telling me why you are a perfect fit.

Remember, you are competing with your peers who largely have the same skills and experience that you do. Use any advantage to differentiate yourself.

Sunday, February 2, 2014

8tracks iPod Shuffle


I love 8tracks. And I love listening to my music while running or in the gym. Up to now I've used 8tracks’ fantastic iPhone app while out, and their website and iPad app when at home or work.

Streaming music over my phone's 3G is expensive, and strapping a phone to my arm is uncomfortable (and frankly looks silly).

Also, Apple's matchbook sized iPod Shuffle, which holds 2GB of music, is a thing of beauty.

I wanted to marry the two and clearly I'm not the only one thinking along these lines: Check out Alecsandru Grigoriu's fan art product brochure.


This weekend I decided to write a simple 8tracks client in Python that saves the audio files locally for later upload to my Shuffle.

I’m using the 8tracks API and the Python Requests library, which vastly simplifies HTTP comms. (I can’t believe I used to use httplib for this stuff…)

The client is its own user on 8tracks which happens to follow another user (me) and also just happens to play all the mixes in my iPod Shuffle collection. The client is well behaved: it downloads the tracks only as fast as if it were really playing them and reports the songs as played at the 30s mark, as per the API docs.
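The interesting part is the pacing, which doesn't need the network at all. A sketch of the schedule a well-behaved client follows (the actual API calls, auth and download code live in the BitBucket repo; this function is illustrative only):

```python
def playback_schedule(durations, report_at=30):
    """Given track durations in seconds, return the (event, track, time)
    tuples a polite client fires: report each song as played at the 30 s
    mark, and only move to the next track once the current one 'finishes'."""
    events, clock = [], 0
    for i, duration in enumerate(durations):
        if duration >= report_at:
            events.append(("report_play", i, clock + report_at))
        clock += duration  # download no faster than real-time playback
        events.append(("next_track", i, clock))
    return events
```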

For anyone who’s interested I’ve got the code on BitBucket at https://bitbucket.org/dalehumby/8tracks-shuffle/. Feel free to contribute.

Thursday, July 25, 2013

Take me to Funky Town

I don't know why, or who started it, but ever since the all-nighters at Regus, Century City, with 8 developers in a 10 square metre room, Lipps Inc's 'Funky Town' has been our theme tune.

We've been threatening for the better part of two years to get our device to play that tune. That story is forever being deprioritised. But no more. Today we have an official command in our software stack called play_funky_town.



It was a quick dev spike to test the piezoelectric buzzer on our next revision of hardware where we changed from a self-driving, through-hole mounted buzzer to one that needs a 3 kHz square wave from the microcontroller. It also works at frequencies from 100 Hz to about 10 kHz, plenty for Funky Town!
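The firmware is C++, but the timer maths is language-agnostic: to make a square wave, the microcontroller toggles the buzzer pin twice per period. A quick back-of-the-envelope check in Python:

```python
def half_period_us(freq_hz):
    """Microseconds between pin toggles for a square wave at freq_hz:
    the period is 1e6/freq_hz us and the pin toggles twice per period."""
    return 1e6 / (2 * freq_hz)

# 3 kHz (the buzzer's sweet spot) needs a toggle roughly every 167 us;
# the 100 Hz bottom of the range is a leisurely 5 ms between toggles.
```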

And we can do other, more serious stuff: when we lose a device we can find it by triggering the play_funky_town command from the server, and it bursts into life, complete with a disco LED light sequence on the keypad.

The original awesomeness that is Funky Town:


Thursday, April 18, 2013

Scaleconf 2013

I gave a talk at Scaleconf about how Nomanini uses Continuous Deployment (and TDD and CI) across our full stack of hardware, embedded firmware and server backend code.





Quick plug: We're hiring. If you're interested in joining our team, let me know.

Monday, December 17, 2012

Almost 10 mechanical museums to visit with an autistic, boy-genius in Cape Town

Close friends of mine have a high-functioning autistic son of 11 who is crazy about anything mechanical. I'd say he was obsessed, but his mom says that obsessed is just a word that lazy people use to describe someone who's dedicated. William's particularly dedicated to Technic Lego and Meccano.

As a kid I remember going on plenty of outings with my parents to museums with a mechanical theme. Even 20 years later I could probably still describe their layout, how they worked and why. There was never any doubt in my mind that I wanted to study mechanical engineering and work on big machines.



Josephine Mill, Newlands

An operating water-wheel powered stone mill. The flour they produce is used to bake fresh bread. Loads of belts, pulleys, gears, shafts and moving parts to keep Will happy.





Steam train, Cape Town to Simons Town

After riding a steam train all I wanted to be 'when I grew up' was a train driver. Little did I know that the romantic age of steam had long passed. To me there is nothing more raw and mechanical than steam powered machines. Babbage had it right when he designed a steam powered computer.

Kleinplasie, Worcester

Just over an hour's drive from Cape Town, the Worcester Museum has an industrial agricultural display, including my favourite: a donkey-powered water pump. (Bonus points: fly in to the Worcester airstrip, visit the museum, have some lunch and fly out.)


Mostert's Mill, de Waal drive

Another stone mill, this one powered by wind and sail. Much simpler than Josephine's mill but no less fascinating to watch. Open and working on the third Saturday of each month (wind permitting).

Water tunnel tours under Cape Town, Castle of Good Hope

Okay, so not really mechanical, but something I wanted to do anyway. Old water tunnels that used to carry water from the mountain to feed the Company Gardens. A new perspective on the history of Cape Town.



SAS Somerset, V&A Waterfront

Not currently open to the public, but open during school holidays for children's sleepovers. Once a coal-burning ship, it was converted to burn oil. The engine room is dingy, cramped, smells of oil and is jam-packed full of pipes and brass fittings, just as it should be.

Fort Wynyard, Green Point

Also closed to the public, but hopefully opening soon. (Best call the Simons Town Museum for details.) I remember being able to sit on and manoeuvre the anti-aircraft guns. My brother, cousins and I used to hand crank the ack-acks raining imagined destruction on Table Bay.

Athlone power station, Athlone

My uncle worked for Koeberg and managed to get his son and me a tour of the Paarden Eiland coal power station shortly before it was demolished in the '90s to make way for a container depot. The one at Athlone is likely almost identical inside. Wandering amongst the coal conveyors, boilers, turbines and generator halls was a surreal and fascinating experience. (I wonder if any coal power stations up-country are open for tours?)

Do you know about other mechanical themed outings for William? Let me know and we will take him.