Whitebait, Kleftiko, Ekmek Special: 2014

Friday, 19 December 2014

Optimising Squirrel for speed

TL;DR: avoiding slow Squirrel constructs in favour of faster ones, makes this program run four times faster.

Electric Imp’s little embedded WiFi gadgets are programmed in a scripting language called Squirrel. Like Javascript or Python, Squirrel is compiled to bytecode and executed in an interpreter. This brings lots of advantages – notably, that the interpreter acts as a sandbox and allows crashes and other bugs to be debugged remotely – but sometimes, in embedded systems, the performance consequences of that design can be an issue.

So it’s helpful to know how to write Squirrel that runs as fast as possible on the Electric Imp platform. Electric Imp’s own website contains some generic guidance, but it’s helpful to see it applied to a real example: in this case the impoclock, a ring of sixty WS2812B “Neopixels” implementing an analogue clock.

You can follow the story as a Github gist: starting with the baseline, which consists mostly of the stock NeoPixels class from the Electric Imp examples repository. (And perfectly good example code it is too: the optimisation methods in this post will make the code faster, but also less readable and so less suitable to learn from or adapt.) You can also sneak a peek at the finished version as a Github gist.

0. Baseline

As always, there’s no point trying to optimise something you can’t measure: optimisations applied blindly could as likely be unhelpful as helpful. So the first patch does nothing except add code to time one iteration of the main displaytime routine, and also to log the available memory:

@@ -132,6 +132,8 @@ pixel <- NeoPixels(spi, NUM_PIXELS)
 
 server.log(imp.getsoftwareversion() + " / " + imp.getmacaddress())
 
+server.disconnect();
+
 background <- [0,0,0]
 
 function proximity(a, b, limit) {
@@ -177,4 +179,9 @@ function displaytime(){
   imp.wakeup(0.25, displaytime)
 }
 
+local getus = hardware.micros.bindenv(hardware);
+local us = getus();
 displaytime()
+us = getus() - us;
+local freemem = imp.getmemoryfree();
+server.log(us+" "+freemem);

(The version history is also visible as Gist history, each section of this post as a separate commit.) The code disconnects from the server before taking the measurements: this helps with getting consistent results, as no WiFi interrupts can change the timings, nor any network buffers change the memory usage. The baseline measurements are:

Time/μs	Code size	Free memory
49,373	6.58%	60,712b

That’s about 50,000 microseconds, or 50 milliseconds, or a twentieth of a second to update sixty Neopixels: actually plenty for an analogue clock with a one-second resolution, but other Neopixel applications might be more demanding.

1. Global variables

So let’s get amongst it. The main thing about optimising Squirrel, is that string lookups are expensive. In Squirrel, table members are accessed by string lookup – and that includes members of the global table. However, local variables are accessed more rapidly by (integer) stack offset – and that includes locals of the implicit top-level function. That suggests that it would make a difference just to turn some of the global variables into locals:

@@ -128,15 +128,15 @@ const NUM_PIXELS = 60
 
 spi <- hardware.spi257
 spi.configure(MSB_FIRST, 7500)
-pixel <- NeoPixels(spi, NUM_PIXELS)
+local pixel = NeoPixels(spi, NUM_PIXELS)
 
 server.log(imp.getsoftwareversion() + " / " + imp.getmacaddress())
 
 server.disconnect();
 
-background <- [0,0,0]
+local background = [0,0,0];
 
-function proximity(a, b, limit) {
+local function proximity(a, b, limit) {
 //  return a == b
   local diff = math.abs(a-b)
   if (diff > 60-limit) {

And indeed this does make a measurable difference: it’s already almost ten per cent faster.

	Time/μs	Code size	Free memory
Baseline	49,373	6.58%	60,712b
Global variables	44,666	6.57%	60,752b

2. Optimising writePixel

String lookups definitely seem like a profitable area of investigation. So let’s look at all the string lookups done inside the writePixel method: there are lots, both to look up frame and bits in the surrounding class, and to look up the seek and writeblob members of the blob object.

All of these can be replaced by references to local variables. Squirrel allows functions to refer to local variables in enclosing functions: this can be used to collect together a collection of values plus a function, in such a way that the function always gets called with those values available. In this patch, the variables frameSeeker, frameWriter, and refBits are local variables of the NeoPixels constructor: this means that under normal circumstances, they would evaporate at the end of the constructor call. But here, because they’re referenced inside the function assigned to this.pixelWriter, they are “captured” as “outers”, their lifetime continues, and they remain available to all future calls to the pixelWriter function. The whole collection – the function plus its collected-together variables – is called a closure, and the function is described as having closed over the collected-together variables; similar concepts (differing sometimes in details) exist in other dynamic languages, such as Javascript.

This is probably the least idiomatic and most confusing optimisation in the whole post, but it’s an important one and is worth taking the time to understand. Access to local variables, even “outers”, is really fast – and the new pixelWriter, unlike the old writePixel, refers only to local variables.

@@ -51,6 +51,8 @@ class NeoPixels {
  frameSize       = null; // number of pixels per frame
  frame           = null; // a blob to hold the current frame
 
+ pixelWriter     = null;
+
  // _spi - A configured spi (MSB_FIRST, 7.5MHz)
  // _frameSize - Number of Pixels per frame
     
@@ -66,6 +68,18 @@ class NeoPixels {
         
   clearFrame();
   writeFrame();
+
+  local frameSeeker = frame.seek.bindenv(frame);
+  local frameWriter = frame.writeblob.bindenv(frame);
+  local refBits = bits;
+  this.pixelWriter = function(p, color) {
+      frameSeeker(p*BYTESPERPIXEL);
+
+      // red and green are swapped for some reason, so swizzle them back
+      frameWriter(refBits[color[1]]);
+      frameWriter(refBits[color[0]]);
+      frameWriter(refBits[color[2]]);
+  }
  }
     
  // fill the array of representative 1-wire waveforms. 
@@ -99,13 +113,7 @@ class NeoPixels {
  // color is an array of the form [r, g, b]
     
  function writePixel(p, color) {
-  frame.seek(p*BYTESPERPIXEL);
-        
-  // red and green are swapped for some reason, so swizzle them back 
-        
-  frame.writeblob(bits[color[1]]);
-  frame.writeblob(bits[color[0]]);
-  frame.writeblob(bits[color[2]]);    
+     pixelWriter(p, color);
  }
     
  // Clears the frame buffer

The results justify all that effort: the code is now more than 25% faster than when we started. (It uses a bit more memory and code space, but for many purposes that’s a fair exchange.)

	Time/μs	Code size	Free memory
Baseline	49,373	6.58%	60,712b
Global variables	44,666	6.57%	60,752b
Optimising writePixel	36,400	7.06%	60,048b

3. Optimising writeFrame

The same trick applies to the writeFrame method. In fact, the original method can be replaced completely: notice that client code which calls pixel.writeFrame() doesn’t need to change. The syntax for calling class-members which are functions, is the same as for calling methods.

@@ -52,6 +52,7 @@ class NeoPixels {
  frame           = null; // a blob to hold the current frame
 
  pixelWriter     = null;
+ writeFrame      = null;
 
  // _spi - A configured spi (MSB_FIRST, 7.5MHz)
  // _frameSize - Number of Pixels per frame
@@ -61,6 +62,15 @@ class NeoPixels {
   this.frameSize = _frameSize;
   this.frame = blob(frameSize*BYTESPERPIXEL + 1);
   this.frame[frameSize*BYTESPERPIXEL] = 0;
+
+  local spiWriter = _spi.write.bindenv(_spi);
+  local refFrame = frame;
+
+                // writes the frame buffer to the pixel strip
+                // ie - this function changes the pixel strip
+  this.writeFrame = function() {
+      spiWriter(refFrame);
+  }
         
   // prepare the bits array and the clearblob blob
         
@@ -123,13 +133,6 @@ class NeoPixels {
   frame.seek(0);
   for (local p = 0; p < frameSize; p++) frame.writeblob(clearblob);
  }
-    
- // writes the frame buffer to the pixel strip
- // ie - this function changes the pixel strip
-    
- function writeFrame() {
-  spi.write(frame);
- }
 }
 
 const NUM_PIXELS = 60

As the example code only calls writeFrame once per iteration – as opposed to the sixty times it calls writePixel – the results are a lot less dramatic. Still a slight improvement, though. The remaining methods of class NeoPixels aren’t called even once per frame – just at startup – so there’s little point optimising them.

	Time/μs	Code size	Free memory
Baseline	49,373	6.58%	60,712b
Global variables	44,666	6.57%	60,752b
Optimising writePixel	36,400	7.06%	60,048b
Optimising writeFrame	36,396	7.17%	59,908b

4. Optimising Neopixel calls

The faster pixelWriter function sped up writePixels without changing the API to class NeoPixels. But there’s further speed gains to be had by changing the client code to use it directly. The same is true, to a lesser extent, of writeFrame.

No bindenv is needed for pixelWriter, because it wasn’t a member-function to start with – just a plain free-standing function.

Also in this patch, string lookups for the class-member BYTESPERPIXEL are replaced with the use of a Squirrel const. Consts, like locals, are fast. (This part of the patch really belonged in Stage 2, but I was confused by the SHOUTY name into thinking that BYTESPERPIXEL was already a constant.)

The patch follows the convention that constants belonging to a particular class – which can’t, in Squirrel, be declared as consts inside the class – are given names starting with the class’s name. The same could also be done for the other unchanging values, ONE and NONE – but they’re only used at startup, so the update speed wouldn’t be improved.

@@ -1,3 +1,6 @@
+
+const NEOPIXELS_BYTESPERPIXEL = 24;
+
 class NeoPixels {
     
  // Driver for the World Semi WS2812 - aka Adafruit NeoPixel
@@ -33,7 +36,6 @@ class NeoPixels {
     
  ZERO            = 0xC0;
  ONE             = 0xF8;
- BYTESPERPIXEL   = 24;
     
  // when instantiated, the neopixel class will fill this array with blobs to 
  // represent the waveforms to send the numbers 0 to 255. This allows the blobs to be
@@ -60,8 +62,8 @@ class NeoPixels {
  constructor(_spi, _frameSize) {
   this.spi = _spi;
   this.frameSize = _frameSize;
-  this.frame = blob(frameSize*BYTESPERPIXEL + 1);
-  this.frame[frameSize*BYTESPERPIXEL] = 0;
+  this.frame = blob(frameSize*NEOPIXELS_BYTESPERPIXEL + 1);
+  this.frame[frameSize*NEOPIXELS_BYTESPERPIXEL] = 0;
 
   local spiWriter = _spi.write.bindenv(_spi);
   local refFrame = frame;
@@ -83,7 +85,7 @@ class NeoPixels {
   local frameWriter = frame.writeblob.bindenv(frame);
   local refBits = bits;
   this.pixelWriter = function(p, color) {
-      frameSeeker(p*BYTESPERPIXEL);
+      frameSeeker(p*NEOPIXELS_BYTESPERPIXEL);
 
       // red and green are swapped for some reason, so swizzle them back 
       frameWriter(refBits[color[1]]);
@@ -100,7 +102,7 @@ class NeoPixels {
         
   bits = array(256);
   for (local i = 0; i < 256; i++) {
-   local valblob = blob(BYTESPERPIXEL / 3);
+   local valblob = blob(NEOPIXELS_BYTESPERPIXEL / 3);
    valblob.writen((i & 0x80) ? ONE:ZERO,'b');
    valblob.writen((i & 0x40) ? ONE:ZERO,'b');
    valblob.writen((i & 0x20) ? ONE:ZERO,'b');
@@ -113,7 +115,7 @@ class NeoPixels {
   }
         
   // now fill the clearblob
-  for(local j = 0; j < BYTESPERPIXEL; j++) {
+  for(local j = 0; j < NEOPIXELS_BYTESPERPIXEL; j++) {
    clearblob.writen(ZERO, 'b');
   }
  }
@@ -140,6 +142,8 @@ const NUM_PIXELS = 60
 spi <- hardware.spi257
 spi.configure(MSB_FIRST, 7500)
 local pixel = NeoPixels(spi, NUM_PIXELS)
+local pixelWriter = pixel.pixelWriter;
+local frameWriter = pixel.writeFrame;
 
 server.log(imp.getsoftwareversion() + " / " + imp.getmacaddress())
 
@@ -184,9 +188,9 @@ function displaytime(){
       colour[1] = 0
       colour[2] = 100
     }
-    pixel.writePixel(i,colour)
+    pixelWriter(i,colour)
   }  
-  pixel.writeFrame()
+  frameWriter();
   imp.wakeup(0.25, displaytime)
 }

Even this small change makes another 6% difference.

	Time/μs	Code size	Free memory
Baseline	49,373	6.58%	60,712b
Global variables	44,666	6.57%	60,752b
Optimising writePixel	36,400	7.06%	60,048b
Optimising writeFrame	36,396	7.17%	59,908b
Optimising Neopixel calls	34,043	7.15%	60,060b

5. Removing array references

In Squirrel, arrays such as colour are reference-counted, heap-allocated objects. That sounds a bit expensive: perhaps it’d be faster to pass the three elements (red, green and blue) around separately.

(This was also the motivation for making pixelWriter separate from writePixel: this way, the existing API of class NeoPixels isn’t changed.)

@@ -84,13 +84,13 @@ class NeoPixels {
   local frameSeeker = frame.seek.bindenv(frame);
   local frameWriter = frame.writeblob.bindenv(frame);
   local refBits = bits;
-  this.pixelWriter = function(p, color) {
+  this.pixelWriter = function(p, r,g,b) {
       frameSeeker(p*NEOPIXELS_BYTESPERPIXEL);
 
       // red and green are swapped for some reason, so swizzle them back 
-      frameWriter(refBits[color[1]]);
-      frameWriter(refBits[color[0]]);
-      frameWriter(refBits[color[2]]);
+      frameWriter(refBits[g]);
+      frameWriter(refBits[r]);
+      frameWriter(refBits[b]);
   }
  }
     
@@ -125,7 +125,7 @@ class NeoPixels {
  // color is an array of the form [r, g, b]
     
  function writePixel(p, color) {
-     pixelWriter(p, color);
+     pixelWriter(p, color[0], color[1], color[2]);
  }
     
  // Clears the frame buffer
@@ -170,25 +170,27 @@ function displaytime(){
     hourled = hourled - 60
   }
   for(local i = 0; i<NUM_PIXELS; i++){
-    local colour = [background[0], background[1], background[2]]
+    local red = background[0];
+    local green = background[1];
+    local blue = background[2];
     local hourprox = proximity(i, hourled, 3.0)
     if (hourprox != 0) {
-      colour[0] = (100 * hourprox * hourprox).tointeger()
-      colour[1] = (100 * hourprox * hourprox).tointeger()
-      colour[2] = 0
+      red = (100 * hourprox * hourprox).tointeger()
+      green= (100 * hourprox * hourprox).tointeger()
+      blue = 0
     }
     local minprox = proximity(i, now.min, 2.0)
     if (minprox != 0) {
-      colour[0] = 0
-      colour[1] = (100 * minprox * minprox).tointeger()
-      colour[2] = (100 * minprox * minprox).tointeger()
+      red = 0
+      green = (100 * minprox * minprox).tointeger()
+      blue = (100 * minprox * minprox).tointeger()
     }
     if (proximity(i, now.sec, 0)) {
-      colour[0] = 100
-      colour[1] = 0
-      colour[2] = 100
+      red = 100
+      green = 0
+      blue = 100
     }
-    pixelWriter(i,colour)
+    pixelWriter(i,red,green,blue);
   }  
   frameWriter();
   imp.wakeup(0.25, displaytime)

And yes, it turns out heap objects are slower than the alternative.

	Time/μs	Code size	Free memory
Baseline	49,373	6.58%	60,712b
Global variables	44,666	6.57%	60,752b
Optimising writePixel	36,400	7.06%	60,048b
Optimising writeFrame	36,396	7.17%	59,908b
Optimising Neopixel calls	34,043	7.15%	60,060b
Removing array references	31,876	7.13%	60,044b

6. Global functions

Back on the warpath against string lookups. And there are several still to squash: for instance, the call to math.abs looks up the string “math” in the global table to find the maths library, then looks up the string “abs” in that table (well, actually the maths library is a special table-like object) to find the function to call.

This patch also hoists out the lookup of “date” in the global table, “imp” in the global table, and “wakeup” in the imp table – though, as those were only called once per update, the impact from those is less.

The rule of thumb is: any time you see a dot between two names, such as in “math.abs”, then that’s a string lookup. And any time you see a reference to something from the global table, such as the imp object or date function, then that’s a string lookup too. Accesses to local variables – whether those of the current function or “outers” from any function enclosing it – are not string lookups.

@@ -151,9 +151,12 @@ server.disconnect();
 
 local background = [0,0,0];
 
+local mabs = math.abs.bindenv(math);
+local getdate = date;
+local impwaker = imp.wakeup.bindenv(imp);
+
 local function proximity(a, b, limit) {
-//  return a == b
-  local diff = math.abs(a-b)
+  local diff = mabs(a-b);
   if (diff > 60-limit) {
     diff = diff - (60 - limit) + 1
   }
@@ -164,7 +167,7 @@ local function proximity(a, b, limit) {
 }
 
 function displaytime(){
-  local now = date()
+  local now = getdate()
   local hourled = (now.hour + 0) * 5 + (now.min / 12.0).tointeger()
   if (hourled > 59) {
     hourled = hourled - 60
@@ -193,7 +196,7 @@ function displaytime(){
     pixelWriter(i,red,green,blue);
   }  
   frameWriter();
-  imp.wakeup(0.25, displaytime)
+  impwaker(0.25, displaytime)
 }
 
 local getus = hardware.micros.bindenv(hardware);

Because math.abs gets called several times per pixel, the speed-up is quite significant: the time for the entire update is halved. The small increases in code size and memory usage are easily forgiven.

	Time/μs	Code size	Free memory
Baseline	49,373	6.58%	60,712b
Global variables	44,666	6.57%	60,752b
Optimising writePixel	36,400	7.06%	60,048b
Optimising writeFrame	36,396	7.17%	59,908b
Optimising Neopixel calls	34,043	7.15%	60,060b
Removing array references	31,876	7.13%	60,044b
Global functions	15,563	7.31%	59,752b

7. Hoist table lookups

The few remaining string lookups in displaytime are mostly references to members of the table returned by Squirrel’s date() function. As they’re used over and over again inside the loop, it should be faster to do each lookup just once, before entering the loop: in the compiler literature, this is called loop hoisting.

@@ -166,9 +166,13 @@ local function proximity(a, b, limit) {
   return 0
 }
 
-function displaytime(){
+local displaytime;
+displaytime = function(){
   local now = getdate()
-  local hourled = (now.hour + 0) * 5 + (now.min / 12.0).tointeger()
+  local hour = now.hour;
+  local min  = now.min;
+  local sec  = now.sec;
+  local hourled = (hour + 0) * 5 + (min / 12.0).tointeger()
   if (hourled > 59) {
     hourled = hourled - 60
   }
@@ -182,13 +186,13 @@ function displaytime(){
       green= (100 * hourprox * hourprox).tointeger()
       blue = 0
     }
-    local minprox = proximity(i, now.min, 2.0)
+    local minprox = proximity(i, min, 2.0)
     if (minprox != 0) {
       red = 0
       green = (100 * minprox * minprox).tointeger()
       blue = (100 * minprox * minprox).tointeger()
     }
-    if (proximity(i, now.sec, 0)) {
+    if (proximity(i, sec, 0)) {
       red = 100
       green = 0
       blue = 100

And it’s shaved a tiny bit more off the time.

	Time/μs	Code size	Free memory
Baseline	49,373	6.58%	60,712b
Global variables	44,666	6.57%	60,752b
Optimising writePixel	36,400	7.06%	60,048b
Optimising writeFrame	36,396	7.17%	59,908b
Optimising Neopixel calls	34,043	7.15%	60,060b
Removing array references	31,876	7.13%	60,044b
Global functions	15,563	7.31%	59,752b
Hoist table lookups	14,657	7.38%	60,344b

8. Other optimisations

Right, the gloves are off now. Anything goes that will make this code faster, and readability be gosh-darned.

First of all, we can get rid of all the string lookups in the date structure by just using Squirrel’s time() call instead, which is just a few integer divisions away from giving us the right answer. (Or, in my case, the wrong answer – the code below incorporates a bugfix which in the Gist only appears in a later commit.)

Sort-of similarly, it turns out that open-coding the equivalent of the math.abs function is fractionally faster than actually calling it.

Then we can hoist the lookups in the background array out of the loop. These are integer lookups – not the expensive string sort – but every little helps. (And if you’re thinking, “Why not just get rid of background altogether?”, then the answer is that the actual impoclock has an agent.on() method for setting the background colour, which isn’t shown here because it’s not relevant to the optimisations.)

One final micro-optimisation is to observe that a pixel showing the second hand can’t also be showing the minute hand, and one showing the minute hand can’t also be showing the hour hand. So the ifs in the loop can be if..elses, and we can save about a gnat’s crotchet of time by skipping calculations which we know we aren’t going to use.

@@ -151,12 +151,14 @@ server.disconnect();
 
 local background = [0,0,0];
 
-local mabs = math.abs.bindenv(math);
-local getdate = date;
+local gettime = time;
 local impwaker = imp.wakeup.bindenv(imp);
 
 local function proximity(a, b, limit) {
-  local diff = mabs(a-b);
+  local diff = a-b;
+  if (diff < 0) {
+      diff = -diff;
+  }
   if (diff > 60-limit) {
     diff = diff - (60 - limit) + 1
   }
@@ -168,34 +170,32 @@ local function proximity(a, b, limit) {
 
 local displaytime;
 displaytime = function(){
-  local now = getdate()
-  local hour = now.hour;
-  local min  = now.min;
-  local sec  = now.sec;
+  local now = gettime()
+  local hour = (now/3600)%12;
+  local min  = (now/60)%60;
+  local sec  = now%60;
   local hourled = (hour + 0) * 5 + (min / 12.0).tointeger()
   if (hourled > 59) {
     hourled = hourled - 60
   }
+  local bgred = background[0];
+  local bggreen = background[1];
+  local bgblue = background[2];
   for(local i = 0; i<NUM_PIXELS; i++){
-    local red = background[0];
-    local green = background[1];
-    local blue = background[2];
-    local hourprox = proximity(i, hourled, 3.0)
-    if (hourprox != 0) {
-      red = (100 * hourprox * hourprox).tointeger()
-      green= (100 * hourprox * hourprox).tointeger()
-      blue = 0
-    }
-    local minprox = proximity(i, min, 2.0)
-    if (minprox != 0) {
-      red = 0
-      green = (100 * minprox * minprox).tointeger()
-      blue = (100 * minprox * minprox).tointeger()
-    }
+    local red = bgred, green = bggreen, blue = bgblue;
+    local hourprox, minprox;
     if (proximity(i, sec, 0)) {
-      red = 100
-      green = 0
-      blue = 100
+        red = 100
+        green = 0
+        blue = 100
+    } else if ((hourprox = proximity(i, hourled, 3.0)) != 0) {
+        red = (100 * hourprox * hourprox)
+        green= (100 * hourprox * hourprox)
+        blue = 0
+    } else if ((minprox = proximity(i, min, 2.0)) != 0) {
+        red = 0
+        green = (100 * minprox * minprox)
+        blue = (100 * minprox * minprox)
     }
     pixelWriter(i,red,green,blue);
   }

And these final few things have between them eked out another 15% or so.

	Time/μs	Code size	Free memory
Baseline	49,373	6.58%	60,712b
Global variables	44,666	6.57%	60,752b
Optimising writePixel	36,400	7.06%	60,048b
Optimising writeFrame	36,396	7.17%	59,908b
Optimising Neopixel calls	34,043	7.15%	60,060b
Removing array references	31,876	7.13%	60,044b
Global functions	15,563	7.31%	59,752b
Hoist table lookups	14,657	7.38%	60,344b
Other optimisations	12,196	7.44%	60,392b
Total change	-75.3%	+13%	+0.5%

9. So the final scores there

The impoclock update routine which, written straightforwardly, took nearly 50ms, got optimised down to a little over 12ms: four times as fast. Or, put differently, a peak update rate of 20 frames per second became 80 frames per second.

Most, though perhaps not quite all, of these sorts of optimisations can be applied quite widely to different Squirrel programs. And if you’re writing code that might get widely reused, you never know what timing requirements other users of your code will have, so it behoves you to optimise as much as is sensible. (Conversely, it’s not as sensible to optimise every microsecond out of a routine that nobody’s going to be calling all that often.)

The optimisations have come at some cost in readability, as well as in code size and memory usage. There’s not much that can be done about the readability – but optimising Squirrel code for size, as opposed to speed, will be the subject of another post.

Monday, 15 December 2014

Help, my X clipboard is broken

Whenever I drag-select anything in xterm, the selection immediately gets cleared. Similarly in XEmacs, even when using shift and arrow-keys to select, the selection vanishes straightaway.

Bashing all the modifier keys doesn't help. Switching to console (Ctrl-Alt-F2) and back doesn't help. Unplugging and re-plugging the keyboard doesn't help.

The culprit? gitk. For some reason, when you use the "Line diff"/"Markup words" drop-down, it sometimes mucks up the X clipboard so badly that you have to quit and restart gitk to restore sanity.

I actually figured this out once before, but then had forgotten by the time it happened to me again, so this time I thought I'd write it down.

Wednesday, 22 October 2014

Software Practices At Electric Imp

This post is an elaboration of a discussion held at Electric Imp’s headquarters in October 2013, where we ran through a collection of software engineering best-practices that various team members have found useful (whether actually at EI or at previous jobs) for producing robust pieces of software. None of these are rocket science, and none are all that innovative, but having them all written down in one place helps everyone know what to expect – especially new team-members such as contractors.

Coding standards

We have coding-style guides for the main languages we use: Erlang, Javascript and C++. It’s a specific goal that we shouldn’t be able to tell, just by looking at the style of a section of code, which of us wrote it. We should probably have style guides for our other languages too: Squirrel, Objective-C, Java/Dalvik.

Each should resemble the accepted standards (or at least an accepted standard) for the language in question – as opposed to resembling each other. Our Javascript should look like idiomatic Javascript, not like idiomatic C++.

Of course, sometimes this isn’t possible. Both in Javascript and in C++, our code relies on and interacts with a large number of third-party components – which in many cases are themselves written in different styles.

Code reviews

All production code gets reviewed. Where convenient (developers in the same office), it gets reviewed before git push, but sometimes that’s not convenient (remote developers), so review-after-push is the best we can do. But where the change is large enough to have been done on a feature branch, we can get the best of both worlds by pushing that branch to the central server and doing the review before merge to master.

One obvious benefit of code reviews, is that a second pair of eyes checks over the code for things the original developer didn’t spot. (If done thoughtfully, it can be a great learning tool for more junior members of the team.)

But that isn’t actually the main benefit. The best thing about code reviews, is that they spread out knowledge about the codebase among all the team members, helping to avoid the situation where individual developers have their own “fiefdoms”, or code that only they know about. It’s the idea of the “Smallest Bus Queue Accident”, the slightly ghoulish concept of the number of team-members who would need to simultaneously be hit by a bus for the project to be imperilled. If everyone’s an irreplaceable specialist, SBQA=1, and even just one accident (or poaching, or mid-life crisis) will leave you in trouble. If everyone is familiar with everything (at least enough to maintain it), SBQA=n, the team size, and long-term risk to the project is much lower.

The film “Jurassic Park” is, at some level, a commentary on just how badly things can go wrong in a system with an inadequate (absent) culture of code review.

Unit tests

All commits that change production code, should also contain a unit-test for the change. All the unit-tests are run in every build of the codebase (top-level scons invocation). The primary and most obvious reason for this, is that we’ve all written code that flat-out didn’t do what we intended, and having our assumptions checked automatically before git push keeps those misunderstandings out of the master sources. But it turns out that there are also other benefits of cultivating a habit of unit-testing.

For one thing, a healthy collection of unit-tests can act as documentation for the code under test, showing its expected behaviour and suggesting which parts of its behaviour are intended and essential, and which are accidental and possibly unwanted. Unlike many other forms of documentation (including, uh, the form of the blog), documentation-by-unit-test is guaranteed to be up-to-date, as the build breaks any time it doesn’t match the implementation.

Another benefit of unit-testing is that it acts as a forcing-function for modularity: if it’s hard to see how to test your code one part at a time, because you can’t get in between the parts, then that’s a warning sign that the parts are more tightly-coupled is healthy for maintainability.

There is, though, a trade-off there, a judgement call to be made. In C++ there’s a cost associated with virtual functions (which are needed for mocking and for test doubles in general), and there’s also a human cost associated with the added complexity of introducing test doubles for every single class (even if they’re systematically named and located, that’s still doubling the number of classes in any library). But in general the design of the system should offer enough modularity to provide an adequate sprinkling of seams for introducing test doubles, which means that in general extracting an interface solely so that it can be mocked-out, is frowned upon. In Javascript, the performance cost part doesn’t apply (all functions are, in C++ terms, “virtual”, and so all dependencies can be mocked), but it may still be worth choosing carefully how large the “unit” in each unit-test is, with the overarching goal of retaining readability of the codebase as a whole. That is to say, the “unit” might most easily be more than one class, or more than one file.

Judgement is also needed when writing the tests themselves. It’s all too easy to write tests that are either too precise and thus too fragile against correct changes to the code, or too vague and thus capable of missing incorrect changes. One of our developers was unlucky enough to encounter a bogus test in the very first change he made to our C++ codebase: adding some debug logging of the imp’s current list of servers. A unit-test started failing because getServer() was being called “too many” times. But pretty clearly, it wasn’t part of the external contract of the class how many times it called getServer() – the test was too precise. The mantra should be: test the functionality of the unit, not the implementation. If I want to know whether the implementation is the same as it was yesterday, I’ll use diff – I want the tests to tell me whether it’s fulfilling its external contract the same way it was yesterday, following any changes to its implementation. Practice makes perfect, here: the more tests you write, the better you get at judging these things.

There’s a school of thought that says that tests are so important that you should write them first, before the implementation: test-driven development. In our experience that works well for some tasks – self-contained, purely algorithmic ones – and less well for others. For instance, when writing driver code, the largest source of bugs is an incorrect mental model of how the hardware works: but as the same developer writes the tests then the code, it’s too easy for the tests and the code to be consistent with each other but both wrong. Where it does work, though, it works brilliantly: the imp’s memory allocator was written test-first – and, as it’s a well-defined and well-understood problem, writing the tests was relatively easy.

Test-driven development is as much about improving the quality of the tests as it is of the code. If you know (or think you know) that some particular functionality will be needed in the implementation, but the tests you’ve written so far haven’t exercised it yet, it does make you stop and think: how do I write a test for that – how do I get the system into that state? Either it’s forcing you to come up with a test you didn’t realise you needed – or, even better, it’s telling you that your expected elaboration wasn’t needed at all.

All told, I don’t think it’s a huge exaggeration to say, “If you don’t have tests, you don’t really have the code” – because without the tests, you can’t do anything with it: the path to being able to refactor it, is nearly as long as the path to writing it again from scratch.

System tests

Unit tests are good at ensuring that individual modules fulfill their contracts. But the task of writing quality software is, sadly, still not done at that point: assembling these contracts together to form an overall system is still a human process, fertile with opportunities for human error. Unit testing reduces the risk of inter-module bugs, but cannot eliminate it altogether.

So at some point you’re going to have to test larger parts of the system. Michael Feathers’ rules of thumb as to what is a unit test and what isn’t, are widely quoted: the key distinction being, a unit test should be basically instantaneous and invisible, at least when successful. System tests are a bit more heavyweight, and might be more involved to set up: for instance, target hardware might be needed.

In theory, system testing proceeds via structural induction: prove that component A works, then prove that A+B works, then A+B+C, and so on until you’ve tested the whole software system. In practice, it usually suffices to test major components, and then the system as a whole: for instance, the imp hardware, then the imp server, then end-to-end tests of the whole enchilada.

Just as unit tests can be viewed as documentation for the code, system tests can be viewed as high-level specifications of the entire system’s behaviour: “when an imp contacts the server, it’s sent the appropriate Squirrel for the device it’s plugged into”.

Static analysis

The big idea with static analysis (and, for that matter, dynamic analysis), is this: for any property that your code must have in order to be correct, it’s more reliable to have an automated check that your code has that property, than it is to rely on every programmer to manually keep it in mind the whole time. For instance, Javascript: The Good Parts – a book conspicuously much thinner than “Javascript: The Definitive Guide” – is really about avoiding language features that cause misleading or buggy code. Most of the traps the author describes, can be detected by the JSLint or JSHint static-analysis tools; running one of those tools before committing, can thus keep questionable or hard-to-maintain constructs out of the codebase.

In the wild-west days of C, similar coding traps were caught by a program called lint – though when C++ came along,

Lots of warnings:

g++ -Wall -Wextra -Werror -Wundef -Wno-unused-parameter -Woverloaded-virtual -Wlogical-op -fstrict-enums -Wno-long-long -Wpointer-arith -Wnon-virtual-dtor -Wno-sign-conversion -Wunused-but-set-variable

it designed out most of the need for lint, and incorporated much of the rest into the compiler. Some of it is in the compiler only in optional warnings or errors – so we always run GCC in a mode with lots of warnings and errors. (On projects that also use other compilers – MSVC for example – it’s a good idea to turn on all the warnings in all of them, as every compiler detects different problems. But here at EI, all our targets are GCC-based.) We’re happy to keep looking, but we’ve yet to find a lint-like tool for C++ that finds real bugs.

Other languages have their own static-analysis tools: Erlang has Dialyzer, and XCode has some stuff for Objective-C. Though more are always welcome: for instance, code written against node.js makes widespread use of callbacks as part of asynchronous APIs. So it’s a correctness constraint that any routine that takes a parameter called cb, must call it exactly once (not twice, not no times) on every path through the function. That should, in theory, be amenable to automated testing (at least for most easy cases).

Dynamic analysis

Dynamic analysis tools also tend to be language-specific. In C++, the use of Valgrind is well-known; Helgrind, bundled with Valgrind, can be awkward to use but can also be invaluable; and mutrace and perf are helpful at that stage of project maturity when the question becomes, “yes, yes, it works, but why isn’t it faster?” (That’s usually a pretty late stage of maturity, on the timeless basis that it’s easier to optimise correct code, than it is to correct optimised code.)

The one act of dynamic analysis that all types of code can enjoy, though, is code-coverage analysis. Code-coverage analysis is the ugly cousin of test-driven development, but is good enough for a rainy weekend in Norfolk. Developing while keeping an eye on the coverage statistics is almost, but not quite, as reliable as test-driven development as a method of ensuring that your code has good-quality tests.

Continuous integration

We use the very splendid Jenkins continuous-integration server (autobuilder). Its main responsibility is running, after each commit, all the tests and analyses that would be too time-consuming for developers to run before each commit. Or too awkward – Jenkins runs all the system tests that require special hardware, and also runs builds on all supported host platforms.

The quality bar for pushes to master, is clean runs on all of these test suites; any failures are stop-the-line emergencies. If a build or tests is failing, the very next push must be the fix, or other developers can’t continue pushing (because they can’t know whether their own work passes that test or not). This is usually known as “do not commit on red” – although with Git, it’s actually the “push”, not “commit”, operation that’s the relevant one. Because of that, a failing build means nobody else can make progress, so fixing it is viewed as an emergency.

Using Pivotal

We use Pivotal Labs’ Pivotal Tracker to keep track of work items. As Electric Imp is spread out over several time-zones, it’s important to have an online tool that we can use to look at the status of projects or tasks without having to bother actual human beings. We looked at a few online Agile tools, and the simplicity and easy visibility of Pivotal made it definitely the best.

For those who haven’t seen it, it arranges tasks into (basically) three columns: “current”, “backlog”, and “icebox”. “Current” is for the current sprint, “backlog” is for upcoming sprints, and “icebox” for everything else. Tasks can be moved around and re-ordered by drag-and-drop. A task is either a user-story, a “chore” (which we use for refactoring or technical-debt tasks that aren’t end-user-visible), or a bug.

We use this system to work on tasks in priority order, as prioritised by the product owner:

Anyone can add new stories, chores, or bugs to a “new tasks here” section in the icebox.
The product owner has a think about the new task (if it’s a story or a bug), and prioritises it either into the backlog, or into a “nice to haves” section in the icebox.
If it’s a chore, the developers prioritise it between themselves (typically, just before a story which would be aided by the cleanup, or which would need re-doing if done before the cleanup).
As developers finish up previous tasks, they pick the topmost one in the backlog each time. (We don’t really deal in “sprints” like canonical Agile – it’s more task-by-task, like kanban.)

Using Git

Unlike, say, Subversion, Git makes having a neat project history achievable. So we try to achieve it.

Bow-shaped feature branches let us view individual features either entire or as composed of a series of patches. That’s good for cherry-picking, it’s good for (ahem) reverting, and it’s good for rereading old commits to understand the origins of a piece of code (or of a bug). In general we name single-developer feature branches with the developer’s initials plus a very short indicator of the feature, such as pdh-discovery or rs-schema. (The developer’s initials serve as a kind of “watch out, I might rebase this branch at any moment”.)

The server and the client have different needs from their respective release processes, mainly because of “in-the-drawer” syndrome: previous client releases remain important in perpetuity, because a user could stick one in a drawer, forget about it, remember a year later, and then try and connect it to modern-day servers. The same doesn’t apply the other way around: previous server releases are, to a large extent, yesterday’s newspapers.

Ensure it all happens

We’re the engineering department; we’ve all been hired to do engineering. It’s pretty likely that we’re the last line of defence for things being well-engineered and done right: very few customers or project-managers have ever been heard to say, “Yes, yes, but can’t you take a bit longer and do it properly?” – indeed, many have been heard to say the exact opposite.

So for instance, adding automated tests is always an integral part of the development process – whether it’s done strictly before coding the functionality, as the test-driven school would say, or in a more intermingled way. It’s not an optional extra. Developers incorporate the building of tests into the original development estimates, and project owners don’t (shouldn’t) accept a story as “done” unless the testing is in-place too.

Being the engineering department, we don’t necessarily get to decide the balance between engineering goals and commercial goals (in fact, we probably don’t want to do that). But it does behove us to make sure that the people who do make those decisions, know what the engineering consequences will be.

Saturday, 12 April 2014

Is everything in this release pushed upstream?

Engineering is repeatability. If what you’re doing isn’t repeatable, it isn’t engineering, and may in fact just be performance art instead. (Not that there’s anything wrong with performance art per se, but if you bought a tin marked “Engineering”, and instead it contained performance art, you’d take it back to the shop.)

Part of being able to repeat something is knowing what it is that you’re trying to repeat. Or, now that we’re moving past the “fanciful opening paragraph” phase of this blog post and into specifics: part of making a software release is knowing exactly what source code it was built from, so that if there’s ever a problem with the release, the exactly-corresponding source code can be analysed to discover the cause. (Sometimes tools can make this hard; sometimes they also have an option that makes it easier again.)

In the days of CVS and Subversion, when there was a single repository for a project, repeatability was typically achieved by using the tagging mechanism of those version-control systems: anyone could later check out that same tag and obtain code identical to the release. (In the days before CVS and Subversion, we probably just wrote a code snapshot to a CD or a floppy and kept it somewhere safe. We were cowboys once and young.)

But now we live in the days of Git. It’s possible to use Git in a purely peer-to-peer fashion, every developer being their own island – but that could easily cause unrepeatability of releases, releases as performance art. To make releases as acts of engineering, you’re going to have to nominate one repository as the golden or master one, the one where the tags and SHAs of releases live.

The Linux kernel project, for which Git was originally designed, does this by having a release manager (Linus Torvalds himself) whose personal (but centrally-hosted and shared) repository is the golden one, and who makes releases from code he’s pulled from other developers’ repositories as they complete features.

There’s another way of doing it, though: one which is more familiar to developers who have used CVS or Subversion-era tools. And that is to have a central golden repository for the project, which everyone has write access to, and to use git push as if it were svn commit. But it may be inconvenient to build the releases on the machine hosting the Git repository, so release managers will still do that on their own machines in their own local Git repositories.

So, especially if the release manager is also a developer, the engineering question becomes: is this local commit, from which I’m building the release, also in the golden repository? Is my SHA upstream? (Or am I just pleased to see you?) This is an important question because, if you use bow-shaped branches – or any other rebase workflow – then the SHA of your commit may well be different by the time it has gone upstream.

What’s really needed is a short and pithy Git command that looks a bit like “git is-releasable HEAD” – but there isn’t one, so in order to answer the question “Is it OK to build a release from this commit?”, you need to build that facility from several Git commands.

First off, clearly it’s not OK to build the release if you have uncommitted changes:

git status # Sometimes needed to refresh the index (?) git diff-index --quiet HEAD || echo Nope

Now you need to check whether your SHA exists upstream:

test "`git branch -r --contains HEAD`" != "" || echo Nope

That works by listing all the branches in the golden repository which contain your current commit. If there’s at least one such branch, it’s OK to build the release – otherwise, no branch in the golden repository contains your commit, so it isn’t currently repeatable, so it isn’t OK.

Some of this functionality is offered by git describe --dirty, but apart from being less fun than it sounds, it doesn’t answer the question about the golden repository. (Perhaps because, in its original Linux home, releases are always made from the golden repository.)

At Electric Imp, the version information that we incorporate into every build of the software, includes not just the Git SHA but also the “OK to release” information and a corresponding Git tag name (if any). The “OK to release” is marked by having a “+” in the version number when it’s not OK, i.e. when this must be an internal build only. (You can’t always force people to do the Right Thing, but you can at least make sure that they’re told about it when they’re doing the Wrong Thing.) Here’s the relevant part of our top-level SConstruct file:

gitRevision = subprocess.Popen(["git","rev-parse","--short","HEAD"],
                               stdout=subprocess.PIPE
                               ).communicate()[0].strip()
subprocess.check_output(["git", "status"]) # Refresh index ready for diff-index
if subprocess.call(["git", "diff-index", "--quiet", "HEAD"]):
    # Local diffs, give it a plus
    gitRevision += "+"
    print "Local diffs -- NOT FOR PRODUCTION USE"
else:
    gitRemote = subprocess.Popen(["git", "branch", "-r", "--contains", "HEAD"],
                                 stdout=subprocess.PIPE
                                 ).communicate()[0].strip()
    if gitRemote == "":
        # Doesn't exist in upstream branches (i.e., not pushed): give it a plus
        gitRevision += "+"
        print "Ahead of upstream -- NOT FOR PRODUCTION USE"

gitTag=''

if "+" not in gitRevision:
  gitTagPipe = subprocess.Popen(["git", "describe", "--exact-match"],
                            stdout=subprocess.PIPE)
  rawtag = gitTagPipe.communicate()
  if gitTagPipe.returncode:
    print "Not tagged -- NOT FOR EXTERNAL RELEASE"
  else:
    gitTag = " - " + rawtag[0].strip()

ei_version = gitRevision + gitTag + " - " + datetime.datetime.now().strftime("%c")

So a build from a known and stable SHA (i.e., one which exists in the golden repository) gets a version looking like this:

92a5ff6 - Fri Feb 7 18:25:04 2014

whereas a local build with a bunch of stuff I haven’t pushed upstream yet gets a version looking like this:

c47c6da+ - Sat Apr 12 20:51:25 2014

with the tell-tale “+” in the SHA. And a version that’s tagged in Git for external release has the tag added to the version string too:

af0f28a - release-27.10 - Fri Dec 13 11:08:38 2013

and that is, in fact, the current format of the string returned on the imp itself by imp.getsoftwareversion() – though we make deliberately very weak guarantees about the format of the string (it’s human-readable, it’s different for different releases) in case it ever needs to change in the future.

Whitebait, Kleftiko, Ekmek Special