Not in fact any relation to the famous large Greek meal of the same name.

Friday, 19 December 2014

Optimising Squirrel for speed

TL;DR: avoiding slow Squirrel constructs in favour of faster ones, makes this program run four times faster.

Electric Imp’s little embedded WiFi gadgets are programmed in a scripting language called Squirrel. Like Javascript or Python, Squirrel is compiled to bytecode and executed in an interpreter. This brings lots of advantages – notably, that the interpreter acts as a sandbox and allows crashes and other bugs to be debugged remotely – but sometimes, in embedded systems, the performance consequences of that design can be an issue.

So it’s helpful to know how to write Squirrel that runs as fast as possible on the Electric Imp platform. Electric Imp’s own website contains some generic guidance, but it’s helpful to see it applied to a real example: in this case the impoclock, a ring of sixty WS2812B “Neopixels” implementing an analogue clock.

You can follow the story as a Github gist: starting with the baseline, which consists mostly of the stock NeoPixels class from the Electric Imp examples repository. (And perfectly good example code it is too: the optimisation methods in this post will make the code faster, but also less readable and so less suitable to learn from or adapt.) You can also sneak a peek at the finished version as a Github gist.

0. Baseline

As always, there’s no point trying to optimise something you can’t measure: optimisations applied blindly could as likely be unhelpful as helpful. So the first patch does nothing except add code to time one iteration of the main displaytime routine, and also to log the available memory:

@@ -132,6 +132,8 @@ pixel <- NeoPixels(spi, NUM_PIXELS)
 
 server.log(imp.getsoftwareversion() + " / " + imp.getmacaddress())
 
+server.disconnect();
+
 background <- [0,0,0]
 
 function proximity(a, b, limit) {
@@ -177,4 +179,9 @@ function displaytime(){
   imp.wakeup(0.25, displaytime)
 }
 
+local getus = hardware.micros.bindenv(hardware);
+local us = getus();
 displaytime()
+us = getus() - us;
+local freemem = imp.getmemoryfree();
+server.log(us+" "+freemem);

(The version history is also visible as Gist history, each section of this post as a separate commit.) The code disconnects from the server before taking the measurements: this helps with getting consistent results, as no WiFi interrupts can change the timings, nor any network buffers change the memory usage. The baseline measurements are:

Time/μsCode sizeFree memory
49,3736.58%60,712b

That’s about 50,000 microseconds, or 50 milliseconds, or a twentieth of a second to update sixty Neopixels: actually plenty for an analogue clock with a one-second resolution, but other Neopixel applications might be more demanding.

1. Global variables

So let’s get amongst it. The main thing about optimising Squirrel, is that string lookups are expensive. In Squirrel, table members are accessed by string lookup – and that includes members of the global table. However, local variables are accessed more rapidly by (integer) stack offset – and that includes locals of the implicit top-level function. That suggests that it would make a difference just to turn some of the global variables into locals:

@@ -128,15 +128,15 @@ const NUM_PIXELS = 60
 
 spi <- hardware.spi257
 spi.configure(MSB_FIRST, 7500)
-pixel <- NeoPixels(spi, NUM_PIXELS)
+local pixel = NeoPixels(spi, NUM_PIXELS)
 
 server.log(imp.getsoftwareversion() + " / " + imp.getmacaddress())
 
 server.disconnect();
 
-background <- [0,0,0]
+local background = [0,0,0];
 
-function proximity(a, b, limit) {
+local function proximity(a, b, limit) {
 //  return a == b
   local diff = math.abs(a-b)
   if (diff > 60-limit) {

And indeed this does make a measurable difference: it’s already almost ten per cent faster.

Time/μsCode sizeFree memory
Baseline49,3736.58%60,712b
Global variables44,6666.57%60,752b

2. Optimising writePixel

String lookups definitely seem like a profitable area of investigation. So let’s look at all the string lookups done inside the writePixel method: there are lots, both to look up frame and bits in the surrounding class, and to look up the seek and writeblob members of the blob object.

All of these can be replaced by references to local variables. Squirrel allows functions to refer to local variables in enclosing functions: this can be used to collect together a collection of values plus a function, in such a way that the function always gets called with those values available. In this patch, the variables frameSeeker, frameWriter, and refBits are local variables of the NeoPixels constructor: this means that under normal circumstances, they would evaporate at the end of the constructor call. But here, because they’re referenced inside the function assigned to this.pixelWriter, they are “captured” as “outers”, their lifetime continues, and they remain available to all future calls to the pixelWriter function. The whole collection – the function plus its collected-together variables – is called a closure, and the function is described as having closed over the collected-together variables; similar concepts (differing sometimes in details) exist in other dynamic languages, such as Javascript.

This is probably the least idiomatic and most confusing optimisation in the whole post, but it’s an important one and is worth taking the time to understand. Access to local variables, even “outers”, is really fast – and the new pixelWriter, unlike the old writePixel, refers only to local variables.

@@ -51,6 +51,8 @@ class NeoPixels {
  frameSize       = null; // number of pixels per frame
  frame           = null; // a blob to hold the current frame
 
+ pixelWriter     = null;
+
  // _spi - A configured spi (MSB_FIRST, 7.5MHz)
  // _frameSize - Number of Pixels per frame
     
@@ -66,6 +68,18 @@ class NeoPixels {
         
   clearFrame();
   writeFrame();
+
+  local frameSeeker = frame.seek.bindenv(frame);
+  local frameWriter = frame.writeblob.bindenv(frame);
+  local refBits = bits;
+  this.pixelWriter = function(p, color) {
+      frameSeeker(p*BYTESPERPIXEL);
+
+      // red and green are swapped for some reason, so swizzle them back
+      frameWriter(refBits[color[1]]);
+      frameWriter(refBits[color[0]]);
+      frameWriter(refBits[color[2]]);
+  }
  }
     
  // fill the array of representative 1-wire waveforms. 
@@ -99,13 +113,7 @@ class NeoPixels {
  // color is an array of the form [r, g, b]
     
  function writePixel(p, color) {
-  frame.seek(p*BYTESPERPIXEL);
-        
-  // red and green are swapped for some reason, so swizzle them back 
-        
-  frame.writeblob(bits[color[1]]);
-  frame.writeblob(bits[color[0]]);
-  frame.writeblob(bits[color[2]]);    
+     pixelWriter(p, color);
  }
     
  // Clears the frame buffer

The results justify all that effort: the code is now more than 25% faster than when we started. (It uses a bit more memory and code space, but for many purposes that’s a fair exchange.)

Time/μsCode sizeFree memory
Baseline49,3736.58%60,712b
Global variables44,6666.57%60,752b
Optimising writePixel36,4007.06%60,048b

3. Optimising writeFrame

The same trick applies to the writeFrame method. In fact, the original method can be replaced completely: notice that client code which calls pixel.writeFrame() doesn’t need to change. The syntax for calling class-members which are functions, is the same as for calling methods.

@@ -52,6 +52,7 @@ class NeoPixels {
  frame           = null; // a blob to hold the current frame
 
  pixelWriter     = null;
+ writeFrame      = null;
 
  // _spi - A configured spi (MSB_FIRST, 7.5MHz)
  // _frameSize - Number of Pixels per frame
@@ -61,6 +62,15 @@ class NeoPixels {
   this.frameSize = _frameSize;
   this.frame = blob(frameSize*BYTESPERPIXEL + 1);
   this.frame[frameSize*BYTESPERPIXEL] = 0;
+
+  local spiWriter = _spi.write.bindenv(_spi);
+  local refFrame = frame;
+
+                // writes the frame buffer to the pixel strip
+                // ie - this function changes the pixel strip
+  this.writeFrame = function() {
+      spiWriter(refFrame);
+  }
         
   // prepare the bits array and the clearblob blob
         
@@ -123,13 +133,6 @@ class NeoPixels {
   frame.seek(0);
   for (local p = 0; p < frameSize; p++) frame.writeblob(clearblob);
  }
-    
- // writes the frame buffer to the pixel strip
- // ie - this function changes the pixel strip
-    
- function writeFrame() {
-  spi.write(frame);
- }
 }
 
 const NUM_PIXELS = 60

As the example code only calls writeFrame once per iteration – as opposed to the sixty times it calls writePixel – the results are a lot less dramatic. Still a slight improvement, though. The remaining methods of class NeoPixels aren’t called even once per frame – just at startup – so there’s little point optimising them.

Time/μsCode sizeFree memory
Baseline49,3736.58%60,712b
Global variables44,6666.57%60,752b
Optimising writePixel36,4007.06%60,048b
Optimising writeFrame36,3967.17%59,908b

4. Optimising Neopixel calls

The faster pixelWriter function sped up writePixels without changing the API to class NeoPixels. But there’s further speed gains to be had by changing the client code to use it directly. The same is true, to a lesser extent, of writeFrame.

No bindenv is needed for pixelWriter, because it wasn’t a member-function to start with – just a plain free-standing function.

Also in this patch, string lookups for the class-member BYTESPERPIXEL are replaced with the use of a Squirrel const. Consts, like locals, are fast. (This part of the patch really belonged in Stage 2, but I was confused by the SHOUTY name into thinking that BYTESPERPIXEL was already a constant.)

The patch follows the convention that constants belonging to a particular class – which can’t, in Squirrel, be declared as consts inside the class – are given names starting with the class’s name. The same could also be done for the other unchanging values, ONE and NONE – but they’re only used at startup, so the update speed wouldn’t be improved.

@@ -1,3 +1,6 @@
+
+const NEOPIXELS_BYTESPERPIXEL = 24;
+
 class NeoPixels {
     
  // Driver for the World Semi WS2812 - aka Adafruit NeoPixel
@@ -33,7 +36,6 @@ class NeoPixels {
     
  ZERO            = 0xC0;
  ONE             = 0xF8;
- BYTESPERPIXEL   = 24;
     
  // when instantiated, the neopixel class will fill this array with blobs to 
  // represent the waveforms to send the numbers 0 to 255. This allows the blobs to be
@@ -60,8 +62,8 @@ class NeoPixels {
  constructor(_spi, _frameSize) {
   this.spi = _spi;
   this.frameSize = _frameSize;
-  this.frame = blob(frameSize*BYTESPERPIXEL + 1);
-  this.frame[frameSize*BYTESPERPIXEL] = 0;
+  this.frame = blob(frameSize*NEOPIXELS_BYTESPERPIXEL + 1);
+  this.frame[frameSize*NEOPIXELS_BYTESPERPIXEL] = 0;
 
   local spiWriter = _spi.write.bindenv(_spi);
   local refFrame = frame;
@@ -83,7 +85,7 @@ class NeoPixels {
   local frameWriter = frame.writeblob.bindenv(frame);
   local refBits = bits;
   this.pixelWriter = function(p, color) {
-      frameSeeker(p*BYTESPERPIXEL);
+      frameSeeker(p*NEOPIXELS_BYTESPERPIXEL);
 
       // red and green are swapped for some reason, so swizzle them back 
       frameWriter(refBits[color[1]]);
@@ -100,7 +102,7 @@ class NeoPixels {
         
   bits = array(256);
   for (local i = 0; i < 256; i++) {
-   local valblob = blob(BYTESPERPIXEL / 3);
+   local valblob = blob(NEOPIXELS_BYTESPERPIXEL / 3);
    valblob.writen((i & 0x80) ? ONE:ZERO,'b');
    valblob.writen((i & 0x40) ? ONE:ZERO,'b');
    valblob.writen((i & 0x20) ? ONE:ZERO,'b');
@@ -113,7 +115,7 @@ class NeoPixels {
   }
         
   // now fill the clearblob
-  for(local j = 0; j < BYTESPERPIXEL; j++) {
+  for(local j = 0; j < NEOPIXELS_BYTESPERPIXEL; j++) {
    clearblob.writen(ZERO, 'b');
   }
  }
@@ -140,6 +142,8 @@ const NUM_PIXELS = 60
 spi <- hardware.spi257
 spi.configure(MSB_FIRST, 7500)
 local pixel = NeoPixels(spi, NUM_PIXELS)
+local pixelWriter = pixel.pixelWriter;
+local frameWriter = pixel.writeFrame;
 
 server.log(imp.getsoftwareversion() + " / " + imp.getmacaddress())
 
@@ -184,9 +188,9 @@ function displaytime(){
       colour[1] = 0
       colour[2] = 100
     }
-    pixel.writePixel(i,colour)
+    pixelWriter(i,colour)
   }  
-  pixel.writeFrame()
+  frameWriter();
   imp.wakeup(0.25, displaytime)
 }
 

Even this small change makes another 6% difference.

Time/μsCode sizeFree memory
Baseline49,3736.58%60,712b
Global variables44,6666.57%60,752b
Optimising writePixel36,4007.06%60,048b
Optimising writeFrame36,3967.17%59,908b
Optimising Neopixel calls34,0437.15%60,060b

5. Removing array references

In Squirrel, arrays such as colour are reference-counted, heap-allocated objects. That sounds a bit expensive: perhaps it’d be faster to pass the three elements (red, green and blue) around separately.

(This was also the motivation for making pixelWriter separate from writePixel: this way, the existing API of class NeoPixels isn’t changed.)

@@ -84,13 +84,13 @@ class NeoPixels {
   local frameSeeker = frame.seek.bindenv(frame);
   local frameWriter = frame.writeblob.bindenv(frame);
   local refBits = bits;
-  this.pixelWriter = function(p, color) {
+  this.pixelWriter = function(p, r,g,b) {
       frameSeeker(p*NEOPIXELS_BYTESPERPIXEL);
 
       // red and green are swapped for some reason, so swizzle them back 
-      frameWriter(refBits[color[1]]);
-      frameWriter(refBits[color[0]]);
-      frameWriter(refBits[color[2]]);
+      frameWriter(refBits[g]);
+      frameWriter(refBits[r]);
+      frameWriter(refBits[b]);
   }
  }
     
@@ -125,7 +125,7 @@ class NeoPixels {
  // color is an array of the form [r, g, b]
     
  function writePixel(p, color) {
-     pixelWriter(p, color);
+     pixelWriter(p, color[0], color[1], color[2]);
  }
     
  // Clears the frame buffer
@@ -170,25 +170,27 @@ function displaytime(){
     hourled = hourled - 60
   }
   for(local i = 0; i<NUM_PIXELS; i++){
-    local colour = [background[0], background[1], background[2]]
+    local red = background[0];
+    local green = background[1];
+    local blue = background[2];
     local hourprox = proximity(i, hourled, 3.0)
     if (hourprox != 0) {
-      colour[0] = (100 * hourprox * hourprox).tointeger()
-      colour[1] = (100 * hourprox * hourprox).tointeger()
-      colour[2] = 0
+      red = (100 * hourprox * hourprox).tointeger()
+      green= (100 * hourprox * hourprox).tointeger()
+      blue = 0
     }
     local minprox = proximity(i, now.min, 2.0)
     if (minprox != 0) {
-      colour[0] = 0
-      colour[1] = (100 * minprox * minprox).tointeger()
-      colour[2] = (100 * minprox * minprox).tointeger()
+      red = 0
+      green = (100 * minprox * minprox).tointeger()
+      blue = (100 * minprox * minprox).tointeger()
     }
     if (proximity(i, now.sec, 0)) {
-      colour[0] = 100
-      colour[1] = 0
-      colour[2] = 100
+      red = 100
+      green = 0
+      blue = 100
     }
-    pixelWriter(i,colour)
+    pixelWriter(i,red,green,blue);
   }  
   frameWriter();
   imp.wakeup(0.25, displaytime)

And yes, it turns out heap objects are slower than the alternative.

Time/μsCode sizeFree memory
Baseline49,3736.58%60,712b
Global variables44,6666.57%60,752b
Optimising writePixel36,4007.06%60,048b
Optimising writeFrame36,3967.17%59,908b
Optimising Neopixel calls34,0437.15%60,060b
Removing array references31,8767.13%60,044b

6. Global functions

Back on the warpath against string lookups. And there are several still to squash: for instance, the call to math.abs looks up the string “math” in the global table to find the maths library, then looks up the string “abs” in that table (well, actually the maths library is a special table-like object) to find the function to call.

This patch also hoists out the lookup of “date” in the global table, “imp” in the global table, and “wakeup” in the imp table – though, as those were only called once per update, the impact from those is less.

The rule of thumb is: any time you see a dot between two names, such as in “math.abs”, then that’s a string lookup. And any time you see a reference to something from the global table, such as the imp object or date function, then that’s a string lookup too. Accesses to local variables – whether those of the current function or “outers” from any function enclosing it – are not string lookups.

@@ -151,9 +151,12 @@ server.disconnect();
 
 local background = [0,0,0];
 
+local mabs = math.abs.bindenv(math);
+local getdate = date;
+local impwaker = imp.wakeup.bindenv(imp);
+
 local function proximity(a, b, limit) {
-//  return a == b
-  local diff = math.abs(a-b)
+  local diff = mabs(a-b);
   if (diff > 60-limit) {
     diff = diff - (60 - limit) + 1
   }
@@ -164,7 +167,7 @@ local function proximity(a, b, limit) {
 }
 
 function displaytime(){
-  local now = date()
+  local now = getdate()
   local hourled = (now.hour + 0) * 5 + (now.min / 12.0).tointeger()
   if (hourled > 59) {
     hourled = hourled - 60
@@ -193,7 +196,7 @@ function displaytime(){
     pixelWriter(i,red,green,blue);
   }  
   frameWriter();
-  imp.wakeup(0.25, displaytime)
+  impwaker(0.25, displaytime)
 }
 
 local getus = hardware.micros.bindenv(hardware);

Because math.abs gets called several times per pixel, the speed-up is quite significant: the time for the entire update is halved. The small increases in code size and memory usage are easily forgiven.

Time/μsCode sizeFree memory
Baseline49,3736.58%60,712b
Global variables44,6666.57%60,752b
Optimising writePixel36,4007.06%60,048b
Optimising writeFrame36,3967.17%59,908b
Optimising Neopixel calls34,0437.15%60,060b
Removing array references31,8767.13%60,044b
Global functions15,5637.31%59,752b

7. Hoist table lookups

The few remaining string lookups in displaytime are mostly references to members of the table returned by Squirrel’s date() function. As they’re used over and over again inside the loop, it should be faster to do each lookup just once, before entering the loop: in the compiler literature, this is called loop hoisting.

@@ -166,9 +166,13 @@ local function proximity(a, b, limit) {
   return 0
 }
 
-function displaytime(){
+local displaytime;
+displaytime = function(){
   local now = getdate()
-  local hourled = (now.hour + 0) * 5 + (now.min / 12.0).tointeger()
+  local hour = now.hour;
+  local min  = now.min;
+  local sec  = now.sec;
+  local hourled = (hour + 0) * 5 + (min / 12.0).tointeger()
   if (hourled > 59) {
     hourled = hourled - 60
   }
@@ -182,13 +186,13 @@ function displaytime(){
       green= (100 * hourprox * hourprox).tointeger()
       blue = 0
     }
-    local minprox = proximity(i, now.min, 2.0)
+    local minprox = proximity(i, min, 2.0)
     if (minprox != 0) {
       red = 0
       green = (100 * minprox * minprox).tointeger()
       blue = (100 * minprox * minprox).tointeger()
     }
-    if (proximity(i, now.sec, 0)) {
+    if (proximity(i, sec, 0)) {
       red = 100
       green = 0
       blue = 100
And it’s shaved a tiny bit more off the time.
Time/μsCode sizeFree memory
Baseline49,3736.58%60,712b
Global variables44,6666.57%60,752b
Optimising writePixel36,4007.06%60,048b
Optimising writeFrame36,3967.17%59,908b
Optimising Neopixel calls34,0437.15%60,060b
Removing array references31,8767.13%60,044b
Global functions15,5637.31%59,752b
Hoist table lookups14,6577.38%60,344b

8. Other optimisations

Right, the gloves are off now. Anything goes that will make this code faster, and readability be gosh-darned.

First of all, we can get rid of all the string lookups in the date structure by just using Squirrel’s time() call instead, which is just a few integer divisions away from giving us the right answer. (Or, in my case, the wrong answer – the code below incorporates a bugfix which in the Gist only appears in a later commit.)

Sort-of similarly, it turns out that open-coding the equivalent of the math.abs function is fractionally faster than actually calling it.

Then we can hoist the lookups in the background array out of the loop. These are integer lookups – not the expensive string sort – but every little helps. (And if you’re thinking, “Why not just get rid of background altogether?”, then the answer is that the actual impoclock has an agent.on() method for setting the background colour, which isn’t shown here because it’s not relevant to the optimisations.)

One final micro-optimisation is to observe that a pixel showing the second hand can’t also be showing the minute hand, and one showing the minute hand can’t also be showing the hour hand. So the ifs in the loop can be if..elses, and we can save about a gnat’s crotchet of time by skipping calculations which we know we aren’t going to use.

@@ -151,12 +151,14 @@ server.disconnect();
 
 local background = [0,0,0];
 
-local mabs = math.abs.bindenv(math);
-local getdate = date;
+local gettime = time;
 local impwaker = imp.wakeup.bindenv(imp);
 
 local function proximity(a, b, limit) {
-  local diff = mabs(a-b);
+  local diff = a-b;
+  if (diff < 0) {
+      diff = -diff;
+  }
   if (diff > 60-limit) {
     diff = diff - (60 - limit) + 1
   }
@@ -168,34 +170,32 @@ local function proximity(a, b, limit) {
 
 local displaytime;
 displaytime = function(){
-  local now = getdate()
-  local hour = now.hour;
-  local min  = now.min;
-  local sec  = now.sec;
+  local now = gettime()
+  local hour = (now/3600)%12;
+  local min  = (now/60)%60;
+  local sec  = now%60;
   local hourled = (hour + 0) * 5 + (min / 12.0).tointeger()
   if (hourled > 59) {
     hourled = hourled - 60
   }
+  local bgred = background[0];
+  local bggreen = background[1];
+  local bgblue = background[2];
   for(local i = 0; i<NUM_PIXELS; i++){
-    local red = background[0];
-    local green = background[1];
-    local blue = background[2];
-    local hourprox = proximity(i, hourled, 3.0)
-    if (hourprox != 0) {
-      red = (100 * hourprox * hourprox).tointeger()
-      green= (100 * hourprox * hourprox).tointeger()
-      blue = 0
-    }
-    local minprox = proximity(i, min, 2.0)
-    if (minprox != 0) {
-      red = 0
-      green = (100 * minprox * minprox).tointeger()
-      blue = (100 * minprox * minprox).tointeger()
-    }
+    local red = bgred, green = bggreen, blue = bgblue;
+    local hourprox, minprox;
     if (proximity(i, sec, 0)) {
-      red = 100
-      green = 0
-      blue = 100
+        red = 100
+        green = 0
+        blue = 100
+    } else if ((hourprox = proximity(i, hourled, 3.0)) != 0) {
+        red = (100 * hourprox * hourprox)
+        green= (100 * hourprox * hourprox)
+        blue = 0
+    } else if ((minprox = proximity(i, min, 2.0)) != 0) {
+        red = 0
+        green = (100 * minprox * minprox)
+        blue = (100 * minprox * minprox)
     }
     pixelWriter(i,red,green,blue);
   }  

And these final few things have between them eked out another 15% or so.

Time/μsCode sizeFree memory
Baseline49,3736.58%60,712b
Global variables44,6666.57%60,752b
Optimising writePixel36,4007.06%60,048b
Optimising writeFrame36,3967.17%59,908b
Optimising Neopixel calls34,0437.15%60,060b
Removing array references31,8767.13%60,044b
Global functions15,5637.31%59,752b
Hoist table lookups14,6577.38%60,344b
Other optimisations12,1967.44%60,392b
Total change-75.3%+13%+0.5%

9. So the final scores there

The impoclock update routine which, written straightforwardly, took nearly 50ms, got optimised down to a little over 12ms: four times as fast. Or, put differently, a peak update rate of 20 frames per second became 80 frames per second.

Most, though perhaps not quite all, of these sorts of optimisations can be applied quite widely to different Squirrel programs. And if you’re writing code that might get widely reused, you never know what timing requirements other users of your code will have, so it behoves you to optimise as much as is sensible. (Conversely, it’s not as sensible to optimise every microsecond out of a routine that nobody’s going to be calling all that often.)

The optimisations have come at some cost in readability, as well as in code size and memory usage. There’s not much that can be done about the readability – but optimising Squirrel code for size, as opposed to speed, will be the subject of another post.

No comments:

Post a Comment

About Me

Cambridge, United Kingdom
Waits for audience applause ... not a sossinge.
CC0 To the extent possible under law, the author of this work has waived all copyright and related or neighboring rights to this work.