Whitebait, Kleftiko, Ekmek Special: October 2014

This post is an elaboration of a discussion held at Electric Imp’s headquarters in October 2013, where we ran through a collection of software engineering best-practices that various team members have found useful (whether actually at EI or at previous jobs) for producing robust pieces of software. None of these are rocket science, and none are all that innovative, but having them all written down in one place helps everyone know what to expect – especially new team-members such as contractors.

Coding standards

We have coding-style guides for the main languages we use: Erlang, Javascript and C++. It’s a specific goal that we shouldn’t be able to tell, just by looking at the style of a section of code, which of us wrote it. We should probably have style guides for our other languages too: Squirrel, Objective-C, Java/Dalvik.

Each should resemble the accepted standards (or at least an accepted standard) for the language in question – as opposed to resembling each other. Our Javascript should look like idiomatic Javascript, not like idiomatic C++.

Of course, sometimes this isn’t possible. Both in Javascript and in C++, our code relies on and interacts with a large number of third-party components – which in many cases are themselves written in different styles.

Code reviews

All production code gets reviewed. Where convenient (developers in the same office), it gets reviewed before git push, but sometimes that’s not convenient (remote developers), so review-after-push is the best we can do. But where the change is large enough to have been done on a feature branch, we can get the best of both worlds by pushing that branch to the central server and doing the review before merge to master.

One obvious benefit of code reviews, is that a second pair of eyes checks over the code for things the original developer didn’t spot. (If done thoughtfully, it can be a great learning tool for more junior members of the team.)

But that isn’t actually the main benefit. The best thing about code reviews, is that they spread out knowledge about the codebase among all the team members, helping to avoid the situation where individual developers have their own “fiefdoms”, or code that only they know about. It’s the idea of the “Smallest Bus Queue Accident”, the slightly ghoulish concept of the number of team-members who would need to simultaneously be hit by a bus for the project to be imperilled. If everyone’s an irreplaceable specialist, SBQA=1, and even just one accident (or poaching, or mid-life crisis) will leave you in trouble. If everyone is familiar with everything (at least enough to maintain it), SBQA=n, the team size, and long-term risk to the project is much lower.

The film “Jurassic Park” is, at some level, a commentary on just how badly things can go wrong in a system with an inadequate (absent) culture of code review.

Unit tests

All commits that change production code, should also contain a unit-test for the change. All the unit-tests are run in every build of the codebase (top-level scons invocation). The primary and most obvious reason for this, is that we’ve all written code that flat-out didn’t do what we intended, and having our assumptions checked automatically before git push keeps those misunderstandings out of the master sources. But it turns out that there are also other benefits of cultivating a habit of unit-testing.

For one thing, a healthy collection of unit-tests can act as documentation for the code under test, showing its expected behaviour and suggesting which parts of its behaviour are intended and essential, and which are accidental and possibly unwanted. Unlike many other forms of documentation (including, uh, the form of the blog), documentation-by-unit-test is guaranteed to be up-to-date, as the build breaks any time it doesn’t match the implementation.

Another benefit of unit-testing is that it acts as a forcing-function for modularity: if it’s hard to see how to test your code one part at a time, because you can’t get in between the parts, then that’s a warning sign that the parts are more tightly-coupled is healthy for maintainability.

There is, though, a trade-off there, a judgement call to be made. In C++ there’s a cost associated with virtual functions (which are needed for mocking and for test doubles in general), and there’s also a human cost associated with the added complexity of introducing test doubles for every single class (even if they’re systematically named and located, that’s still doubling the number of classes in any library). But in general the design of the system should offer enough modularity to provide an adequate sprinkling of seams for introducing test doubles, which means that in general extracting an interface solely so that it can be mocked-out, is frowned upon. In Javascript, the performance cost part doesn’t apply (all functions are, in C++ terms, “virtual”, and so all dependencies can be mocked), but it may still be worth choosing carefully how large the “unit” in each unit-test is, with the overarching goal of retaining readability of the codebase as a whole. That is to say, the “unit” might most easily be more than one class, or more than one file.

Judgement is also needed when writing the tests themselves. It’s all too easy to write tests that are either too precise and thus too fragile against correct changes to the code, or too vague and thus capable of missing incorrect changes. One of our developers was unlucky enough to encounter a bogus test in the very first change he made to our C++ codebase: adding some debug logging of the imp’s current list of servers. A unit-test started failing because getServer() was being called “too many” times. But pretty clearly, it wasn’t part of the external contract of the class how many times it called getServer() – the test was too precise. The mantra should be: test the functionality of the unit, not the implementation. If I want to know whether the implementation is the same as it was yesterday, I’ll use diff – I want the tests to tell me whether it’s fulfilling its external contract the same way it was yesterday, following any changes to its implementation. Practice makes perfect, here: the more tests you write, the better you get at judging these things.

There’s a school of thought that says that tests are so important that you should write them first, before the implementation: test-driven development. In our experience that works well for some tasks – self-contained, purely algorithmic ones – and less well for others. For instance, when writing driver code, the largest source of bugs is an incorrect mental model of how the hardware works: but as the same developer writes the tests then the code, it’s too easy for the tests and the code to be consistent with each other but both wrong. Where it does work, though, it works brilliantly: the imp’s memory allocator was written test-first – and, as it’s a well-defined and well-understood problem, writing the tests was relatively easy.

Test-driven development is as much about improving the quality of the tests as it is of the code. If you know (or think you know) that some particular functionality will be needed in the implementation, but the tests you’ve written so far haven’t exercised it yet, it does make you stop and think: how do I write a test for that – how do I get the system into that state? Either it’s forcing you to come up with a test you didn’t realise you needed – or, even better, it’s telling you that your expected elaboration wasn’t needed at all.

All told, I don’t think it’s a huge exaggeration to say, “If you don’t have tests, you don’t really have the code” – because without the tests, you can’t do anything with it: the path to being able to refactor it, is nearly as long as the path to writing it again from scratch.

System tests

Unit tests are good at ensuring that individual modules fulfill their contracts. But the task of writing quality software is, sadly, still not done at that point: assembling these contracts together to form an overall system is still a human process, fertile with opportunities for human error. Unit testing reduces the risk of inter-module bugs, but cannot eliminate it altogether.

So at some point you’re going to have to test larger parts of the system. Michael Feathers’ rules of thumb as to what is a unit test and what isn’t, are widely quoted: the key distinction being, a unit test should be basically instantaneous and invisible, at least when successful. System tests are a bit more heavyweight, and might be more involved to set up: for instance, target hardware might be needed.

In theory, system testing proceeds via structural induction: prove that component A works, then prove that A+B works, then A+B+C, and so on until you’ve tested the whole software system. In practice, it usually suffices to test major components, and then the system as a whole: for instance, the imp hardware, then the imp server, then end-to-end tests of the whole enchilada.

Just as unit tests can be viewed as documentation for the code, system tests can be viewed as high-level specifications of the entire system’s behaviour: “when an imp contacts the server, it’s sent the appropriate Squirrel for the device it’s plugged into”.

Static analysis

The big idea with static analysis (and, for that matter, dynamic analysis), is this: for any property that your code must have in order to be correct, it’s more reliable to have an automated check that your code has that property, than it is to rely on every programmer to manually keep it in mind the whole time. For instance, Javascript: The Good Parts – a book conspicuously much thinner than “Javascript: The Definitive Guide” – is really about avoiding language features that cause misleading or buggy code. Most of the traps the author describes, can be detected by the JSLint or JSHint static-analysis tools; running one of those tools before committing, can thus keep questionable or hard-to-maintain constructs out of the codebase.

In the wild-west days of C, similar coding traps were caught by a program called lint – though when C++ came along,

Lots of warnings:

g++ -Wall -Wextra -Werror -Wundef -Wno-unused-parameter -Woverloaded-virtual -Wlogical-op -fstrict-enums -Wno-long-long -Wpointer-arith -Wnon-virtual-dtor -Wno-sign-conversion -Wunused-but-set-variable

it designed out most of the need for lint, and incorporated much of the rest into the compiler. Some of it is in the compiler only in optional warnings or errors – so we always run GCC in a mode with lots of warnings and errors. (On projects that also use other compilers – MSVC for example – it’s a good idea to turn on all the warnings in all of them, as every compiler detects different problems. But here at EI, all our targets are GCC-based.) We’re happy to keep looking, but we’ve yet to find a lint-like tool for C++ that finds real bugs.

Other languages have their own static-analysis tools: Erlang has Dialyzer, and XCode has some stuff for Objective-C. Though more are always welcome: for instance, code written against node.js makes widespread use of callbacks as part of asynchronous APIs. So it’s a correctness constraint that any routine that takes a parameter called cb, must call it exactly once (not twice, not no times) on every path through the function. That should, in theory, be amenable to automated testing (at least for most easy cases).

Dynamic analysis

Dynamic analysis tools also tend to be language-specific. In C++, the use of Valgrind is well-known; Helgrind, bundled with Valgrind, can be awkward to use but can also be invaluable; and mutrace and perf are helpful at that stage of project maturity when the question becomes, “yes, yes, it works, but why isn’t it faster?” (That’s usually a pretty late stage of maturity, on the timeless basis that it’s easier to optimise correct code, than it is to correct optimised code.)

The one act of dynamic analysis that all types of code can enjoy, though, is code-coverage analysis. Code-coverage analysis is the ugly cousin of test-driven development, but is good enough for a rainy weekend in Norfolk. Developing while keeping an eye on the coverage statistics is almost, but not quite, as reliable as test-driven development as a method of ensuring that your code has good-quality tests.

Continuous integration

We use the very splendid Jenkins continuous-integration server (autobuilder). Its main responsibility is running, after each commit, all the tests and analyses that would be too time-consuming for developers to run before each commit. Or too awkward – Jenkins runs all the system tests that require special hardware, and also runs builds on all supported host platforms.

The quality bar for pushes to master, is clean runs on all of these test suites; any failures are stop-the-line emergencies. If a build or tests is failing, the very next push must be the fix, or other developers can’t continue pushing (because they can’t know whether their own work passes that test or not). This is usually known as “do not commit on red” – although with Git, it’s actually the “push”, not “commit”, operation that’s the relevant one. Because of that, a failing build means nobody else can make progress, so fixing it is viewed as an emergency.

Using Pivotal

We use Pivotal Labs’ Pivotal Tracker to keep track of work items. As Electric Imp is spread out over several time-zones, it’s important to have an online tool that we can use to look at the status of projects or tasks without having to bother actual human beings. We looked at a few online Agile tools, and the simplicity and easy visibility of Pivotal made it definitely the best.

For those who haven’t seen it, it arranges tasks into (basically) three columns: “current”, “backlog”, and “icebox”. “Current” is for the current sprint, “backlog” is for upcoming sprints, and “icebox” for everything else. Tasks can be moved around and re-ordered by drag-and-drop. A task is either a user-story, a “chore” (which we use for refactoring or technical-debt tasks that aren’t end-user-visible), or a bug.

We use this system to work on tasks in priority order, as prioritised by the product owner:

Anyone can add new stories, chores, or bugs to a “new tasks here” section in the icebox.
The product owner has a think about the new task (if it’s a story or a bug), and prioritises it either into the backlog, or into a “nice to haves” section in the icebox.
If it’s a chore, the developers prioritise it between themselves (typically, just before a story which would be aided by the cleanup, or which would need re-doing if done before the cleanup).
As developers finish up previous tasks, they pick the topmost one in the backlog each time. (We don’t really deal in “sprints” like canonical Agile – it’s more task-by-task, like kanban.)

Using Git

Unlike, say, Subversion, Git makes having a neat project history achievable. So we try to achieve it.

Bow-shaped feature branches let us view individual features either entire or as composed of a series of patches. That’s good for cherry-picking, it’s good for (ahem) reverting, and it’s good for rereading old commits to understand the origins of a piece of code (or of a bug). In general we name single-developer feature branches with the developer’s initials plus a very short indicator of the feature, such as pdh-discovery or rs-schema. (The developer’s initials serve as a kind of “watch out, I might rebase this branch at any moment”.)

The server and the client have different needs from their respective release processes, mainly because of “in-the-drawer” syndrome: previous client releases remain important in perpetuity, because a user could stick one in a drawer, forget about it, remember a year later, and then try and connect it to modern-day servers. The same doesn’t apply the other way around: previous server releases are, to a large extent, yesterday’s newspapers.

Ensure it all happens

We’re the engineering department; we’ve all been hired to do engineering. It’s pretty likely that we’re the last line of defence for things being well-engineered and done right: very few customers or project-managers have ever been heard to say, “Yes, yes, but can’t you take a bit longer and do it properly?” – indeed, many have been heard to say the exact opposite.

So for instance, adding automated tests is always an integral part of the development process – whether it’s done strictly before coding the functionality, as the test-driven school would say, or in a more intermingled way. It’s not an optional extra. Developers incorporate the building of tests into the original development estimates, and project owners don’t (shouldn’t) accept a story as “done” unless the testing is in-place too.

Being the engineering department, we don’t necessarily get to decide the balance between engineering goals and commercial goals (in fact, we probably don’t want to do that). But it does behove us to make sure that the people who do make those decisions, know what the engineering consequences will be.

Whitebait, Kleftiko, Ekmek Special

Wednesday, 22 October 2014

Software Practices At Electric Imp