
Friday, 27 January 2023

Self-hosted CI for Rust and C++ using Laminar

I’ve been a keen user of the Jenkins continuous-integration server (build daemon) since the days when it was called Hudson. I set it up on a shadow IT basis under my desk at Displaylink, and it was part of Electric Imp’s infrastructure from the word go. But on the general principle that you don’t appreciate a thing unless you know what the alternatives are like, I’ve recently been looking at Laminar for my homelab CI setup.

Laminar is a much more opinionated application, mostly in the positive sense of that term, than Jenkins. If Jenkins (as its icon suggests) is an obsequious English butler or valet, then Laminar is the stereotype of the brusque New Yorker: forgot to mark your $JOB.init script as executable? “Hey pal. I’m walking here.”

But after struggling occasionally with the complexity and sometimes opacity of Jenkins (which SSH key is it using?) the simplicity and humility of Laminar comes as a relief. Run a sequence of commands? That’s what a shell is for; it’s not Laminar’s job; Laminar just runs a single script. Run a command on a timer? That’s what cron (or anacron) is for; it’s not Laminar’s job; Laminar provides a trigger command that you can add to your crontab.

So what does it provide? Mostly sequencing, monitoring, statistics-gathering, artifact-gathering, and a web UI. (Unlike Jenkins, the web UI is read-only – but as it exposes the contents of all packages built using it, it’s still best to keep it secure.) I have mine set up to, once a week, do a rustup update and then check that all my projects still build and pass their tests with the newest nightly build (and beta, and stable, and the oldest supported version). It’s very satisfying to glance at the Laminar page and be reassured that everything still builds and works, even if I’ve been occupied with other things that week. (And conversely, on the rare occasions when a new nightly breaks something, at least I find out about it early, as opposed to it suddenly being in my way at a time when I’m filled with the urge to be writing some new thing.)

This blog post will cover:

  1. Installing Laminar
  2. CI for Chorale, a C++ package
  3. CI for Cotton, a Rust package
  4. Setting up Git to build on push
  5. CI for rustup

You should probably skim at least the first part of the C++ section even if you’re mostly interested in Rust, as it introduces some basic Laminar concepts and techniques.

By the way, it’s reasonable to wonder whether, or why, self-hosted CI is even a thing, considering that Github Actions offer free CI for open-source projects (up to a certain, but very generous, usage limit). One perfectly adequate answer is that the hobby of #homelab is all about learning how things work – learning which doesn’t happen if someone else’s cloud service is already doing all the work. But there are other good answers too: eventually (but not in this blog post) I’m going to want CI to run tests on embedded systems, STM32s and RP2040s and the like – real physical hardware, which is attached to servers here but very much not attached to Github’s CI servers. (Emulation can cover some of this, but not for instance driver work, where the main source of bugs is probably misconceptions about how the actual hardware works.) Yet a third reason is trust: for a released open source project there’s, by definition, no point in keeping the source secret. But part of the idea of these blog posts is to document best practices which commercial organisations, too, might wish to adopt – and they might have very different opinions on uploading their secret sauce to third-party services, even ones sworn to secrecy by NDA. And even a project determined to make the end result open-source, won’t necessarily be making all their tentative early steps out in the open. Blame Apple, if you like, for that attitude; blame their habit of saying, “By the way, also, this unexpected thing. And you can buy it today!”

1. Installing Laminar

This is the part where it becomes clear that Laminar is quite a niche choice of CI engine. It is packaged for both Debian and Ubuntu, but there is a bug in both the Debian and Ubuntu packages – it’s not upstream, it’s in the Debian patchset – which basically results in nothing working. So you could either follow the suggestions in the upstream bug report of using a third-party Ubuntu PPA or the project’s own binary .deb releases, or you could do what I did and install the broken Ubuntu package anyway (to get the laminar user, the systemd scripts, etc. set up), then build Laminar 1.2 from upstream sources and install it over the top.
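
The build-over-the-top route boils down to something like the following (a sketch only – check the upstream README for the authoritative dependency list, build options and current release tag):

sudo apt install laminar               # broken, but creates the laminar user and systemd units
sudo systemctl stop laminar
git clone https://github.com/ohwgiles/laminar.git
cd laminar
git checkout 1.2                       # or whatever the current release tag is called
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr .
make -j4
sudo make install                      # overwrites the packaged binaries
sudo systemctl start laminar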

Either way, if you navigate to the Laminar web UI (on port 8080 of the server) and see even the word “Laminar” on the page, your installation is working and you’ve avoided the bug.

The default install sets the home directory for the laminar user to /var/lib/laminar; this is the Debian standard for system users, but to make things less weird for some of the tools I’m expecting Laminar to run (e.g., Cargo), I changed it (in /etc/passwd) to be /home/laminar.
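
(If you’d rather not edit /etc/passwd by hand, usermod can make the same change – a sketch, assuming you stop the service first and don’t want the existing /var/lib/laminar contents moved:)

sudo systemctl stop laminar
sudo mkdir -p /home/laminar
sudo chown laminar: /home/laminar
sudo usermod -d /home/laminar laminar    # no -m: leave /var/lib/laminar where it is
sudo systemctl start laminar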

2. CI for Chorale, a C++ package

I use Laminar to do homelab builds of Chorale, a C++ project comprising a UPnP media-server, plus some related bits and pieces. For the purposes of this blog post, it’s a fairly typical C++ program with a typical (and hopefully pretty sensible) Autotools-based build system.

Laminar is configured, in solid old-school Unix fashion, by a collection of text files and shell scripts. These all live (in the default configuration) under /var/lib/laminar/cfg, which you should probably chown -R to you and also check into a Git repository, to track changes and keep backups. (The installation process sets up a user laminar, which should remain a no-login user.)
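
Something along these lines does the trick:

sudo chown -R $USER /var/lib/laminar/cfg
cd /var/lib/laminar/cfg
git init
git add .
git commit -m "Initial Laminar configuration"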

All build jobs execute in a context, which allows sophisticated build setups involving multiple build machines and so on; for my purposes everything executes in a simple context called local:

/var/lib/laminar/cfg/contexts/local.conf
EXECUTORS=1
JOBS=*

This specifies that only one build job can be run at a time (but it can be any job), overriding Laminar’s default context which allows for up to six executors: it’s just a Raspberry Pi, after all, we don’t want to overstress it.

2.1 C++: building a package

When it comes to the build directories for its build jobs, Laminar is much more disciplined (or, again, opinionated) than Jenkins: it keeps a rigid distinction between (a) the build directory itself, which is temporary, (b) a persistent directory shared between runs of the same job, the workspace, and (c) a persistent directory dedicated to each run, the archive. So the usual pattern is to keep the git checkout in the workspace (to save re-downloading the whole repo each time), then each run can do a git pull, make a copy of the sources into the build directory, do the build (leaving the workspace with a clean checkout), and finally store its built artifacts into the archive. All of which is fine except for the very first build, which needs to do the git clone. In Laminar this is dealt with by giving the job an init script (remember to mark it executable!):

/var/lib/laminar/cfg/jobs/chorale.init
#!/bin/bash -xe

git clone /home/peter/git/chorale.git .

as well as its normal run script (which also needs to be marked executable):

/var/lib/laminar/cfg/jobs/chorale.run
#!/bin/bash -xe

(
    flock 200
    cd $WORKSPACE/chorale
    git pull --rebase
    cd -
    cp -al $WORKSPACE/chorale chorale
) 200>$WORKSPACE/lock

cd chorale
libtoolize -c
aclocal -I autotools
autoconf
autoheader
echo timestamp > stamp-h.in
./configure
make -j4 release
cp chorale*.tar.bz2 $ARCHIVE/
laminarc queue chorale-package

There’s a few things going on here, so let’s break it down. The business with flock is a standard way, suggested in Laminar’s own documentation, of arranging that only one job at a time gets to execute the commands inside the parentheses – this isn’t necessarily likely to be an issue, as we’ve set EXECUTORS=1, but git would get in such a pickle if it happened that it’s a sensible precaution anyway. These protected commands update the repository from upstream (here, a git server on the same machine), then copy the sources into the build directory (via hard-linking, cp’s -l, to save time and space).

Once that’s done, we can proceed to do the actual build; the commands from libtoolize as far as make are the standard sequence for bootstrapping an Autotools-based C++ project from bare sources. (It’s not exactly Joel Test #2 compliant, mostly for Autotools reasons, although at least any subsequent builds from the same tree would be single-command.)

Chorale, as is standard for C++ packages, is released and distributed as a source tarball, which in this case is produced by the release target in the Makefile. The final cp command copies this tarball to the Laminar archive directory corresponding to this run of this job. (The archive directory will have a name like /var/lib/laminar/archive/chorale/33, where the “33” is a sequential build number.)

The final command, laminarc (for “laminar client”), queues-up the next job in the chain, testing the contents of the Chorale package. (The bash -xe at the top ensures that if the build process produces any errors, the script will terminate with an error and not get as far as kicking off the test job.)

That’s all that’s needed to set up a simple C++ build job – Laminar doesn’t have any concept of registering or enrolling a job; just the existence of the $JOB.run file is enough for the job to exist. To run it (remembering that the web UI is read-only), execute laminarc queue chorale and you should see the web UI spring into life as the job gets run. Of course, it will fail if any of the prerequisites (gcc, make, autoconf, etc.) are missing from the build machine; add them either manually (sudo apt-get install ...) or perhaps using Chef, Ansible or similar. Once the build succeeds (or fails) you can click around in the web UI to find the logs or perhaps download the finished tarball.
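
For the manual route, the usual Autotools suspects are something like this – the exact list depends on the project’s own dependencies:

sudo apt-get install build-essential autoconf automake libtool pkg-config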

2.2 C++: running tests

The next job in the chain, chorale-package, tests that the packaging process was successful (and didn’t leave out any important files, for instance); it replicates what the user of Chorale would do after downloading a release. This time the run script gets the sources not from git, but from the package created by (the last successful run of) the chorale job, so no init script is needed:

/var/lib/laminar/cfg/jobs/chorale-package.run
#!/bin/bash -xe

PACKAGE=/var/lib/laminar/archive/chorale/latest/chorale-*.tar.bz2
tar xf $PACKAGE

cd chorale*
./configure
make -j4 EXTRA_CCFLAGS=-Werror
make -j4 EXTRA_CCFLAGS=-Werror check
laminarc queue chorale-gcc12 chorale-clang

Like a user of Chorale, the script just untars the package and expects configure and make to work. The build fails if that doesn’t happen. This job also runs Chorale’s unit-tests using make check. This time, we build with the C++ compiler’s -Werror option, to turn all compiler warnings into hard errors which will fail the build.

If everything passes, it’s clear that everything is fine when using the standard Ubuntu C++ compiler. The final two jobs, kicked-off whenever the chorale-package job succeeds, build with alternative compilers just to get a second opinion on the validity of the code (and to avoid unpleasant surprises when the standard compiler is upgraded in subsequent Ubuntu releases):

/var/lib/laminar/cfg/jobs/chorale-gcc12.run
#!/bin/bash -xe

PACKAGE=/var/lib/laminar/archive/chorale/latest/chorale-*.tar.bz2
tar xf $PACKAGE

GCC_FLAGS="-Werror"

cd chorale*
./configure CC=gcc-12 CXX=g++-12
make -j4 CC=gcc-12 EXTRA_CCFLAGS="$GCC_FLAGS"
make -j4 CC=gcc-12 EXTRA_CCFLAGS="$GCC_FLAGS" GCOV="gcov-12" check

New compiler releases sometimes introduce new, useful warnings; this script is a good place to evaluate them before adding them to configure.ac. Similarly, the chorale-clang job checks that the sources compile with Clang, a compiler which has often found issues that G++ misses (and vice versa). Clang also has some useful extra features, the undefined-behaviour sanitiser and address sanitiser, which help to detect code which compiles but then can misbehave at runtime:

/var/lib/laminar/cfg/jobs/chorale-clang.run
#!/bin/bash -xe

PACKAGE=/var/lib/laminar/archive/chorale/latest/chorale-*.tar.bz2
tar xf $PACKAGE

cd chorale*
./configure CC=clang CXX=clang++

# -fsanitize=thread incompatible with gcov
# -fsanitize=memory needs special libc++
#
for SANE in undefined address ; do
    CLANG_FLAGS="-Werror -fsanitize=$SANE -fno-sanitize-recover=all"
    make -j4 CC=clang EXTRA_CCFLAGS="$CLANG_FLAGS"
    make -j4 CC=clang EXTRA_CCFLAGS="$CLANG_FLAGS" GCOV="llvm-cov gcov" tests
    make clean
done

If the Chorale code passes all of these hurdles, then it’s probably about as ready-to-release as it’s possible to programmatically assess.

3. CI for Cotton, a Rust package

All the tools and dependencies required to build a typical C++ package are provided by Ubuntu packages and are system-wide. But Rust’s build system encourages the use of per-user toolchains and utilities (as well as per-project dependencies). So before we do anything else, we need to install Rust for the laminar user – i.e., as the laminar user – which requires a moment’s thought, as we carefully set up laminar to be a no-login user. So we can’t just su to laminar and run rustup-init normally; we have to use sudo -u laminar to execute one command at a time from a normal user account.

So start by downloading the right rustup-init binary for your system – here, on a Raspberry Pi, that’s the aarch64-unknown-linux-gnu one. But then execute it (and then use it to download extra toolchains) as the laminar user (bearing in mind that rustup-init’s careful setup of the laminar user’s $PATH will not be in effect):

sudo -u laminar /home/peter/rustup-init
sudo -u laminar /home/laminar/.cargo/bin/rustup toolchain install beta
sudo -u laminar /home/laminar/.cargo/bin/rustup toolchain install nightly
sudo -u laminar /home/laminar/.cargo/bin/rustup toolchain install 1.56
sudo -u laminar /home/laminar/.cargo/bin/rustup +nightly component add llvm-tools-preview
sudo -u laminar /home/laminar/.cargo/bin/cargo install rustfilt

The standard rustup-init installs the stable toolchain, so we just need to add beta, nightly, and 1.56 – that last chosen because it’s Cotton’s “minimum supported Rust version” (MSRV), which in turn was selected because it was the first version to support the 2021 Edition of Rust, and that seemed to be as far back as it was reasonable to go. We also install llvm-tools-preview and rustfilt, which we’ll be using for code-coverage later.
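
(As an aside, with the 1.56 toolchain installed you can spot-check MSRV compliance by hand, before CI ever gets a look at it:)

rustup run 1.56 cargo test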

So to the $JOB.run scripts for Cotton. What I did here was notice that I’ve actually got a few different Rust packages to build, and they all need basically the same things doing to them. So I took advantage of the Laminar /var/lib/laminar/cfg/scripts directory, and made all the infrastructure common among all the Rust packages. When running a job, Laminar arranges that the scripts directory is on the shell’s $PATH (and note that it’s in the cfg directory, so will be captured and versioned if you set that up as a Git checkout). This means that, as far as Cotton is concerned – after an init script that’s really just like the C++ one:

/var/lib/laminar/cfg/jobs/cotton.init
#!/bin/bash -xe

git clone /home/peter/git/cotton.git .

– the other build scripts come in pairs: one that’s Cotton-specific but really just runs a shared script which is generic across projects, and then the generic one which does the actual work. We’ll look at the specific one first:

/var/lib/laminar/cfg/jobs/cotton.run
#!/bin/bash -xe

BRANCH=${BRANCH-main}
do-checkout cotton $BRANCH
export LAMINAR_REASON="built $BRANCH"
laminarc queue cotton-doc BRANCH=$BRANCH \
         cotton-grcov BRANCH=$BRANCH \
         cotton-1.56 BRANCH=$BRANCH \
         cotton-beta BRANCH=$BRANCH \
         cotton-nightly BRANCH=$BRANCH

The assignment to BRANCH is a Bash-ism which means, “use the variable $BRANCH if it exists, but if it doesn’t exist, default to main”. This is usually what we want (in particular, a plain laminarc queue cotton will build main), but making it flexible will come in handy later when we build the Git push hook. All the actual building is done by the do-checkout script, and then on success (remembering that bash -xe means the script gets aborted on any failures) we go on to queue all the downstream jobs. Note that when parameterising jobs using laminarc’s VAR=VALUE facility, each VAR applies only to one job, not to all the jobs named.
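
So, for example, kicking off a one-off build of a feature branch by hand is just (branch name hypothetical):

laminarc queue cotton BRANCH=some-feature-branch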

The do-checkout script is very like the one for Chorale, including the flock arrangement to serialise the git operations, and differing only in that it takes the project and branch to build as command-line parameters – and of course includes the usual Rust build commands instead of the C++/Autotools ones. (This time we can take advantage of rustup-init’s (Cargo’s) $PATH setup, but only if we source the environment file directly.)

/var/lib/laminar/cfg/scripts/do-checkout
#!/bin/bash -xe

PROJECT=$1
BRANCH=$2
# WORKSPACE is predefined by Laminar itself

(
    flock 200
    cd $WORKSPACE/$PROJECT
    git fetch
    git checkout $BRANCH
    git pull --rebase
    cd -
    cp -al $WORKSPACE/$PROJECT $PROJECT
) 200>$WORKSPACE/lock

source $HOME/.cargo/env
rustup default stable

cd $PROJECT
cargo build --all-targets
cargo test

Notice that this job explicitly uses the stable toolchain, minimising the chance of version-to-version breakage. We also want to test on beta, nightly, and MSRV though, which is what three of those downstream jobs are for. Here I’ll just show the setup for nightly, because the other two are exactly analogous. Again there’s a pair of scripts; firstly, there’s the specific one:

/var/lib/laminar/cfg/jobs/cotton-nightly.run
#!/bin/bash -xe

exec do-buildtest cotton nightly ${BRANCH-main}

Really not much to see there. All the work, as before, is done in the generic script, which is parameterised by project and toolchain:

/var/lib/laminar/cfg/scripts/do-buildtest
#!/bin/bash -xe

PROJECT=$1
RUST=$2
BRANCH=$3
SOURCE=/var/lib/laminar/run/$PROJECT/workspace

(
    flock 200
    cd $SOURCE/$PROJECT
    git checkout $BRANCH
    cd -
    cp -al $SOURCE/$PROJECT $PROJECT
) 200>$SOURCE/lock

source $HOME/.cargo/env
rustup default $RUST

cd $PROJECT
cargo build --all-targets --offline
cargo test --offline

Here we lock the workspace again, just to avoid any potential clashes with a half-finished git update, but we don’t of course do another git update – we want to build the same version of the code that we just built with stable. For similar reasons, we run Cargo in offline mode, just in case anyone published a newer version of a dependency since we last built.

That’s the cotton-beta, cotton-nightly, and cotton-1.56 downstream jobs dealt with. There are two more: cotton-doc and cotton-grcov, which deal with cargo doc and code coverage respectively. The documentation one is the more straightforward:

/var/lib/laminar/cfg/jobs/cotton-doc.run
#!/bin/bash -xe

exec do-doc cotton ${BRANCH-main}

And even the generic script (parameterised by project) is quite simple:

/var/lib/laminar/cfg/scripts/do-doc
#!/bin/bash -xe

PROJECT=$1
BRANCH=$2
SOURCE=/var/lib/laminar/run/$PROJECT/workspace

(
    flock 200
    cd $SOURCE/$PROJECT
    git checkout $BRANCH
    cd -
    cp -al $SOURCE/$PROJECT $PROJECT
) 200>$SOURCE/lock

source $HOME/.cargo/env
rustup default stable

cd $PROJECT
cargo doc --no-deps --offline
cp -a target/doc $ARCHIVE

It much resembles the normal build, except for running cargo doc instead of a normal build. On completion, though, it copies the finished documentation into Laminar’s $ARCHIVE directory, which makes it accessible from Laminar’s web UI afterwards.

The code-coverage scripts are more involved, largely because I couldn’t initially get grcov to work, and ended up switching to using LLVM’s own coverage tools instead. (But the scripts still have “grcov” in the names.) Once more the per-project script is simple:

/var/lib/laminar/cfg/jobs/cotton-grcov.run
#!/bin/bash -xe

exec do-grcov cotton ${BRANCH-main}

And the generic script does the bulk of it (I cribbed this recipe from the rustc book, q.v.; I didn’t come up with it all myself):

/var/lib/laminar/cfg/scripts/do-grcov
#!/bin/bash -xe

PROJECT=$1
BRANCH=$2
SOURCE=/var/lib/laminar/run/$PROJECT/workspace

(
    flock 200
    cd $SOURCE/$PROJECT
    git checkout $BRANCH
    cd -
    cp -al $SOURCE/$PROJECT $PROJECT
) 200>$SOURCE/lock

source $HOME/.cargo/env
rustup default nightly

cd $PROJECT

export RUSTFLAGS="-Cinstrument-coverage"
export LLVM_PROFILE_FILE="$PROJECT-%p-%m.profraw"
cargo test --offline --lib
rustup run nightly llvm-profdata merge -sparse `find . -name '*.profraw'` -o cotton.profdata
rustup run nightly llvm-cov show \
    $( \
      for file in \
        $( \
            cargo test --offline --lib --no-run --message-format=json \
              | jq -r "select(.profile.test == true) | .filenames[]" \
              | grep -v dSYM - \
        ); \
      do \
        printf "%s %s " -object $file; \
      done \
    ) \
  --instr-profile=cotton.profdata --format=html --output-dir=$ARCHIVE \
  --show-line-counts-or-regions --ignore-filename-regex='/.cargo/' \
  --ignore-filename-regex='rustc/'

Honestly? Bit of a mouthful. But it does the job. Notice that the output directory is set to Laminar’s $ARCHIVE directory so that, again, the results are viewable through Laminar’s web UI. (Rust profiling doesn’t produce branch coverage as such, but “Region coverage” – which counts what a compiler would call basic blocks – amounts to much the same thing in practice.) The results will look a bit like this:

Why yes, that is very good coverage, thank you for noticing!

4. Setting up Git to build on push

So far in our CI journey, we have plenty of integration, but it’s not very continuous. What’s needed is for all this mechanism to swing into action every time new code is pushed to the (on-prem) Git repositories for Chorale or Cotton.

Fortunately, this is quite straightforward – or, at least, good inspiration is available online. Pushes to the Git repository for Cotton can be hooked by adding a script as hooks/post-receive under the Git server’s cotton.git directory (the hooks directory is probably already there). In one of those Git features that at first makes you think, “this is a bit over-engineered”, but then makes you realise, “wait, this couldn’t actually be made any simpler while still working in full generality”, the Git server passes to this script, on its standard input, a line for every Git “ref” being pushed – for these purposes, refs are mostly branches – along with the Git revisions at the old tip and new tip of the branch.

Laminar comes with an example hook which builds every commit on every branch pushed. I admire this but don’t follow it; it’s great for preserving bisectability, but seems like it would lead to a lot of interactive rebasing every time a feature branch is rebased on a later main – not to mention a lot of building by the CI server. So the hook I actually use just builds the tip of each branch:

git/cotton.git/hooks/post-receive
#!/bin/bash -ex

while read oldrev newrev ref
do
    if [ "${ref:0:11}" == "refs/heads/" ];
    then
     export BRANCH=${ref:11}
     export LAMINAR_REASON="git push $BRANCH"
     laminarc queue cotton BRANCH=$BRANCH
    fi
done

The LAMINAR_REASON appears in the web UI and indicates which branch each run is building:

5. CI for rustup

The final piece of the puzzle, at least for today, is the continuous integration of Rust itself. As new nightlies, and betas, and even stable toolchains come out, I’d like it to be a computer’s job, not that of a person, to rebuild everything with the new version. (Especially if that person would be me.)

This too, however, is straightforward with all the infrastructure put in place by the rest of this blog post. All that’s needed is a new job file which runs rustup-update:

/var/lib/laminar/cfg/jobs/rustup-update.run
#!/bin/bash -ex

export LAMINAR_REASON="rustup update"
source $HOME/.cargo/env
rustup update
laminarc queue cotton assay sparkle

The rustup update command updates all the toolchains; once that is done, the script queues-up builds of all the Rust packages I build. I schedule a weekly build in the small hours of Thursday morning, using cron:

edit crontab using “crontab -e”
0 3 * * 4 LAMINAR_REASON="Weekly rustup" laminarc queue rustup-update

With a bit of luck, this means that by the time I sit down at my desk on Thursday morning, all the jobs have run and Laminar is showing a clean bill of health. As I’ve been playing with Rust for quite a long elapsed time, but really only taking it seriously in quite occasional bursts of energy, having everything kept up-to-date automatically during periods when I’m distracted by a different shiny thing is a real pleasure.

Monday, 16 January 2023

Hub-and-spoke backups using Syncthing

Syncthing is a synchronisation tool which can keep files and directories synchronised across several computers. This post is about how I set it up for secure backups of various Linux boxes, including off-site backup. It is based on Syncthing’s own excellent documentation and mostly documents the particular configuration I use to maintain various security constraints:

  • No computer holds, at rest, credentials to log into any other: SSH private keys are stored encrypted, with a passphrase required for usage.
  • No port except (somewhat unavoidably) SSH is exposed to the internet: Syncthing seems well-maintained and safe, but there’s no point offering an attack surface of both Syncthing and SSH when I can just hide Syncthing inside SSH.
  • No computer holds, at rest, unencrypted files: the computers involved all have either full-disk encryption or home-directory encryption enabled. Information needing extra security (financial data etc.) is kept in an additional layer of encryption using EncFS and/or encfs-agent (a quick sketch of which follows this list), and only opened when required; Syncthing synchronises the encrypted form.
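
(For what it’s worth, the EncFS dance is brief – this is a sketch with made-up directory names; only the encrypted directory is ever offered to Syncthing:)

encfs ~/Private.encfs ~/Private      # prompts for the EncFS passphrase
# ... read or update the sensitive files via ~/Private ...
fusermount -u ~/Private              # close it again; only ~/Private.encfs gets synchronised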

I’ve got Syncthing installed on a total of four computers, in a hub-and-spoke manner:

  • A Linux desktop, where I do most of my work;
  • A Linux laptop (it’s a Lenovo Thinkpad);
  • A Raspberry Pi that’s my home server;
  • A second Raspberry Pi at a friend’s house as an off-site backup.

For the purposes of this post, we’ll call these boxes desktop, laptop, central, and offsite respectively. The hub of the hub-and-spoke setup is central; the other three only ever synchronise with central and not with each other. Both laptop and desktop connect inbound to central; central connects outbound to offsite. Each interaction is a full two-way sync, but in my case almost all new data flows from desktop and laptop to central and then offsite; almost no new data flows in the other direction.

Syncthing has its own package repository for Debian/Ubuntu, so installing it (on both clients and servers) involves following the instructions at https://apt.syncthing.net; there are packages for ARM64 Ubuntu (the Raspberry Pi) as well as AMD64 Ubuntu (for desktop and laptop). It typically runs in the background and is operated using a web GUI on port 8384; by default this (quite rightly) listens only for connections from localhost. On central, Syncthing starts as a service at boot time, and on the other boxes it is manually run when needed – this is a backup-on-demand setup, not automated backup.
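
At the time of writing those instructions boil down to roughly the following – treat the key and repository details as a sketch and defer to the live page – plus enabling the per-user systemd service on central:

sudo curl -o /usr/share/keyrings/syncthing-archive-keyring.gpg https://syncthing.net/release-key.gpg
echo "deb [signed-by=/usr/share/keyrings/syncthing-archive-keyring.gpg] https://apt.syncthing.net/ syncthing stable" | sudo tee /etc/apt/sources.list.d/syncthing.list
sudo apt update
sudo apt install syncthing

# on central only: start Syncthing (as user peter) at boot
sudo systemctl enable --now syncthing@peter.service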

Syncthing apparently also runs on Windows and Macintosh, but all the boxes I need to use it on run Linux.

Hub setup

The server, central, runs headless. To interact with the Syncthing web GUI from desktop, I need to SSH to central and use port forwarding. A script, on desktop, called ssh-central provides this:

ssh-central (1/2)
#!/bin/bash

# 8000: central syncthing UI

exec ssh -A -X -t \
     -L localhost:8000:localhost:8384 \
     peter@central

Alternatively, options could be added in ~/.ssh/config for host central, but doing it this way means that I can still SSH to central normally, with no port forwards, when needed.
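
(If you did want the config-file route, a separate host alias keeps the plain ssh central behaviour intact – a sketch, with a made-up alias name:)

Host central-sync
    HostName central
    User peter
    ForwardAgent yes
    ForwardX11 yes
    RequestTTY yes
    LocalForward localhost:8000 localhost:8384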

Once that script is running and has connected (offering a shell prompt on central in your terminal window), going to https://localhost:8000 in your web browser will bring up Syncthing’s user interface; again consult Syncthing’s own documentation for what you’re looking at there.

By default Syncthing exposes the synchronisation protocol to all comers on a given port; this isn’t horrific as there is mutual authentication and a key exchange required before any actual synchronisation happens – but, on the general principle of minimising attack surface, I set it up to work like the GUI: available only to localhost, and requiring anyone else to set up SSH port forwarding in order to connect. From the Actions menu choose Settings, then Connections, and set the “Sync Protocol Listen Addresses” to tcp4://127.0.0.1:22000 and turn off NAT traversal, local discovery, and global discovery:

Adding “dial-in” spokes (workstations)

Although I’m using Syncthing in, effectively, a client/server mode, the protocol itself treats all connections as symmetrical peers. This means that two ports need to be forwarded over SSH by desktop: one for desktop’s Syncthing to connect to central’s Syncthing, and another in the opposite direction. (Honestly, I don’t see why it can’t do everything it needs over just one bidirectional TCP stream, but all the documentation suggests that it wants two.)

So this means that each workstation needs its own port allocated on the server, to be the listening address for server-to-workstation connections. I keep track of these in a file called lordwarden.txt on central...

lordwarden.txt
          sync-port
desktop   22001
laptop    22002
offsite   22003

...but that’s because I have an inveterate fondness for appalling puns, and you should pick a more sensible filename.

Once you’ve picked a port, you can write the do-syncthing script for each workstation. Here’s the one on desktop:

do-syncthing (desktop)
#!/bin/bash -x

ssh -N \
   -L localhost:22001:localhost:22000 \
   -R localhost:22001:localhost:22000 \
   peter@central &
SSHPID=$!
syncthing
kill $SSHPID
echo Done!

This sets up the SSH tunnel and starts the local (to desktop) instance of Syncthing. While this is running in a terminal, going to https://localhost:8384/ in your web browser will show the Syncthing user interface.

Here you need to set up the “Sync Protocol Listen Addresses” just like above, then (from the main screen) choose “Add Remote Device”. You’ll need to enter central’s Syncthing “Device ID”, which you can find in central’s user interface by opening the Actions menu and choosing “Show ID”. (I always just copy and paste the textual form and don’t bother with the QR code.) Then in the “Advanced” tab of the “Add Device” dialog, enter tcp://127.0.0.1:22001 – the port you chose for it in lordwarden.txt – under “Addresses” and then click “Save” to confirm.

It may take Syncthing a little while to successfully connect, but eventually central’s Syncthing will ask you to confirm the new peer’s identity, and the two boxes will be connected.

By default Syncthing just synchronises a default directory under $HOME, but you can add others and choose which peers to share them with. One useful directory to share is $HOME/.local/share/evolution/mail/local, which is the Evolution mail client’s “On This Computer” top-level maildir directory. The maildir format (unlike mbox) is very amenable to being shared in this way without risking corruption – though Evolution doesn’t always notice when the currently-selected folder is changed behind its back, and you have to select a different folder and then click back again.

Once you’ve finished setting-up and synchronising directories, go to the “Actions” menu on desktop and choose “Shutdown”; the do-syncthing shell-script will then exit cleanly.

On subsequent invocations of do-syncthing, there is of course nothing further to configure, and Syncthing will immediately perform the synchronisation, offering progress reports through its user interface.

Adding the “dial-out” spoke (offsite)

This is just a little harder, because you aren’t sitting in front of the box, because it’s off-site.

For the purposes of this post I’m going to gloss over the process of setting up a Raspberry Pi, installing it at a friend’s house, arranging that it has a port tunnelled through the friend’s firewall, and signing up with a free dynamic DNS service such as DuckDNS (no commercial relationship, just a satisfied customer) to give it a resolvable domain-name that will stay updated whenever your friend’s internet provider changes their IP address. So I’ll just assume that you’ve arranged that offsite.duckdns.org resolves to a box running an SSH server accessible on port 20202. Bear in mind that if you’ve set up your off-site Raspberry Pi to need a disk-encryption passphrase on every boot, you’ll need to visit your friend’s house to re-enter the passphrase every time they have a power cut!

(You could, of course, equally well use a hosting service or cloud service for your off-site backup – though, depending on how much you trust your hosting provider, you might want to look into Syncthing’s untrusted devices functionality. At time of writing, that is marked as being for beta/testing only.)

The security constraints at the top of this post exclude the possibility of central holding, at rest, any credentials to log in to offsite. So to synchronise the two, you’ll need to sit at desktop or laptop with ssh-agent running, SSH to central with ssh-agent forwarding (the -A option visible in the ssh-central script), and then SSH again from central to offsite.

The script that connects from central to offsite needs to forward two Syncthing ports (as noted in lordwarden.txt) plus the port for the Syncthing user interface on offsite:

sync-offsite (central)
#!/bin/bash

exec ssh -L 127.0.0.1:22003:127.0.0.1:22000 \
    -R 127.0.0.1:22003:127.0.0.1:22000 \
    -L 127.0.0.1:8040:127.0.0.1:8384 \
    -p 20202 peter@offsite.duckdns.org syncthing

But of course that only forwards the user interface as far as central; to get it all the way to desktop where you can actually see it, you need to add another port forward to desktop’s ssh-central script:

ssh-central (2/2)
#!/bin/bash

# 8000: central syncthing UI
# 8040: offsite syncthing UI

exec ssh -A -X -t \
    -L localhost:8000:localhost:8384 \
    -L localhost:8040:localhost:8040 \
    peter@central

Armed with all of this, you can sit at desktop, run ssh-central to log in to central, and once there run sync-offsite to reach offsite. At this point offsite’s Syncthing can be reached in your web browser on desktop as https://localhost:8040, and you can set about configuring it. (But if offsite is a Raspberry Pi 3, it will take a little while to respond to the initial requests.) Set its “Sync Protocol Listen Addresses” to tcp4://127.0.0.1:22000 as above, and then add central as a new device, entering tcp://127.0.0.1:22000 as its “Addresses”. You might also need to tell central that offsite’s address is tcp://127.0.0.1:22003. Again it might take a little while to connect, but once it has you can tell it which folders to share and let the synchronisation proceed to completion. Once everything is synchronised, go to the “Actions” menu on offsite and choose “Shutdown”; the sync-offsite shell-script will then exit cleanly.

Again, any further invocations of ssh-central and sync-offsite will immediately start the synchronisation process without any more configuration needed.

Saturday, 9 April 2022

Raspberry Pi as home server, with Ubuntu 22.04, USB boot, and headless FDE

     
Recently, and it might be said precipitately, Arch Linux ARM ended all support for ARMv5 architectures, including the Tonidoplug (Marvell Kirkwood) device which was acting as my home server. So after looking around a little for cheap, low-power ARM servers, I landed on the Raspberry Pi 3, largely because I already had a couple in a drawer. (As indeed does everybody else, which bodes well for avoiding surprise deprecations by distros.)

I thought I’d take the opportunity to improve the security story while I was at it, including adding Full Disk Encryption (FDE) using LUKS. But how is that to work on a headless (no screen or keyboard attached) server? How is the decryption key or passphrase to be entered?

Perhaps I could SSH into the box and type in the passphrase. But if it needs the passphrase to decrypt the root filesystem, how is the SSH server running? The answer is that it’s possible to set up Dropbear, a lightweight SSH server, so that it runs in the kernel’s initrd before the root filesystem is mounted. Ubuntu provides handy packages for dropbear-initramfs and cryptsetup-initramfs to make this easier.

The usual (and easiest) way to set up FDE on an x86 or x86_64 server is by using the normal Ubuntu installer. But installing Ubuntu on a Raspberry Pi involves its own custom installer, and there isn’t an easily-ticked option that just does it for you. So you’ll need to do it manually, as outlined in the following description. (Which might also be of use to anyone upgrading any other Ubuntu installation to use FDE – though be warned that it’s not possible to upgrade in-place, and we manage it on the Raspberry Pi because we’re also moving to a new root filesystem.)

To follow these steps you’ll need:

  1. A Raspberry Pi 3B. (A Raspberry Pi 4, with its USB3 controller, would have higher performance – but they’re hard to get hold of at the moment.)
  2. A proper power supply for the Raspberry Pi. Raspberry Pis are notoriously picky about power supplies, especially when driving USB bus-powered devices, so I used the branded Raspberry Pi one. Note that the Raspberry Pi 4 would need a different power supply.
  3. An existing PC able to write images to a microSD card; I used Ubuntu but other Linuxes, Windows and Macintosh can also be made to work. (Many laptops, and some monitors, have SD slots, but you might still need a microSD-to-SD or microSD-to-USB adaptor.)
  4. A microSD card – I used an 8GB one. It’s probably only needed during setup, so you could borrow one from a phone or MP3 player – but it will be erased during the process, so don’t use a vital one. (The “probably” is because, if your USB SSD is not quite Raspberry Pi compatible, you might need to keep it as a first-stage bootloader.)
  5. A USB SSD, to be the main storage for your new server – I used a Samsung T7. (For some reason I assumed it would be 2.5in-sized, but of course it’s not, it’s smaller – about the size of the Raspberry Pi’s own PCB – which will make the overall thing much neater and tidier.)

Out of the box, the Raspberry Pi will only boot from a microSD card, so the first course of action is to enable booting from USB devices. This might not appear essential – microSD cards are very cheap, and it could easily be left in even with most of the data on the USB – but whether it’s the Raspberry Pi’s fault or that of the microSD card, running a Raspberry Pi from microSD has a bad reputation for long-term reliability. Similarly, using a device marketed as a USB SSD is likely to mean that it’s intended for heavier use than a USB thumb drive, as well as potentially being higher-performance.

Installing Ubuntu on a microSD card

Ubuntu have a good tutorial which you should just follow. The steps are basically these:

  1. Put the microSD card into the adaptor and insert into your PC.
  2. Install rpi-imager (on Ubuntu, sudo snap install rpi-imager) and run it.
  3. Click “Choose OS”, then “Other general purpose OS”, then “Ubuntu”, then pick Ubuntu Server 20.04 or 22.04 64-bit from the list. (If you want 22.04 and it’s not on the list yet, download ubuntu-22.04-preinstalled-server-arm64+raspi.img.xz from the Ubuntu release site and choose “Use custom” from the menu; you don’t need to un-XZ the img.xz file as rpi-imager will use the compressed version just fine.)
  4. Click “Choose Storage” then choose your microSD card from the list (carefully! important other disks that you don’t want wiped will also be listed).
  5. Write the image to the microSD card, then insert it back into the Pi and power-on the Pi to boot it.

The only wrinkle I found was discovering the Raspberry Pi’s IP address after it came up on the network for the first time. The suggested

arp -na | grep -i "b8:27:eb"

wasn’t finding it, even following

ping -b [my local broadcast address]

and the only answer seemed to be attempting SSH connections to addresses in my DHCP range in turn until one responded.
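
(If you ever need to automate that last resort, a crude loop over the subnet does the job – adjust the address range to match your own DHCP setup:)

for i in $(seq 1 254); do
    timeout 1 bash -c "echo > /dev/tcp/192.168.1.$i/22" 2>/dev/null \
        && echo "something answers on SSH at 192.168.1.$i"
done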

The next thing is to enable USB booting. There is a tutorial at thepi.io, but it is Raspbian-centric; for Ubuntu use, firstly note that on Ubuntu 20.04 (only), the config file to modify is not /boot/config.txt but /boot/firmware/usercfg.txt. Also, the vcgencmd command it mentions is not installed by default in Ubuntu, but it’s available via sudo apt install libraspberrypi-bin.

Settings in usercfg.txt are only applied to the actual non-volatile configuration at boot time, so the reboot in the middle of that tutorial isn’t optional. In summary, the steps are, for 20.04:

sudo apt update
sudo apt upgrade
echo program_usb_boot_mode=1 | sudo tee -a /boot/firmware/usercfg.txt
sudo reboot
... log in again and ...
sudo apt install libraspberrypi-bin
vcgencmd otp_dump | grep 17

or for 22.04:

sudo apt update
sudo apt upgrade
echo program_usb_boot_mode=1 | sudo tee -a /boot/firmware/config.txt
sudo reboot
... log in again and ...
sudo apt install libraspberrypi-bin
vcgencmd otp_dump | grep 17

Either way the final output should be the value 3020000a.

If you have several Raspberry Pi devices to deal with, boot this image on all of them at this stage to update the non-volatile configuration of each of them; that means you can just swap USB devices later.

You might also want to change the hostname and default username at this point from the current ubuntu@ubuntu; here’s an example for bob@porpentine:

sudo hostnamectl set-hostname porpentine
sudo adduser bob
sudo adduser bob sudo
sudo adduser bob admin

Don’t delete the ubuntu user until you’re sure that you can SSH in as bob and use sudo!

Moving the Ubuntu installation to an encrypted USB SSD

Much of this is based on Hamy’s (non-Raspberry-Pi-based) headless FDE tutorial, but with adaptations for the Raspberry Pi scenario.

  1. Plug the USB SSD into the Pi and check (using cat /proc/partitions) that it appears as /dev/sda. (If your SSD was supplied pre-formatted, you might have /dev/sda1 too; that’s fine as we’re going to delete it.)
  2. Now install the necessary initramfs packages:
    sudo apt install dropbear-initramfs cryptsetup-initramfs busybox-initramfs
  3. You’ll need to do a little setup for dropbear: first, find the SSH authorized_keys file corresponding to the keys/users that you’d like to be able to unlock the server, and copy it into the Dropbear initramfs directory: on 20.04 this is /etc/dropbear-initramfs:

    sudo cp .ssh/authorized_keys /etc/dropbear-initramfs/
    although on 22.04 it’s /etc/dropbear/initramfs:
    sudo cp .ssh/authorized_keys /etc/dropbear/initramfs/

    (Watch out, as the default Ubuntu 20.04 installation puts an empty file in ~/.ssh/authorized_keys – make sure you’re using a real one, probably scp’d on from an existing Linux installation).

    Now you’ll need to edit the configuration file – the available editor is nano. On Ubuntu 20.04 it’s:

    sudo nano /etc/dropbear-initramfs/config
    and on 22.04 it’s:
    sudo nano /etc/dropbear/initramfs/dropbear.conf

    In either case, uncomment the #DROPBEAR_OPTIONS= line and set it to:

    DROPBEAR_OPTIONS="-p 999 -s -j -k"

    This sets the listening port to 999 (from the default 22); doing this means that your SSH client won’t get confused that it’s connecting to a “different” SSH server on the same IP and port as the normal Ubuntu SSH server. The remaining options increase security by disabling password logins and port forwarding.

  4. Now you need to partition the USB SSD. On the Pi, run sudo fdisk /dev/sda. Using fdisk is beyond the scope of this note, but you want to delete any existing partitions and create two new ones: /dev/sda1, the same size as /dev/mmcblk0p1 (which will be small, 256Mbytes), FAT formatted, for the bootloader and ramdisk, plus /dev/sda2, as yet unformatted, for the rest of the disk. The resulting partition table (“p” command in fdisk) should look a bit like this:
    Command (m for help): p
    Disk /dev/sda: 931.53 GiB, 1000204886016 bytes, 1953525168 sectors
    Disk model: PSSD T7
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0x404755d2
    
    Device     Boot   Start        End    Sectors  Size Id Type
    /dev/sda1  *      2048     526335     524288   256M  c W95 FAT32 (LBA)
    /dev/sda2       526336 1953525167 1952998832 931.3G 83 Linux
    
    Filesystem/RAID signature on partition 1 will be wiped.

    This should look broadly similar to the partition table of the existing microSD card installation (fdisk -l /dev/mmcblk0):

    Disk /dev/mmcblk0: 7.41 GiB, 7948206080 bytes, 15523840 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0xf66f0719
    
    Device         Boot  Start      End  Sectors  Size Id Type
    /dev/mmcblk0p1 *      2048   526335   524288  256M  c W95 FAT32 (LBA)
    /dev/mmcblk0p2      526336 15523806 14997471  7.2G 83 Linux
  5. Now let’s actually make the encrypted block device that will contain the root filesystem. (This section is adapted from an Alpine Linux tutorial.) Lay out the device using that tutorial’s “optimised for security” settings:
    sudo cryptsetup -v -c aes-xts-plain64 -s 512 --hash sha512 --pbkdf pbkdf2 --iter-time 5000 --use-random luksFormat /dev/sda2

    You will need to enter (for the first time!) the decryption passphrase, and repeat it to guard against mistypings.

    If we just wanted one encrypted partition, we might be nearly done here. But to lay out multiple partitions inside the encrypted block device, we need to also use LVM (Logical Volume Manager). We create a “PV” (Physical Volume) inside the encrypted block device, then a “VG” (Volume Group) inside the PV, then two “LVs” (Logical Volumes) inside the VG. LVM can do much more sophisticated things than we’re using it for here!

    First, open the LUKS device:

    sudo cryptsetup luksOpen /dev/sda2 lvmcrypt

    – the passphrase is of course required. Now the encrypted version of the block device is available as /dev/mapper/lvmcrypt.

    Then, create the LVM PV and the VG inside it:

    sudo pvcreate /dev/mapper/lvmcrypt
    sudo vgcreate vg0 /dev/mapper/lvmcrypt

    Now we can create the partitions we actually want, as LVs inside that VG. We’ll create two: a swap partition and a root partition:

    sudo lvcreate -L 16G vg0 -n swap
    sudo lvcreate -l 100%FREE vg0 -n root

    We size the swap partition at 16 Gbytes, and the root partition as the whole rest of the device – note the tricksy difference between “-l” and “-L”. Our two new partitions appear as /dev/vg0/swap and /dev/vg0/root, as can be verified by sudo lvscan. (They also appear as /dev/mapper/vg0-swap and /dev/mapper/vg0-root.)

    Now we can actually do something with those partitions:

    sudo mkfs.ext4 /dev/vg0/root
    sudo mkswap /dev/vg0/swap

    To use the new root filesystem, we need to manually add entries to /etc/fstab and /etc/crypttab. So in /etc/fstab change the “/” entry to read:

    /dev/mapper/vg0-root / ext4 defaults 0 1

    and in /etc/crypttab add:

    lvmcrypt /dev/sda2 none luks,discard,initramfs

    The “initramfs” option is important – the initramfs setup scripts automatically include whatever is needed to unlock the current root filesystem at startup, but /dev/sda2 isn’t (at the time you’re executing these commands) the current root filesystem – so unless you go out of your way to tell them that unlocking /dev/sda2 matters, they will omit the components needed to unlock it!

    We need to copy the root filesystem into the new root filesystem; there are various ways to do this, but I just did:

    sudo mount /dev/vg0/root /mnt
    sudo rsync -xrpvltoD / /mnt/

    You’ll also need to edit the kernel command line to tell it where the root is: edit /boot/firmware/cmdline.txt to add a “root=” parameter – and, just to be sure, an “ip=dhcp” parameter:

    net.ifnames=0 dwc_otg.lpm_enable=0 console=serial0,115200 console=tty1 root=/dev/mapper/vg0-root rootfstype=ext4 elevator=deadline rootwait fixrtc ip=dhcp

    During experimentation, it might be convenient to set up two alternative cmdline.txt files – cmdssd.txt and cmdmmc.txt, say – so that switching back and forth can be done using a simple “cp” command such as can be found in an initramfs rescue shell!

  6. Remake the initramfs incorporating these changes (on a Raspberry Pi, running update-initramfs also copies the image to the firmware partition):
    sudo update-initramfs -u
  7. Check that your initramfs actually contains a crypttab – this caused me a lot of frustration on the way to figuring this out:
    unmkinitramfs /boot/firmware/initrd.img tmp-init/
    cat tmp-init/cryptroot/crypttab

    The output should be one line much like /etc/crypttab (for me, it had changed the device name to a UUID, but the effect is the same; use the blkid command to check whether it picked the right UUID).

  8. Now it’s the moment of truth – time to reboot the Pi and see whether it comes up with Dropbear waiting for your passphrase.
    sudo shutdown -r now

    Your SSH session will of course be disconnected. So, once the Raspberry Pi has rebooted (ethernet LEDs go off, then on again), connect to the initramfs Dropbear server we installed:

    ssh -p 999 root@192.168.168.103

    If all is well, you will see this, or something very like it:

    BusyBox v1.30.1 (Ubuntu 1:1.30.1-4ubuntu6.4) built-in shell (ash)
    Enter 'help' for a list of built-in commands.
    
    #

    (If all is not well, you might reluctantly need to attach a monitor and keyboard to the Raspberry Pi to see what went wrong.)

    But in the happy case, the Raspberry Pi has loaded its initramfs, but not mounted root yet. So you have very limited tools to play with, but the one thing you should have is the cryptroot-unlock command. This will ask you for your passphrase, open the root filesystem, and allow booting to proceed (your SSH connection will be closed). You’ll need to repeat this login and unlock process every time the Raspberry Pi reboots – and every time there’s a power-cut!

  9. As it stands, the Raspberry Pi is booting from microSD, but then mounting its root filesystem from the USB SSD. The final stage is to clean this up by letting the Raspberry Pi boot straight from USB. So we need the /boot/firmware partition to be on the USB SSD, as /dev/sda1. Because we arranged that /dev/sda1 and /dev/mmcblk0p1 are exactly the same size, you can just use dd to copy the filesystem across:
    sudo dd if=/dev/mmcblk0p1 of=/dev/sda1
    sudo umount /boot/firmware
    sudo mount /dev/sda1 /boot/firmware
    ls /boot/firmware

    If your list of files includes lots of “dtb” files and so on, everything went to plan. So edit /etc/fstab one last time, and change the /boot/firmware line to say:

    /dev/sda1 /boot/firmware vfat defaults 0 1

    Ubuntu 20.04 (only) adds a further wrinkle here that is not needed in Raspbian or Ubuntu 22.04 builds. So far we’ve made sure that the Raspberry Pi onboard firmware is happy to boot from USB, and we’ve made sure the kernel is happy that its root is on USB. But on a default Ubuntu 20.04 installation, U-Boot will still torpedo us. Unlike on Raspbian, the kernel= lines in /boot/firmware/usercfg.txt do not name the actual kernel, they name the U-Boot secondary bootloader. And U-Boot, at least in the Ubuntu 20.04 Raspberry Pi images, only knows how to boot from microSD. But fortunately we can bypass U-Boot altogether (it’s not clear what benefit it’s giving us here). This is done by editing /boot/firmware/usercfg.txt and

    • every time there’s a kernel= line, changing it to:
      kernel=vmlinuz
      initramfs initrd.img followkernel
    • commenting-out the device_tree_address line:
      # device_tree_address=0x03000000

    The top of the resulting file should look like this:

    [pi4]
    kernel=vmlinuz
    initramfs initrd.img followkernel
    # kernel=uboot_rpi_4.bin
    
    [pi2]
    kernel=vmlinuz
    initramfs initrd.img followkernel
    # kernel=uboot_rpi_2.bin
    
    [pi3]
    kernel=vmlinuz
    initramfs initrd.img followkernel
    # kernel=uboot_rpi_3.bin
    
    [pi0]
    kernel=vmlinuz
    initramfs initrd.img followkernel
    
    [all]
    # device_tree_address=0x03000000

    This change has already been made in the Ubuntu 22.04 images, so those edits are not needed.

  10. Finally, it is the moment of, uh, further truth. Shut down the Pi:
    sudo shutdown -h now

    And, once it’s off (the green LED goes off and stays off), unplug or switch off power and remove the microSD card. Now power it back on again. Once the Ethernet LEDs come on, you’ll need to repeat the login-and-unlock process. But now you should have a fully armed and operational, full-disk-encrypted, empty Raspberry Pi server at your command!

    If at this stage nothing happens – the Ethernet LEDs do not come on, bearing in mind it can take nearly a minute – (and if you have a Raspberry Pi 3), it may be that your USB SSD is not compatible with the onboard bootloader. In that case you can either keep the microSD card inserted indefinitely (and revert the /etc/fstab change in Step 9), or look into the Raspberry Pi’s “Special bootcode.bin-only boot mode”.

    The latter is preferable, as it means the microSD card is never written to, which hopefully means it never risks corruption. You can either use the same microSD card, or a smaller one – though the smallest microSD cards readily available are 2Gbyte, of which bootcode.bin will take up just 52Kbytes, or 0.003%...

  11. And there’s one final caveat: I did this and it works. But it seems that the Samsung USB SSD uses a lot of power – so much so, that even with the branded power supply, running it makes USB+5V droop so far that no other USB peripheral will work at the same time including the branded Raspberry Pi keyboard. For true headless use this is of course fine, but it does limit any additional uses of the same Raspberry Pi. It’s likely that using a powered USB hub would fix this issue.

Tuesday, 4 December 2018

A rebasing vendor-branch workflow for Git

An earlier post laid out a neat way of managing third-party code, in a subdirectory of a larger Git repository, as a vendor branch. Here at Electric Imp we followed that scheme for the subsequent five years, and bitter experience over that time has now convinced us that it doesn’t work well in all circumstances, and a better plan is needed.

The issue is that essentially all local changes to the repository accrete over time into one massive merge commit. If we’re talking about a sizeable amount of code, and if we’ve made lots of changes to it over time, and if upstream have also made lots of changes to it over time, then re-merging that single merge commit soon becomes horrendous. Basically, the original bow-shaped vendor branch workflow does not scale.

So what to do instead? We need to deconstruct that overwhelming single merge commit; we need, ideally, to move from a merging workflow to a rebasing workflow.

The broad outline of the situation is as follows: some time ago we merged an upstream version of some vendor code, but we’ve been hacking on it since. Meanwhile, upstream has produced a newer version whose updates we’d like to take – while keeping our changes, where relevant.

To achieve this, we’re first going to create a new branch, re-import a stock copy of the current upstream version, and then re-apply our changes, one by one, so that the tip of that branch exactly matches the tip of master.

So far, this sounds like a lot of work to achieve literally nothing! But what we now have, is a branch containing just our changes, in a sequence of individual commits. In other words, it’s exactly what’s needed in order to rebase those commits on top of a newer upstream release. So, starting from our re-import, we create a second new branch, then import a stock copy of the new upstream (so that there’s a single commit containing all upstream’s changes between the two), and then rebase our re-apply branch on top of that.

The upstream source in the example below is the same WICED SDK as used in the previous blog post; note that although the whole WICED project was sold by Broadcom to new owners Cypress in the meantime, many references to “Broadcom” still remain in our code, notably in the directory names. This ticks all three boxes for needing the heavyweight integration process: it’s large – 7,000 source files – we’ve made lots of changes, and upstream have also made lots of changes.

Here’s exactly what to do. First we need to know which commits, since the last time we took an upstream version, changed the vendor directory – here, thirdparty/broadcom. Fortunately, git log is able to tell us exactly that. We need to go back in history (perhaps using gitk’s “find commits touching paths” option) and find the commit where we originally took our current upstream version. In the repo I’m using, that’s ac64f159.
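If you’d rather stay on the command line than reach for gitk, something along these lines may turn up the right commit – it assumes, which won’t be true of every repository, that the earlier import commits had a recognisable “import”-style message:

git log --oneline -i --grep='import' -- thirdparty/broadcom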

The following command logs all commits that have changed the thirdparty/broadcom directory since then:

git log --format=%H --reverse --ancestry-path ac64f159..master -- thirdparty/broadcom

For me, that lists 198 commits! Because of the --reverse, they’re listed in forwards chronological order: in other words, in the order in which we need to re-apply them on our re-merge branch. Let’s put that list of commits in a file, commits.txt:

git log --format=%H --reverse --ancestry-path ac64f159..master -- thirdparty/broadcom > commits.txt
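A quick sanity check before ploughing on – the file should contain the same 198 commit hashes, one per line:

wc -l commits.txt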

Now we start a new branch:

git checkout -b pdh-wiced-remerge

...and re-import the current upstream version, here WICED-3.5.2:

rm -rf thirdparty/broadcom
cd thirdparty
7zr x WICED-SDK-3.5.2.7z
cd ..
mv thirdparty/WICED-SDK-3.5.2 thirdparty/broadcom
(clean up the source as necessary, fix CR/LFs etc.)
git add -A thirdparty/broadcom
git commit -m"Import stock WICED 3.5.2"
git tag pdh-wiced-3.5.2-stock

Notice that we tag this commit, as we’ll be referring to it again later.

Now we’d like to replay our long list of local commits on top of that release – to get back, as it were, to where we started. The thing to note here is that we can do this in a completely automated way – there should be no chance of merge conflicts, as we’re re-applying the same commits on top of the same upstream drop. It’s so very automated that we can do each one using a script, which I called apply-just-broadcom:

#! /bin/bash
# Apply commit $1, but only where it touches thirdparty/broadcom, and
# commit the result re-using $1's commit message and authorship.
git checkout $1 -- thirdparty/broadcom   # copy that commit's version of the directory into the index and worktree
git add -u thirdparty/broadcom           # stage modifications and deletions of tracked files under the directory
git commit -C $1                         # commit, reusing $1's message, author and date

This says, first checkout (or, really, apply) the commit named in the script’s first argument – but only where it affects the thirdparty/broadcom directory. Any other parts of the commit aren’t applied. This automatically adds any new files, but it doesn’t delete anything deleted by the commit we’re applying – so we stage those deletions using git add -u. Finally we want to commit those changes to our new branch, but using the commit message they originally had – for which git commit’s -C option is exactly what we want.

Armed with that script, we can then apply, one-by-one, all the commits we identified before:

git checkout pdh-wiced-remerge
for i in `cat commits.txt` ; do scripts/apply-just-broadcom $i ; done

This will churn through the commits, one by one, re-applying them to the pdh-wiced-remerge branch. Because they’re all getting applied in the same context in which they were originally committed, they should all go in one-after-the-other with no conflicts or warnings. (If they don’t, perhaps your re-import of the old upstream didn’t quite match the original import, so fix that and start again.)

And now we should be, quite circuitously, back where we started, with the tip of the pdh-wiced-remerge branch matching master exactly:

git diff pdh-wiced-remerge master

...which should show no differences at all. What you’d see in gitk is something like the image to the right, showing master with your branch coming off it. And the branch contains the (re-)import commit, then all the work that’s been done since. Scrolled way, way off the top of that screenshot, about 150 commits further up, is the tip of the pdh-wiced-remerge branch.

Optionally but usefully, you can now tidy up the branch to make it easier to apply to the new release. For instance, if the branch contains patches that were later reverted, you can use interactive rebase to remove both the patch and the revert, for the same result but avoiding any chance of the patch later causing conflicts. Doing this should still leave no diff between your branch and master.
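As a sketch of that tidy-up (using the tag created earlier), the branch’s own commits can themselves be interactively rebased in place; deleting the “pick” lines for a patch and for its matching revert removes the pair:

git rebase -i pdh-wiced-3.5.2-stock pdh-wiced-remerge
git diff pdh-wiced-remerge master    # should still show nothing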

Even more optionally, but still usefully, I needed at this stage to rewrite a bunch of the commit messages. Those numbers in square brackets in the commit summaries are instructions to our git server to link the commits with the related user-story (or bug) in Pivotal Tracker. (The reason they don’t all have one is that some of the original commits were themselves on bow-shaped branches, and only the merge commit carried the Pivotal link.) I wanted to remove those links, so that pushing my branch didn’t spam every Pivotal story associated with a thirdparty/broadcom commit, some of which were years old by this stage. But there are tons of them, so I wanted to rewrite the messages programmatically. It’s definitely out of scope for this blog post, but I ended up using the powerful and not-to-be-trifled-with git filter-branch command, in conjunction with a tiny sed script:

git filter-branch --msg-filter "sed -e 's/\[\#\([0-9]*\)\]/(\1)/g'" master..pdh-wiced-remerge

This monstrosity rewrites every commit message between master and pdh-wiced-remerge, to replace square-bracket Pivotal links with curved-bracket “links” that don’t trigger the automatic Pivotal integration.
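A quick way to convince yourself that the rewrite worked – assuming, as here, that every link took the square-bracket-and-hash form the sed expression matches – is to check that nothing of that shape survives in the rewritten messages:

git log --format=%B master..pdh-wiced-remerge | grep '\[#' || echo 'all clean'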

Anyway, that aside, we’re now in a position to merge the new WICED release, upstream version 3.6.3. So let’s tag where we are now, so we can refer to it later:

git checkout pdh-wiced-remerge
git tag pdh-wiced-3.5.2-merged

We want to branch our WICED-3.6.3 branch from the stock 3.5.2 re-import, so that there’s a diff that just shows upstream’s changes between 3.5.2 and 3.6.3. So that’s:

git checkout pdh-wiced-3.5.2-stock
git checkout -b pdh-wiced-3.6.3

And now we can do the same dance as before, removing the old thirdparty/broadcom and replacing it with the new one:

rm -rf thirdparty/broadcom
cd thirdparty
7zr x WICED-SDK-3.6.3.7z
cd ..
mv thirdparty/WICED-SDK-3.6.3 thirdparty/broadcom
(clean up the source as necessary, fix CR/LFs etc.)
git add -A thirdparty/broadcom
git commit -m"Import stock WICED 3.6.3"
git tag pdh-wiced-3.6.3-stock

Again, because this is a complete replacement, there’s no chance of merge conflicts.

Now just the rebasing operation itself remains. Because we tagged all the important points, we can use those tag names in the rebase command:

git rebase --onto pdh-wiced-3.6.3-stock pdh-wiced-3.5.2-stock pdh-wiced-remerge

This will be a lengthy rebasing operation, and merge conflicts are likely to occur. These can be fixed “in the usual way” – this isn’t a process usable by people who aren’t already experienced in fixing rebase conflicts, so I won’t say much more about fixing them here. But note that the rebase operation always tells you which commit it got stuck on – so you can go and look at the original version of that commit to check what its intention was, and also at the pdh-wiced-3.6.3-stock commit, containing exactly upstream’s changes between 3.5.2 and 3.6.3, in the hope that you can glean some insight into what upstream’s intention was.
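For reference, the loop at each conflict is just the usual rebase machinery – nothing here is specific to this workflow:

git status                     # see which files under thirdparty/broadcom conflicted
(edit the conflicted files, bearing both intentions in mind)
git add thirdparty/broadcom
git rebase --continue
git rebase --skip              # instead, if the stuck commit is no longer wanted at all
git rebase --abort             # or bail out entirely and return to the pre-rebase state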

If you’ve previously been using our original vendor-branch scheme, the first of those commits will be the worst trouble to integrate, as it will be the single big-bang merge commit from the previous integration. But at least you can console yourself now that future integrations will be no harder than this – as opposed to the previous scheme where they just kept getting harder.

Once you’ve completed the rebase, the repository will look basically as it is on the right. Notice that the rebase operation has “moved” the branch pdh-wiced-remerge so that it’s now on top of the 3.6.3 merge; it’s not still where the pdh-wiced-3.5.2-merged tag points to. (Rebasing always moves the branch that you’re rebasing; it never moves any tags.)

Now you get to build and test the pdh-wiced-remerge branch; it’s likely that it currently doesn’t even compile, and here is where you make any major changes needed to deal with the new drop. (Minor, textual changes may have already been fixed during the rebase process.) Add any new commits as necessary, on the pdh-wiced-remerge branch, until everything builds, runs, and passes all its tests. This may or may not be particularly arduous, depending on the nature of the changes made upstream. (But either way it’s still less arduous than the same thing would have been with the merging workflow.)

Now all that remains is to merge the pdh-wiced-remerge branch back to master as a normal bow-shaped branch:

git checkout master
git merge --no-ff pdh-wiced-remerge

And you should end up with a repository that looks like the one at left: a bow-shaped branch containing, in order, the re-imported 3.5.2; the newly-imported 3.6.3; the rebased versions of all your previous fixes; and any new fixes occasioned by 3.6.3 itself.

Notice that the replayed fixes leading up to the tag pdh-wiced-3.5.2-merged don’t appear in the branch as finally merged, but that the stock WICED 3.5.2 commit does. This is probably what you want: it’s important to have a commit that denotes only the changes made upstream between releases, but it’s not very important to record the replayed fixes – after all, those commits were generated by a script to start with, so the information in them can’t be that significant.

So now 3.6.3 is merged. And when the next upstream release comes along, you get to repeat the whole process.

But what if several upstream releases have happened since you last took one? There are two options: either you do the above process once, taking the latest release – or you do it repeatedly, once for each intermediate release. The latter is the best option if widespread changes have occurred, as it lets you fix each conflict separately, rather than all in one go. And in fact if you know you’ve got several upstream releases to merge, you don’t have to bother merging back to master each time (especially if you too have 198 commits on your remerge branch): you can keep branching and rebasing and then merge the final result to master in one go, as seen on the right and sketched in commands below. The branch, as finally merged, would contain, in order:

  • stock 3.5.2
  • stock 3.6.3
  • stock 3.7.0
  • ... further upstream releases ... (red)
  • fixes from 3.5.2 and earlier (pink)
  • fixes from 3.6.3 (pale blue)
  • fixes from 3.7.0 (mid-blue)
  • ... further sets of fixes ... (green)

All of which means that the tip of that branch would be the latest upstream version, with all relevant fixes applied. Which is exactly what needs to be merged.
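As a concrete sketch of one turn of that loop – taking 3.7.0 on top of the 3.6.3 work, with tag names following the same pattern as before – it’s just the import-and-rebase steps again, with no merge to master until the very end:

git checkout pdh-wiced-3.6.3-stock
git checkout -b pdh-wiced-3.7.0
rm -rf thirdparty/broadcom
(unpack WICED-SDK-3.7.0 as thirdparty/broadcom and clean up, as before)
git add -A thirdparty/broadcom
git commit -m"Import stock WICED 3.7.0"
git tag pdh-wiced-3.7.0-stock
git rebase --onto pdh-wiced-3.7.0-stock pdh-wiced-3.6.3-stock pdh-wiced-remerge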

Friday, 14 September 2018

Enable and disable methods are an anti-pattern

Here's a C++ API for an abstracted I2C master peripheral, such as might be found on a microcontroller – an API which I invented:

class I2C
{
public:
    virtual ~I2C() {}

    virtual void setClockSpeed(ClockSpeed speed) = 0;
    virtual void enable() = 0;
    virtual void disable() = 0;
    virtual ssize_t read(uint8_t target, const uint8_t *subAddress,
                         size_t subAddressBytes, uint8_t *buffer,
                         size_t bufferBytes) = 0;
    virtual ssize_t write(uint8_t target, const uint8_t *message,
                          size_t messageLength) = 0;
};

The details of how an I2C bus actually works aren't so relevant to the current issue, so I've removed the comments. What's interesting here is what the calling code needs to look like: the caller has to set the desired clock-speed, enable the peripheral, read and/or write using it, and then disable it again.
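In other words, something like this – where i2c is a pointer to some concrete implementation of the interface, and the device address, register and clock-speed constant are invented for illustration rather than taken from any real driver:

uint8_t whoAmI = 0x0F;
uint8_t id;

i2c->setClockSpeed(CLOCK_100KHZ);   // some implementation-defined ClockSpeed value
i2c->enable();
i2c->read(0x6B, &whoAmI, 1, &id, 1);
i2c->disable();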

That I2C class has been happily chugging away inside Electric Imp devices of various types for about seven years now. Which is why it comes as such a surprise to me that I now think it's terrible.

The realisation came after spending some time learning the Rust language, in the course of which I've been trained into caring a lot about object lifetimes – because the compiler actually rejects code containing use-after-free errors and whole swathes of further bugs of that sort. I've seen bugs we diagnosed via telemetry reports from crashing units in production which would have been caught as compile-time errors in Rust. So Rust is great, once you've got your head round it; but in fact it's so great that I now find myself writing better C++ because of the experience of writing some Rust.

From an object-lifetime point-of-view, the API listed above is error-prone and so undesirable. It's too easy for client code to get things wrong: calling read() before you've called enable(), calling read() after you've called disable(), or calling setClockSpeed() – er, wait, when should you call setClockSpeed()? Is it only allowed before enable? Or is it only allowed afterwards? Or can it be in either position? What guarantees on that point are offered by implementations of this interface?

An act of API design should be an act of humility; you should, to a certain extent, be aiming to make other people's lives easier at the expense of making your own life harder. I failed to do that here. Offering a read() method that just returns ERR_NOT_ENABLED if you haven't called enable() yet is just leading callers down the garden path. That ordering should be expressed in the API, so that, in the Rust style, it's a compile-time error to get it wrong (or alternatively, in another situation familiar to Rust programmers, so that the error simply isn't expressible in the first place).

And how can that be arranged? By not giving the caller access to the read() method until they've called the enable() method. What if we remove the read() and write() methods from class I2C altogether, and put them in a "handle" class which callers can't get hold of except by enabling the peripheral? This ends up looking like the API below:

class EnabledI2C
{
public:
    virtual ~EnabledI2C() {}

    virtual ssize_t read(uint8_t target, const uint8_t *subAddress,
                         size_t subAddressBytes, uint8_t *buffer,
                         size_t bufferBytes) = 0;
    virtual ssize_t write(uint8_t target, const uint8_t *message,
                          size_t messageLength) = 0;
};

class I2C
{
public:
    virtual ~I2C() {}

    virtual std::unique_ptr<EnabledI2C> open(ClockSpeed cs) = 0;
};

The disable() method has disappeared altogether; when the calling code has finished with its I2C controller, it can just reset or destroy the unique_ptr, and the destructor of EnabledI2C (or rather, the destructors of its subclasses) can do the work of disable(). Likewise, the work of enable() can be done in the constructor, bringing some RAII to the situation and making the implementations, as well as the calling code, less error-prone.

Without a method called "disable", "enable" looked like the wrong name, so it became "open". And this refactoring has also cleared up the ambiguity around setClockSpeed(); in fact, one implementation of this I2C interface allowed it to be called at any time, even in-between reads, but in another implementation it only took effect at enable() time. In practice, the calling code all expected it to work in the latter way, so we can codify that directly into the new API by incorporating it into the "enable" ("open") call itself, so that it's impossible to call it too early or too late.
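Calling code against the revised API then looks something like this (again assuming a concrete implementation behind a reference called bus, and the same invented address and register):

{
    std::unique_ptr<EnabledI2C> i2c = bus.open(CLOCK_100KHZ);

    uint8_t whoAmI = 0x0F;
    uint8_t id;
    i2c->read(0x6B, &whoAmI, 1, &id, 1);
}
// i2c has gone out of scope: the peripheral is already disabled, and
// there's no longer any handle through which to call read() too late.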

And again but in C

C++'s support for RAII definitely adds both neatness and robustness to the code here. But part of the principle – that where API calls must be ordered or nested in order to work properly, that ordering or nesting should be enforced by the code rather than meekly whined-about by its documentation – applies even in a C version of the same API:

struct I2C; // opaque
struct EnabledI2C; // opaque

EnabledI2C *I2C_OpenEnabled(I2C *i2c, unsigned clockSpeed);

ssize_t EnabledI2C_Read(EnabledI2C *ei2c, uint8_t target,
                        const uint8_t *subAddress,
                        size_t subAddressBytes, uint8_t *buffer,
                        size_t bufferBytes);
ssize_t EnabledI2C_Write(EnabledI2C *ei2c, uint8_t target,
                         const uint8_t *message, size_t messageLength);
void EnabledI2C_Close(EnabledI2C **pei2c);

Without RAII, the "close" corresponding to the "open" must be explicit, but at least it's still not possible for the client to call things in the wrong order.
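The calling code, with the same invented addresses, then looks like this; note that EnabledI2C_Close takes a pointer-to-pointer, presumably so that it can null out the caller's handle and close off the use-after-free route as well:

uint8_t whoAmI = 0x0F;
uint8_t id;
EnabledI2C *ei2c = I2C_OpenEnabled(i2c, 100000); /* i2c obtained from the platform */

EnabledI2C_Read(ei2c, 0x6B, &whoAmI, 1, &id, 1);
EnabledI2C_Close(&ei2c);                         /* ei2c is now NULL */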

And again but in Squirrel

This part is a bit of a cheat; the Electric Imp Squirrel API isn't actually implemented in Squirrel; rather, it's a set of bindings from Squirrel to the C++ code in the improm. But even so the same pattern can apply; here's (the native-Squirrel equivalent of) what it currently looks like:

class I2C {
    function configure(clockSpeed) { ... }
    function read(deviceAddress, registerAddress, numberOfBytes) { ... }
    function write(deviceAddress, registerPlusData) { ... }
    function disable() { ... }
};

This API is used by our customers left, right, and centre, and we can't ever change it. (We could extend it, perhaps.) But if we had our time over again, I'd have done it like this:

class I2C {
    function open(clockSpeed) { ... return EnabledI2C(...); }
};

class EnabledI2C {
    function read(deviceAddress, registerAddress, numberOfBytes) { ... }
    function write(deviceAddress, registerPlusData) { ... }
};

As in C++, the disable() call has disappeared, because it's all done in the destructor of the EnabledI2C object, which is called when the Squirrel reference-counting system determines that the object is no longer being used. The calling code might look like this:

local e = hardware.i2cAB.open(100000);
e.read(...);
e.write(...);
e = null;

Alternatively, it could just rely on e going out of scope at the end of the function, instead of explicitly setting it to null. But the cheating here has intensified: native Squirrel does not include destructors, and so doesn't allow for RAII. However, Squirrel objects that are bindings to other languages – such as the C++ bindings used by the Electric Imp API – can have destructors using Squirrel's sq_setreleasehook mechanism, and so can be used with RAII.
