Not in fact any relation to the famous large Greek meal of the same name.

Monday 16 November 2009

This Code Is Completely Bogus

Here’s some rubbish code I wrote:

class Connection: public util::Stream
{
    ...

public:
    class Observer
    {
    public:
        virtual ~Observer() {}

        virtual unsigned OnHttpHeader(const std::string& key,
                                      const std::string& value) = 0;
        virtual unsigned OnHttpData() = 0;
        virtual void OnHttpDone(unsigned int error_code) = 0;
    };

private:
    explicit Connection(Observer*);
    friend class Client;

public:
    ~Connection();

    unsigned int OnActivity();

    // Being a Stream
    unsigned Read(void *buffer, size_t len, size_t *pread);
    unsigned Write(const void *, size_t, size_t*) { return EINVAL; }
};

typedef boost::intrusive_ptr<Connection> ConnectionPtr;

class Client
{
public:
    Client();

    /** Passing a NULL verb means POST (if !body.empty()) or GET (otherwise).
     */
    ConnectionPtr Connect(util::PollerInterface *poller,
                          Connection::Observer *obs,
                          const std::string& url,
                          const std::string& extra_headers = std::string(),
                          const std::string& body = std::string(),
                          const char *verb = NULL);
};

It’s meant to be an HTTP client implementation, where the Connect() call returns a smart pointer to a connection object that manages its own lifetime. The Connect() call sets it all off, and attaches it to the PollerInterface object, which calls back into the Connection object (the OnActivity method) whenever the socket becomes writable (or readable, depending on where we are in the HTTP transaction).

I mean, just look at it: it’s obviously completely bogus. [Short pause whilst you just look at it.]

Actually, no. It wasn’t at all obvious to me just from looking at it, that it was completely bogus. The only point, in fact, at which it became obvious that it was completely bogus, was when it started causing bugs.

And the bugs it caused were quite awkward ones: crashes (and Valgrind violations), obviously timing-related, to do somehow with Connection pointers still being used after the object had gone away — which should really have been disallowed by the smart-pointer class.

It turned out that the problem arose when the transaction completed too quickly. At the time the code was written, the PollerInterface object didn’t own (smart pointers to) the pollable items hung off it. So, in order to stop the Connection object disappearing, due to the final reference going away, during a call to OnActivity, OnActivity itself creates and holds an additional reference to the Connection object for the duration of its execution. But if the transaction got started quickly enough, the first call to OnActivity would happen before the constructor returned — in other words, before the pointer had been assigned to the smart pointer that’s the result of Connect(). So the “additional” reference held inside OnActivity would be the only reference — and when it went away at the end of the function, there’d be no outstanding references and the object would be deleted.

The effect was as if the constructor had included “delete this” — resulting in the new call in “p = new Connection” returning a pointer that was already dead and dangling before even being assigned to p.

Completely bogus. And worse, a completely bogus design; given the constraints, there was nothing that could possibly be done in the methods of the classes Client and Connection that would correctly implement that interface. The interface itself needed to change. Fortunately, I decided it didn’t need to change much:

class Connection
{
    ...
public:

    /** Start the HTTP transaction.
     *
     * Immediate errors (failure to parse host, failure of connect()
     * call) are returned here; errors happening any later come back
     * through Observer::OnHttpDone. In fact, OnHttpDone may be called
     * before Init() returns, i.e. you need to be ready for OnHttpDone
     * calls before you call Init(). If Init() returns a failure,
     * OnHttpDone has not been called, and is guaranteed not to be
     * called afterwards. Otherwise, it's guaranteed it WILL be
     * called.
     */
    unsigned int Init();
    ...
};

So you get a completely inert ConnectionPtr back from Client::Connect, and only then — once you’ve safely squirreled away the smart pointer — do you light the blue touch-paper by calling Connection::Init().

But although this version at least makes it possible to use the functionality correctly, it doesn’t really make it easy — and that should be the real goal of any act of API design. There’s that uneasy division of labour between Connect() and Init() — methods on two different classes — and there’s a whole paragraph of complex object-lifetime issues to read and understand (or, as they’re otherwise known, bugs waiting to happen) for users of Init().

This, and particularly the lifetime of the Connection::Observer object (which usually ends up having to call “delete this” in its OnHttpDone() method), left me with one of those code itches that tells me that I (and the largely-theoretical other users of this API) am writing more complex and icky client code than I should be.

Neatening this up required making the object-lifetime issues more sane, which in turn involved greater use of smart pointers. (Not quite Wheeler’s Law, because there was already a use of indirection, and the change involved only a strengthening from reference to ownership.) In the next release of Chorale, the PollerInterface has been replaced by a Scheduler, which keeps smart pointers to its pollable items, allowing the HTTP client API to be simplified to this:

class Connection: public util::Stream
{
public:
    virtual ~Connection() {}

    /** Called with data from the returned HTTP body (if transaction
     * succeeds).
     */
    virtual unsigned Write(const void *buffer, size_t len, size_t *pwrote);

    /** Called with each incoming HTTP header (if connection succeeds).
     */
    virtual void OnHeader(const std::string& /*key*/,
                          const std::string& /*value*/) {}

    /** Called with the overall result of the connection/transaction attempt.
     */
    virtual void OnDone(unsigned int error_code) = 0;
};

typedef util::CountedPointer<Connection> ConnectionPtr;

class Client
{
    class Task;

public:
    Client();

    /** Passing a NULL verb means POST (if body != NULL) or GET (otherwise).
     */
    unsigned int Connect(util::Scheduler *poller,
                         ConnectionPtr target,
                         const std::string& url,
                         const std::string& extra_headers = std::string(),
                         const std::string& body = std::string(),
                         const char *verb = NULL);
};

So the Scheduler owns the (unseen to the library user) connection Task objects, and the connection Task objects own the Connection stream target objects. Connect() can return immediate errors from the socket ::connect call or from URL parsing, while deferring any other errors to come back through a later OnDone callback — all without there being any ambiguity of the lifetime of the streams or their observers.

(There would be a problem if the Connection object also had a smart pointer to the Task object, as then neither would ever be deleted and both would become memory leaks. But, because the data is pushed from the Task to the Connection, the Connection never needs to see the Task object — and indeed can’t, because Tasks live and die entirely inside library code, and users of the library can’t even obtain pointers to them.)

“Simplicity does not precede complexity, but follows it.”
(attributed to Alan Perlis)

So here’s the thing. The third design is clearly better than the second. Well, it’s clearly better than the first, too, but mainly because the first doesn’t work, which is a boring and trivial way for one design to be better than another. The interesting thing is that it’s better than the second.

And better it certainly is — making this change halved the size of the code that uses the library, as well as making it more intentional and less fragile and stylised.

So why, having got to the second design, was I not satisfied? Why did I carry on thinking about it, waiting for the inspiration of the third design to strike? And why, having come up with the third design, was there a feeling of happiness that wasn’t present when writing the second one, even when it passed all the unit tests the first one failed?

The only answer I can come up with is to theorise the existence of an almost aesthetic sense of code quality — which is worrying in a couple of ways. Firstly, because what is instinctive is rarely communicable, and what is not communicable is soon lost: a software engineer’s fate is that of the wheelwright in the old story of Duke Huan.

But worse than that: if code quality is in fact an aesthetic, and thus extrarational, experience, then it raises the prospect that others, even other good software engineers, could have a different sense of aesthetics, ultimately resulting in a point where you and they are pulling the same code in opposite directions. (I heard recently of a software organisation, believers in the currently-fashionable “agile”, “refactor-mercilessly” style of development, in which two otherwise talented engineers spent all their time rewriting each others’ code rather than pushing things forward — as their aesthetic senses, and frankly their assumptions about who was “in charge” in the deliberately un-micro-managed environment, butted heads.)

No aesthete could get away with “correcting” the second design above into the first: the failing unit tests would prevent that. But are there those who would correct the third design into the second, in the opposite direction to me? If so, why? And, even more importantly, if not, why not?

Wednesday 14 October 2009

Productivity Gains With KDE4

Honesty’s a good thing, usually. In particular, it’s usually a good thing in software engineering, in which most of what we do is digital and repeatable, either one thing or the other; this fosters a culture of honesty in the same way that it does (or should do) in science, as Feynman points out in Cargo Cult Science. Unlike in a courtroom, or even in a courtroom drama, software engineering rarely comes down to one person’s word against another. (Well, unless your co-workers are sociopaths.)

But there are situations where more honesty isn’t a Good Thing. One example, in fact, is a courtroom drama: if you’re writing such a thing, the principle of pure honesty would have you title it something like “Not Guilty of Murder” — whereas, in fact, letting people know the verdict before they’ve seen the piece robs it of its whole point.

Now KDE3 came with a desktop toy called KPat, a patience game which (among others) includes an implementation of the same “Klondike” patience found in Windows Solitaire. The KDE3 (3.5.10) version of KPat had a feature where, if it detected you’d got into a situation where it was impossible to complete the hand, it would stop the game and tell you so. Naturally, such a feature has to be extremely conservative: it must stop the game only when it can be proved that forward progress is impossible. And in fact the algorithm in KDE3 KPat was very conservative indeed: it kicked in so rarely that seeing the message always came as a surprise, even on games you already knew you’d lost.

Occasionally I’d idly wonder whether the lost-game detection could be improved — but then I realised that actually, if you improved it enough, you’d eventually get to a situation where the game detects, and tells you, the moment you’ve made a move that leads only to dead-ends. “OK, you’re an idiot, bye, next.” There’d be no point playing the game at all.

And yet, in the KDE4 (4.3.2) version of KPat, that’s exactly what’s been implemented. In the status bar, the whole time, is one of two messages — either “Solver: This game is winnable” or “Solver: This game is not winnable in its current state”. Any time you make a move that makes the first message change to the second, you soon reconsider!

And so, alongside the huge amount of work on the graphics (the original neat bitmaps have become huge and flouncy SVG images; the codebase diff is huge even ignoring the Solver; the whole thing has unexpectedly acquired an Ancient Egyptian feel) the developers have completely ruined the actual game. All the time you’re playing, it’s as if a stern examiner is watching over your shoulder, always ready to lean forwards and intone “Now I don’t believe you wanted to do that”. Worse, unlike the KDE3 algorithm, which just summarised information you could already see, the examiner can see the cards that you can’t, making it an eerily omniscient guide, not to mention a shocking cheat.

Solving patience is a great technical achievement. And it’s certainly scrupulously honest to tell the player exactly what the prospects of success are. But it’s a technical achievement that shouldn’t have been achieved (or shouldn’t be present in the game itself, even though it can be turned off), and a case where honesty is definitely not the best policy. And one of the best little time-wasters in KDE has, effectively, been eliminated.

Saturday 12 September 2009

Recursive Make Is A Time-Waster

Now you’ve arranged to divide your project into libraries, how do you go about compiling them?

Clearly you want to be able to do the whole thing in one go: to do otherwise would be to fail the Joel test. And you want all the dependencies to be checked every time: there’s no bug harder to find than one that no longer actually exists in the source you’re looking at. (And stale objects, not updated following a header change, can lead to ODR violations: in other words, to random spookiness and borkage that the compiler and linker aren’t required to warn you about, and in many cases can’t even theoretically do so.)

GNU Automake is one answer to the problem, and makefiles in the same spirit as the ones it generates are popular even in projects that don’t use Automake. Such makefiles express inter-module dependencies, such as that of an application on a library, by recursively invoking make in the library directory before invoking it again in the application directory.

For a variety of reasons, documented once and for all in the famous paper Recursive make Considered Harmful, this is a bad idea. The paper suggests that for some systems, it’s even a bad idea for correctness reasons — that incorrect builds with inadequately-followed dependencies can result. But the example given is that of a system whose module dependencies can’t be serialised; this naturally means Automake’s sequential run over the sub-Makefiles can’t do the Right Thing. However, if the module dependencies can’t be serialised, that means there’s a cycle and they don’t form a true hierarchy; that’s a bad situation for more reasons than just Automake’s — bad for reusability, bad for unit-testability, and bad because it reveals that the system designers haven’t really thought through what the rôles of the modules are.

So if that doesn’t apply to you, if your modules are properly factored into a neat hierarchy, does that mean there’s less incentive to ditch a recursive-make scheme and take the time to write a non-recursive one? Less, perhaps, but decidedly not none — because there are substantial performance benefits from whole-project makefiles on modern systems.
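The non-recursive alternative can be sketched as one top-level Makefile that includes per-module fragments rather than recursing into per-module Makefiles. This is a minimal sketch only — the directory names, variables, and fragment convention here are invented for illustration, not Chorale’s or Automake’s:

```make
# Each module contributes a makefile *fragment*; a single top-level
# make invocation then sees every dependency edge in the project.
include libutil/Lib.mk      # sets LIBUTIL_SOURCES
include libupnp/Lib.mk      # sets LIBUPNP_SOURCES
include app/App.mk          # sets APP_SOURCES

SOURCES := $(LIBUTIL_SOURCES) $(LIBUPNP_SOURCES) $(APP_SOURCES)
OBJS    := $(SOURCES:.cpp=.o)

app: $(OBJS)
	$(CXX) -o $@ $^

%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c -o $@ $<

# Because there is only one make process, "make -j8" is free to start
# compiling app/ files as soon as their own prerequisites are ready,
# instead of waiting for entire libraries to finish.
```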

This effect is, to be fair, mentioned in Recursive make Considered Harmful (7.1), but the author didn’t draw it out on pretty graphs, nor quantify the improvements, so there is scope for it to be better-documented.

Suppose your project has ten components (nine libraries and the application, say), each with between four and twenty source files. These source files won’t, naturally, all take exactly the same length of time to compile — in fact, typically there’ll be quite a wide range of compilation times, especially if only certain files include certain rather profligate library headers. And further suppose that the machine you’re compiling on has eight CPU cores or threads (such machines are desktop-class these days), so you use make -j8.

If your makefiles are like Automake’s, what happens is that make will make the first library using up to eight CPUs, then, once that’s done, make the second library using up to eight CPUs, and so on until it’s done all of them in the serialised order you used in the top-level Makefile. Which is correct, but it’s not efficient, because make needlessly serialises the entirety of each library build against the entirety of the next one, when typically a library’s components will each depend on only some of the previous one’s targets.

In graphical form, your CPU utilisation would look like the following diagram, where time runs from left to right, each horizontal row represents a different CPU, and blocks of the same colour belong to the same library; the top half of the diagram shows CPU utilisation during a recursive-make-driven recompilation, and the bottom half the same recompilation done with a whole-project Makefile.


[Diagram: CPU utilisation, recursive make (top) vs whole-project Makefile (bottom)]

Using one Makefile has allowed make -j8 to keep all eight CPUs busy the whole time — as, whenever a CPU comes free, any pending work from anywhere else in the project can be scheduled on it. By contrast, the top half of the diagram has lots of whitespace, where all the CPUs must wait each time until the last one in that library has finished.

In this example, the build takes scarcely 70% of the time a recursive build would do. If you have more cores, perhaps in a multi-machine build farm, the benefit goes up: building on four cores sees a reduction only to 88%, but 12 cores sees the whole-project build take just 59% as long as the recursive version, and on 16 cores the time is better than halved. If, in the limit, you work for Dr. Evil, your entire company is basically one big compute surface and you can use “make -j one-meelion”, then the improvement factor is essentially the same as the number of libraries — in this case, 10x, though in practice if you went down that route you’d find communication overheads starting to bite into your scalability.

Four Quarters Of The Codebase

As a software project grows, eventually it becomes unwieldy to keep the entire source in one single gigantic folder: you’ll need to split it up. Canonically, and indeed quite correctly, the first idea most architects then have is to split the source into a main program plus one or more libraries that provide ancillary services.

Once that’s done, everyone on the project who sits down to write some new code, will have to think at least for a moment about where to put the source file. The answer, of course, is that the new code should go in the main program if it’s fairly specific to the current project (or if it depends on code that’s already in there), and into the library otherwise. The “force of gravity” — developers’ sense of where code should go unless there’s good reason otherwise — should be in the direction of pulling functionality over into the generic code.

Sometimes this decision seems harder than it should be: if that’s the case, what’s often going on is that the functionality in question should really be split up — with a generic part in the library, and the specific use this project makes of it in the main program. In this way, source code organisation can act as a forcing function for good design.

If that’s so useful, can we get extra design-forcing out of how we organise the source? One issue that often throws a spanner in the works of software designs whose creators didn’t originally bear it in mind, is portability between software platforms. It would be decidedly helpful if the source code organisation, and/or the system architecture it (one hopes) reflects, could help prevent platform specialists from unthinkingly tying code to a particular platform when in fact it’ll be needed more widely in future.

So instead of just application versus library, it’s helpful to organise the codebase so that a developer sitting down to write new code, will have to choose between four different places to put it:


    Application-specific,  |  Library code,
    platform-specific      |  platform-specific
    -----------------------+-----------------------
    Application-specific,  |  Library code,
    platform-agnostic      |  platform-agnostic

This means that, if you’re writing an application that you’re targeting at, say, both iPhone and Windows, or both desktop and embedded, then the actual application logic needs to go in the south-west corner, where implementations on any platform can use it; user-interface code mostly goes in the north-east, where other applications wanting similar UI components can use it; and only the minimum of code necessary to wire the two together goes in the north-west.

The force of gravity in this diagram acts towards the south and the east: if you can turn code from a Win32 application into generic Win32 components, or into a portable implementation of the logic, that’s a design improvement. If you can split the code up, even split a single class up, along those lines, that’s an improvement too. And by forcing developers to make the four-way choice before they write a source file, you’re helping everyone in the team to think about these issues and to factor the code in the best way as it’s first written. Fostering a feeling of “design guilt”, in yourself or others, when placing functionality further north or west than it belongs, gives everyone an easily-grasped metric for well-factored designs.

Really this is just another way of thinking about the model-view-controller pattern; the mapping is straightforward, except that MVC doesn’t have a word for the fourth quarter, libraries which are assumed to be at a lower level than the components MVC deals with:

    Controllers  |  Views
    -------------+--------------
    Models       |  (Libraries)

but, even so, embodying MVC in the source layout itself means that its issues get thought about earlier in the software process.

Now naturally, any real codebase above a certain size won’t consist of just “the” application and “the” library. There’ll be a whole collection of libraries, and perhaps of applications too, arranged in a dependency graph — hopefully, for reasons Lakos expounds, a graph which forms a non-cyclic hierarchy. But it should be possible to partition this graph into (up to) four parts, matching the four quarters of the codebase, and with all dependency arrows pointing southwards, eastwards, or south-eastwards. No view or model should depend on a controller; no platform-agnostic library should depend on any view, model or controller. Again, these are truisms of MVC design, but thinking about them in their four quarters gives an easy visual way of thinking about and discussing the issues.

Incidentally, my own Chorale codebase currently (as of v0.15) fails this test, or at least gets a technical pass only: it doesn’t have any platform-specific library code or directories, as all that code is rather bogusly stuffed in the application-specific, platform-specific directories. (Some classes are, in fact, platform-specific library classes, but the split isn’t enforced in the source organisation, either by libraries or by directories.)

Thursday 16 July 2009

Little Libraries

How small can you make a library?

Of course, one way to make it smaller is to make each individual file smaller: by using an optimising compiler, by turning off exceptions, by removing debug information (in release builds). But even once you’ve done that, there’s extra overhead involved in just being a library: how can that be minimised?

One choice that affects the answer, is whether you want a shared or a static library. Sometimes other considerations force the answer, but if you have the luxury of being able to choose either type, which do you choose for best compactness?

Both types have their advantages; static libraries are useful when the client, or most clients, use only a fraction of the library facilities: unused objects from the archive simply aren’t linked. By contrast, client code that uses any one facility from a shared library must link all of it. And the position-independent code (PIC) techniques needed to build a shared library, may be more expensive than normal code on some architectures. On the other hand, shared libraries offer control over symbol visibility, so an internally-complex library with a simple interface, can end up simple rather than complex at link-time. And, because a shared library is all-or-nothing anyway, it can be built as a single translation unit with no loss of generality — enabling better compiler optimisation.

So I tried it. I took libupnpd from Chorale, which is a small-to-medium sized library although with very few entry points, and tried to make it as small as possible while retaining its identity as a separate library. Here are the sizes it came out as, in the various attempts, as reported by size(1):

                                    -------------- amd64-linux --------------   --------------- arm-linux ---------------
                                      text  data  bss   total   vs(1)   vs(2)     text  data  bss   total   vs(1)   vs(2)
 1 Static library                   161443     8   94  161545    0.0%  -23.9%   140176     4  872  141052    0.0%  -24.1%
 2 Shared library                   205846  6216  112  212174  +31.3%    0.0%   177469  7968  412  185849  +31.8%    0.0%
 3 Shared library, sec              205844  6216   80  212140  +31.3%    0.0%   177421  7960  248  185629  +31.6%   -0.1%
 4 Shared library, vis              144726  5328  112  150166   -7.0%  -29.2%   120012  7472  412  127896   -9.3%  -31.2%
 5 Shared library, sec vis          144078  5328   80  149486   -7.5%  -29.5%   119452  7472  248  127172   -9.8%  -31.6%
 6 Shared library, whole sec vis    133582  5312   80  138974  -14.0%  -34.5%   107623  7468  412  115503  -18.1%  -37.9%
 7 Single object, whole             101011  3112   94  104217  -35.5%  -50.9%    85112  6688  380   92180  -34.6%  -50.4%
 8 Single object, whole sec         100638  3112   94  103844  -35.7%  -51.1%    85112  6688  380   92180  -34.6%  -50.4%
 9 Single object, whole sec vis      99775  3112   94  102981  -36.2%  -51.5%    85120  6688  380   92188  -34.6%  -50.4%
10 Single object, whole sec vis wp   99287  3112   94  102493  -36.6%  -51.7%    85028  6688  380   92096  -34.7%  -50.5%

(The “vs(1)” and “vs(2)” columns give each build’s total size relative to the static-library and plain shared-library builds respectively.)

Some explanation is, of course, in order.

1. Static library
Each file compiled separately, archived with ar.
2. Shared library
Standard ELF shared library: each file compiled separately with -fPIC, linked with gcc -shared. (Or, in fact, using libtool for simplicity. But that’s what libtool was doing behind the scenes.)
3. Shared library, sec
As (2), but compiled also with -ffunction-sections -fdata-sections, and linked with --gc-sections. This made approximately one gnat’s crotchet of difference.
4. Shared library, vis
As (2), but compiled also with -fvisibility=hidden -fvisibility-inlines-hidden, and with the small number of entry points labelled __attribute__((visibility("default"))). This should indeed have been much better than (2) or (3), but it came as a surprise to me that it’s also better than (1).
5. Shared library, sec vis
As (2) but with both the (3) and (4) optimisations applied. Again, --gc-sections makes a non-zero but insignificant difference.
6. Shared library, whole sec vis

OK, now we’re getting somewhere. This is as (5), but with the whole library built as a single translation unit. It’s the moral equivalent of GCC’s -combine, but as that only applies to C and this is C++, it’s done by writing a small .cpp file that does nothing but #include all the other .cpp files. (Doing this requires a certain discipline in the files in question, so as not to queer the pitch for subsequent code. But it’s certainly not an intolerable imposition.)

In the interests of scrupulous accuracy, I should point out that in fact not the whole of the library is compiled in the one file. One of the component parts needs special compiler options, so that one’s still compiled separately.

These settings correspond to what you get if you configure and build KDE3 with --enable-final.

7. Single object, whole
This is like the non-shared-library version of (6); the whole library is compiled in a single translation unit, but this time into an ordinary, non-PIC, object file, which is then put in a library by itself.
8. Single object, whole sec
As (7), but in little sections as per (3). This doesn’t make much difference on amd64-linux, and none at all on arm-linux.
9. Single object, whole sec vis
As (8) but with the visibility settings too. As those settings are only meant to apply to shared libraries, this shouldn’t have made a difference. But for some reason, it did, albeit a tiny one.
10. Single object, whole sec vis wp

As (9) but with the single library translation unit compiled with -fwhole-program, and the exported functions labelled __attribute__((externally_visible)).

Semantically and philosophically, this is saying much the same thing as the visibility settings. But there must be some extra little bit of optimisation the compiler can do in this situation. There isn’t a corresponding statistic for a shared library with -fwhole-program, as using that option disables PIC.

The effects of these various optimisations are remarkably similar, in relative terms, across both architectures.

Why size(1) Isn’t The Full Story

Especially considering the library is C++, some of the numbers above need to be taken with a slight pinch of salt. Most C++ programs have a lot of functions declared in header files; functions which are semantically inline but which, for various reasons (such as being listed in a vtable) aren’t actually purely inlined in practice. Such functions are emitted in every object file, in “link-once” sections, which are deduplicated by the final linker and emitted only once each.

The size(1) command counts these link-once sections under “text”, and doesn’t do any deduplication. So a static library made of many object files will often have a larger size (as quoted by size(1)) than it will actually add to the final binary — because of the multiple copies of all the link-once sections. So row (1) of the table is somewhat inflated — and in fact the “single object” rows are very slightly inflated, too, as although they only contain one each of the “link-once” sections, those will still get deduplicated again against any that are also used elsewhere in the final binary. In shared libraries, on the other hand, link-once sections are deduplicated when the shared library itself is linked, but can’t be deduplicated again at dynamic link time. So those numbers really do represent the amount that gets added to the final application footprint.

Interestingly (and fortunately, seeing as libupnpd isn’t actually the whole program), using -fwhole-program still emits the link-once sections, so they do get deduplicated against the final binary.

Why, If You Really Want Small, You Should Ignore All This

As the massive leap in density between the shared-library and single-object numbers in the table above shows, the compiler can do a lot better the more of your program it sees at once. So for the very smallest binaries, take this principle and turn it up to eleven: abandon (for release builds) the idea of using separate libraries at all, and compile your entire program and all its “library” code as a single translation unit.

Not that that’s a good idea if the library code itself is part of the product: if you don’t install the library, no other clients can link against it, and if you install the library but your program doesn’t use it, that’s wasted RAM when other processes load the library copies of all that code. But for “libraries” tightly wedded to a single binary, and especially for embedded systems (where typically there are no other processes), it does get you better code size than even the best options above.

Wednesday 17 June 2009

Type Erasure, boost::mutex, And The 617

The boost::mutex class is certainly not considered harmful. It’s jolly good, especially the way it, and its companion boost::condition, work properly on Win32. But it doesn’t half include a lot of header files to obtain all that jolly goodness.

Let’s get GCC to tell us what header files it includes. To do this, we need GCC’s dependency generation features, the ones normally used for generating Makefile fragments listing all the files a translation unit depends on.

$ echo "#include <boost/thread/mutex.hpp>" \
      | g++ -xc++ -E - -MP -M | grep ^/usr | sort
...long list...

OK, that’s indeed quite a lot of headers. How many exactly? (Using GCC 4.3.3, amd64-linux, Boost 1.39.0.)

$ echo "#include <boost/thread/mutex.hpp>" \
      | g++ -xc++ -E - -MP -M | grep ^/usr | sort | wc -l
618

Yes, a source file that includes nothing but <boost/thread/mutex.hpp> has actually also read six hundred and seventeen other headers by the time it’s done.

Now I’d hate to be one of those bloggers who does nothing but sit around and grouse about the people who are actually doing useful work. But I humbly submit that having to include 618 C++ headers just to get hold of a mutex class, is, in an objective sense, rather a lot.

Really what’s wanted is one of the “complete insulation” or type-erasure techniques from Lakos section 6.4: instead of every file in the project which needs a mutex including the Boost header and its 617 dependencies, only the encapsulating file needs to know the details, and everyone else can just include a simple leaf header, my_mutex.h.

The problem is, every encapsulation technique discussed by Lakos has an inevitable cost in efficiency. Whichever way you slice it — whether you use a mutex protocol class plus a factory, or whether you use the handle/body or pimpl pattern — you tend to end up needing to call new and delete, not just every time you create or destroy a mutex, but every time you create or destroy a mutex scoped-lock object. Especially for rarely-contended locks, that overhead can easily swamp the actual implementation time.

The issue is that, to end up with a boost::mutex, client code must either construct one itself — depending on boost::mutex “in size”, as Lakos terms it — or call other code which hands one back; and the latter scheme necessarily requires an allocation.

So how can we solve this impasse? How can we depend on boost::mutex “in size”, without depending on all its headers?

Well, we can cheat. We can populate my_mutex.h with classes whose objects are the same size as boost::mutexes — and which, within the file my_mutex.cpp, are boost::mutexes — but whose actual type is invisible to other clients of my_mutex.h. The idea is to behave a bit like this:

#include <new>  // for placement new

class Mutex
{
    char m_data[sizeof(boost::mutex)];

public:
    Mutex()
    {
        new (m_data) boost::mutex;
    }
    ~Mutex()
    {
        ((boost::mutex*)m_data)->~mutex();
    }
};
though of course that doesn’t yet help as-is, as you still need to include the 617 to get sizeof(boost::mutex) to compile.

To eliminate the dependencies, you need to know sizeof(boost::mutex) ahead of time. Probably the best way to do this, is to get your configury to work it out, by adding lines like these to your configure.ac:

AC_LANG_PUSH([C++])
AC_CHECK_SIZEOF([boost::mutex],, [[#include <boost/thread/mutex.hpp>]])
AC_CHECK_SIZEOF([boost::condition],, [[#include <boost/thread/condition.hpp>]])
AC_CHECK_SIZEOF([boost::mutex::scoped_lock],, [[#include <boost/thread/mutex.hpp>]])
AC_LANG_POP
which will leave you with lines like this in your config.h:
#define SIZEOF_BOOST__CONDITION 88
#define SIZEOF_BOOST__MUTEX 40
#define SIZEOF_BOOST__MUTEX__SCOPED_LOCK 16
giving you exactly what you need to write my_mutex.h. (One day when you want to see something deliciously evil, go and look at the way Autoconf determines such size information when cross-compiling, without depending on any particular compiler or linker.) The result looks like this:
/* my_mutex.h */
#include "config.h"

class Mutex
{
    char m_data[SIZEOF_BOOST__MUTEX];

public:
    Mutex();
    ~Mutex();
};


/* my_mutex.cpp */
#include "my_mutex.h"
#include <boost/thread/mutex.hpp>
#include <new>  // for placement new

Mutex::Mutex()
{
    new(m_data) boost::mutex;
}

Mutex::~Mutex()
{
    ((boost::mutex*)m_data)->~mutex();
}
which is indeed good enough to free client code from the burden of the 617.

But if you’re doing this wrapping for more than one class — Chorale wanted at least boost::mutex, boost::mutex::scoped_lock, and boost::condition — you start to realise you’re writing the same thing many times over, and that it really ought to be wrapped up in a template. The template below also makes an attempt to avoid alignment issues (a worry which might already have been nagging at you after the above snippets), and an assertion checks that the configury got the right answer for the size. Here it is:

template <class T>
struct WrappedType;

/** Wrap up a type so that even clients who depend on it in size,
 * don't have to see its declaration.
 */
template <class T, unsigned int sz>
class Wrapper
{
    union {
        char m_data[sz];
        void *m_align;
    };

    /** Use a nested class, rather than using WrappedType<T>::type
     * directly, so that we can be sure that its destructor is called
     * "~Wrapped" -- if T1 is a typedef-name, its destructor won't be
     * called "~T1".
     */
    class Wrapped: public WrappedType<T>::type
    {
    public:
        Wrapped() {}

        template <class Arg>
        explicit Wrapped(Arg& arg) : WrappedType<T>::type(arg) {}
    };

public:
    Wrapper()
    {
        BOOST_STATIC_ASSERT(sizeof(Wrapped) == sz);
        new (m_data) Wrapped;
    }

    template <class Arg>
    explicit Wrapper(Arg& arg)
    {
        BOOST_STATIC_ASSERT(sizeof(Wrapped) == sz);
        new (m_data) Wrapped(arg);
    }

    ~Wrapper()
    {
        ((Wrapped*)m_data)->~Wrapped();
    }

    /** Calls to Unwrap() will only compile following a definition of
     * WrappedType<T> -- not in client code.
     */
    Wrapped& Unwrap() { return *(Wrapped*)m_data; }
};
Armed with this, my_mutex.h can be very straightforward:
#include "config.h"
#include "wrapper.h"

/** Wrap a boost::mutex so we don't include so very many headers.
 */
class Mutex: public Wrapper<Mutex, SIZEOF_BOOST__MUTEX>
{
public:
    Mutex();
    ~Mutex();

    class Lock: public Wrapper<Lock, SIZEOF_BOOST__MUTEX__SCOPED_LOCK>
    {
    public:
        Lock(Mutex&);
        ~Lock();
    };
};

class Condition: public Wrapper<Condition, SIZEOF_BOOST__CONDITION>
{
public:
    Condition();
    ~Condition();

    bool Wait(Mutex::Lock&, unsigned int sec);

    void NotifyOne();
    void NotifyAll();
};
and my_mutex.cpp not much less straightforward; note that WrappedType<T> is used like a traits class, in that the intended wrapped type is “revealed” by specialising WrappedType for the particular wrapper type — it’s only following such a specialisation, that the constructor, destructor, or Unwrap() calls will compile:
#include "my_mutex.h"
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition.hpp>

template<>
struct WrappedType<Mutex>
{
    typedef boost::mutex type;
};

Mutex::Mutex()
{
}

Mutex::~Mutex()
{
}

template<>
struct WrappedType<Mutex::Lock>
{
    typedef boost::mutex::scoped_lock type;
};

Mutex::Lock::Lock(Mutex& mutex)
    : Wrapper<Lock,SIZEOF_BOOST__MUTEX__SCOPED_LOCK>(mutex.Unwrap())
{
}

Mutex::Lock::~Lock()
{
}

template<>
struct WrappedType<Condition>
{
    typedef boost::condition type;
};

Condition::Condition()
{
}

Condition::~Condition()
{
}

bool Condition::Wait(Mutex::Lock& lock, unsigned int sec)
{
    return Unwrap().timed_wait(lock.Unwrap(), boost::posix_time::seconds(sec));
}

void Condition::NotifyAll()
{
    Unwrap().notify_all();
}

void Condition::NotifyOne()
{
    Unwrap().notify_one();
}

So this technique is neat, if a bit icky (what with the casting and the explicit size-checking). Is it worthwhile? For Chorale, the answer was certainly “yes”. Chorale is vigorously multi-threaded, and many parts of the system use mutexes. Using a script a bit like the GCC-based dependency counter from the very top of this post, it turns out that Chorale’s 250-ish source files and 250-ish headers had between them nearly 240,000 dependencies. Wrapping up just boost::mutex and boost::condition reduced that number to 115,000 — meaning that more than half of all the header files compiled while compiling the whole of Chorale, a fairly large and complex program, were compiled solely to satisfy the dependencies of boost::mutex. This startling figure is also borne out by the total time taken to compile (with a cold page cache):

        Before wrapping    After wrapping
real    4m0.304s           2m39.516s
user    10m33.296s         6m10.523s
sys     1m38.870s          1m9.452s
...a 42% improvement. It is just possible that there’s a run-time efficiency impact of this wrapping, as the mutex operations now all get inlined into my_mutex.cpp and not directly into the client code. However, as is the way with these things, it’s really just as likely that this gives a performance improvement — some of these Boost calls inline a lot of code, especially under Win32.

Saturday 30 May 2009

How Not To Develop On Windows

This is a HOW-TO on writing Windows software, including GUI software, without using a Windows box at all except for final testing. Among other benefits, this lets you develop in the simple, testable, scriptable Linux environment; it means you don’t have to forever check things in and out of source-control to test whether you’ve broken the other platform’s builds; and it’s also handy if you’ve got a powerful Linux box but only a feeble Windows box.

Well, actually, it isn’t really a HOW-TO, in the grand tradition; it’s more of a “WHETHER-TO”. That’s because it doesn’t always go into enough detail to let you reconstruct everything described, but it at least tells you that it’s possible — so that if you’re wondering whether-to try and develop Windows software in this fashion, you can be reassured, before setting off in that direction, that the destination is attainable.

There are a lot of bits and pieces to set up a complete cross-development and cross-test environment. We should all be hugely grateful for the vast amount of development effort put in by the GCC, binutils, Mingw, Wine, Qt and other projects to enable the setup described in this one small blog post.

1. A cross-compiler toolchain targetting Mingw32

apt-get install mingw32-binutils mingw32-runtime mingw32

Alternatively, configure and install GNU binutils for target “i586-mingw32”; install the headers from the mingw32-runtime package (which you can’t build yet); configure and install a Stage 1 cross-GCC with

--target=i586-mingw32 --enable-languages="c" --enable-threads=win32 --disable-libmudflap --disable-libssp --enable-__cxa_atexit --enable-sjlj-exceptions --disable-win32-registry --with-gnu-as --with-gnu-ld
as extra configure arguments; configure and install the w32api package; configure and install the mingw32-runtime package; and finally configure and install a Stage 2 fully-working cross-GCC with
--target=i586-mingw32 --enable-languages="c,c++" --enable-threads=win32 --disable-libmudflap --disable-libssp --enable-__cxa_atexit --enable-sjlj-exceptions --disable-win32-registry --with-gnu-as --with-gnu-ld --disable-libstdcxx-pch --enable-libstdcxx-allocator=new
as configure arguments. Note that, supposing you want everything to live in /usr/local/i586-mingw32, you need to give GCC and binutils “--prefix=/usr/local”, and everything else “--prefix=/usr/local/i586-mingw32”.

Except for using w32api and mingw-runtime instead of glibc, this isn’t that different from how to build a cross-compiler for any other target.

If you want to use the exact same versions of everything I did, it’s binutils 2.19.51.0.2, mingw-runtime 3.14, w32api 3.11, and GCC 4.3.3.

A recently-invented third alternative, which I haven’t tried, is the “official” cross-hosted Mingw build tool scripts, which are available on the Mingw Sourceforge page.

2. A native pkgconfig for your cross-compiled libraries

Cross-compiled libraries for Mingw32 will put their pkgconfig “.pc” files in /usr/local/i586-mingw32/lib/pkgconfig. In order for configure scripts targetting Mingw32 to find them, you’ll need a “cross-pkgconfig” — but one which, like a cross-compiler, is built for the build platform, not the target platform. If it’s named using the target prefix, as if it were part of the cross-compiler — i.e., in our case, i586-mingw32-pkgconfig — configure scripts will use it to determine which cross-compiled libraries are present.

Configure pkgconfig 0.23 with:

--with-pc-path=/usr/local/i586-mingw32/lib/pkgconfig --program-prefix=i586-mingw32-
(yes, that ends with a hyphen).

3. Cross-compiled versions of all the libraries you use

How hard it is to arrange for these, depends a lot on each individual library. In theory all you should need to do is configure the library with “--prefix=/usr/local/i586-mingw32 --host=i586-mingw32”, but in practice very few libraries do the Right Thing with that alone. (Honourable mentions here go to libxml2 and taglib.)

Other things you might have to do to more recalcitrant libraries include: setting CC=i586-mingw32-gcc (and sometimes CXX, AR and/or RANLIB similarly); disabling parts of the library (for libcdio use

--disable-joliet --disable-example-progs --without-iso-info --without-iso-read --without-cd-info --without-cd-drive --without-cd-read
to disable all the example programs) — or, if the worst comes to the worst, actually patching out bits of the library. I had to do that to make taglib compile as a static library.

Boost, as usual, presents the most challenging fight you’ll have with a build system. Here, without further commentary, is the Makefile snippet needed to cross-compile Boost for Mingw32; $(BUILD) is the build directory and $(PREFIX) is where to install the result — /usr/local would match a toolchain built as described above:

cross-mingw32-boost:
        mkdir -p $(BUILD)/cross-mingw32-boost
        tar xjf boost-*bz2 -C $(BUILD)/cross-mingw32-boost
        cd $(BUILD)/cross-mingw32-boost/* \
                && ./bootstrap.sh --prefix=$(PREFIX)/i586-mingw32 \
                        --libdir=$(PREFIX)/i586-mingw32/lib \
                        --includedir=$(PREFIX)/i586-mingw32/include \
                && echo \
"using gcc : : i586-mingw32-g++ : <compileflags>-mthreads <linkflags>-mthreads ;" \
              > jtl-config.jam \
                && ./bjam -q --layout=system variant=release \
                        link=static threading=multi --without-iostreams \
                        -sGXX=i586-mingw32-g++ --without-python \
                        threadapi=win32 --user-config=jtl-config.jam \
                && sudo rm -rf $(PREFIX)/i586-mingw32/include/boost \
                && sudo ./bjam -q --layout=system variant=release \
                        link=static threading=multi --without-iostreams \
                        -sGXX=i586-mingw32-g++ --without-python \
                        threadapi=win32 --user-config=jtl-config.jam install
        for i in $(PREFIX)/i586-mingw32/lib/libboost_*.a ; do \
                sudo i586-mingw32-ranlib $$i ; \
        done
        rm -rf $(BUILD)/cross-mingw32-boost

Again, if you’d like a checklist of successes I’ve had here, then with greater or lesser effort it’s proved possible to make cross-compiled versions of zlib 1.2.3, Boost 1.39.0, libcdio 0.80, taglib 1.5, and libxml2 2.6.30.

4. Wine Is Not an Emulator

Configure and install Wine 1.0.1. Admirably, this just works out of the box, though if your Linux box is 64-bit, you’ll need the 32-bit versions of its dependent libraries installed.

Having got this far, you should have all you need to compile, link, and test Windows programs. Of course, they do have to be Windows programs; Mingw is not Cygwin, and your program needs to be compatible with real, proper Windows including <windows.h>, WSAEventSelect, CreateWindowEx and all that jazz — plus, of course, the Windows text-encoding and file-naming rules.

Indeed, depending on how your project’s unit-tests are set up, you can probably run most of them, too, under Wine. Just arrange for them to be invoked using Wine: instead of “run-test various-args”, execute instead “wine run-test various-args”. In some situations, this alone would justify the effort of setting up a cross-development environment: the ability to know, before checking in code on Linux, that it passes all its tests both on Linux and Windows.

5. Qt

Trolltech’s, now Nokia’s, Qt framework has for a while been a really good way of writing open-source Linux GUI applications without getting bogged down in X11 or other unhelpful toolkits. Originally Qt was only available for free on the X11 platform, but subsequently the Windows (and even MacOS) versions were also made available to the free software community, and more recently still have been relicensed under the GNU LGPL. This makes it not only a good way of writing open-source Linux applications, but both open-source and proprietary applications on Linux and Windows (and, again, MacOS too).

So it would be handy if a cross-compiled version of Qt could be used to write and test Windows versions of Linux Qt applications using Wine. The problem is, Qt’s sources are huge — twice the size of KOffice, three times the size of Firefox, five times the size of GLib+GTK put together — which is enough to put a fellow off trying to cross-compile it.

But fortunately, Trolltech supply a binary installer for the Windows development libraries — an installer which works under Wine. So, download qt-win-opensource-x.y.z.exe and run it under Wine. Pick an installation directory (for instance, /usr/local/i586-mingw32/qt — or, in Mingw-speak, Z:\usr\local\i586-mingw32\qt), and let it install. When it asks for a Mingw installation, give it your cross-compiler’s prefix directory (e.g. /usr/local/i586-mingw32); it’ll moan, but let you ignore the moaning and install anyway (do so).

You then need to arrange for Qt’s pkgconfig files to be available to the cross-compiler. The Win32 installation of Qt doesn’t have pkgconfig files, but you can modify the ones from a native Linux installation of the same version of Qt. To do this, issue the following commands (as root):

# cd /usr/lib/pkgconfig     (Or wherever your existing QtCore.pc is)
# for i in Qt*.pc ; do sed \
   -e 's,^prefix=.*,prefix=/usr/local/i586-mingw32/qt,' \
   -e 's,-I.*/include,-I/usr/local/i586-mingw32/qt/include,' \
   -e 's,-l[^ ]*,&4,' \
 < $i > /usr/local/i586-mingw32/lib/pkgconfig/$i ; done
The three sed replacements fix up the prefix= lines in the .pc files, then fix up stray -I directives in the CFLAGS lines that don’t use the defined prefix, then finally take account of the extra versioning present in the Windows filenames (instead of QtCore.lib, Windows has QtCore4.lib, and similarly across the whole framework).

(It was getting a development version of Chorale’s Qt GUI more-or-less up under Wine, and thus bringing Win32 into its sights for the first time, that prompted the writing of this blog post.)

6. The Promised Land

So there you (hopefully, by now) have it. A well-behaved program, such as Chorale, should mostly configure, build, unit-test, and run its Windows version straightforwardly on a Linux box.

Naturally, you still need to use a real Windows box for final testing — not everything that can happen on a real Windows box can be modelled inside Wine, and nor is everything necessarily as compatible between different Windows versions as you’d hope. But by marginalising Windows out of its own development process until as late as possible, the rest of the development can be eased, indeed accelerated, by only having to develop on a single platform.

Tuesday 26 May 2009

PathCanonicalize Versus What It Says On The Tin

This post contains MathML mark-up, which may not display correctly on some browsers. (And if you’re wondering how to do MathML in Blogger, see here.)

Here’s what canonicalisation is. You’ve got a set of items, and you can test those items for equality, x = y, but what you actually want to do is test for equivalence, x ≈ y — where equality implies equivalence, but equivalence doesn’t imply equality. What’s needed is a canonicalisation function which maps the equivalence relation onto the equality relation, by mapping all (of each subset of) equivalent items onto a single representative item: a function f such that f(x) = f(y) ⇔ x ≈ y (plus you want canonicalisation to be idempotent: f(f(x)) = f(x)).

More concretely, suppose you’re keeping a database of disk files, perhaps to enable searching or browsing. The question is, when do two filenames refer to the same file? You can’t just test them for string equality, as two distinct names might refer to the same file: on Unix, /home/peter/foo and /home/peter/src/../foo are the same file. Plus, symbolic links can be used to make even unrelated-looking names refer to the same file. If your database lists the file by one name, and someone does a query looking for another name, it won’t be found — and worse, the file could get into the database several times under different names, perhaps with conflicting or stale information.

But fortunately, “..”, “.”, and symlinks between them are about the size of it for Unix ways of obscuring the naming of a file, and the standard library comes with a suitable canonicalisation function that will reduce elaborated forms into the single unambiguous original pathname. (In fact, the GNU C library comes with two such functions, as the more-portable one, realpath(), needs a little care in use in order to avoid buffer-overrun attacks; the GNU one, canonicalize_file_name(), does not.) So you canonicalise all filenames as you store them into your database, and make sure to canonicalise all filenames in queries before you look them up, and you’ll get the right matches.

And then eventually you’re going to want to port that software to Windows — whereupon you’ve got a problem.

Indeed, you’ve got a whole little family of problems, because even once you’ve navigated the treacherous waters of the textual encoding of filenames under Win32, there still remain a bewildering variety of ways to refer to the same file:

D:\Music\Fools Gold.flac                       — Probably canonical
D:/Music/Fools Gold.flac                       — Slash versus backslash
D:\MUSIC\Fools Gold.flac                       — Case-insensitive per locale
D:\Music\FOOLSG~1.FLA                          — MS-DOS 8.3
M:\Fools Gold.flac                             — After “subst M: D:\Music”
\Device\HarddiskVolume2\Music\Fools Gold.flac  — If D: is local
\\server\share\Music\Fools Gold.flac           — If D: is a network drive
\\?\UNC\server\share\Music\Fools Gold.flac     — Or like this
\\?\D:\Music\Fools Gold.flac                   — Ultra-long-filenames mode
\\.\D:\Music\Fools Gold.flac                   — Device namespace
\\?\UNC\D:\Music\Fools Gold.flac               — Allegedly
\\?\Volume{GUID}\Music\Fools Gold.flac         — Crikey

This whole dismal farrago really calls for a path canonicalisation function. Which is why it’s unfortunate that there isn’t one, and doubly unfortunate that there’s a function called PathCanonicalize() that particularly isn’t one, and not just because it’s spelled with a “Z”. All that PathCanonicalize() does is remove “/../” and “/./” substrings — it’s a purely textual transformation and doesn’t even touch the filesystem. It certainly doesn’t satisfy the “canonicaliser condition”:

f(x) = f(y) ⇔ x and y are the same file

No, there’s no shortcut: it has to be done laboriously, with real filesystem calls rather than textual transformation (and with lots of unit tests to cover all those ridiculous cases). The plan is: use GetFullPathName() to turn relative paths into absolute; then repeatedly call QueryDosDevice() to unwind subst’d drive letters; then call GetLongPathName() to get rid of 8.3-ness and canonicalise case; and then finally, if GetDriveType() says it’s remote, use WNetGetConnection() to convert the remaining drive letter into a UNC path.

std::string Canonicalise(const std::string& path)
{
    std::wstring utf16 = UTF8ToUTF16(path);

    wchar_t canon[MAX_PATH];

    /** Note that PathCanonicalize does NOT do what we want here -- it's a
     * purely textual operation that eliminates /./ and /../ only.
     */
    DWORD rc = ::GetFullPathNameW(utf16.c_str(), MAX_PATH, canon, NULL);
    if (!rc)
        return path;

    utf16 = canon;

    if (utf16.length() >= 6)
    {
        /** Get rid of \\?\ and \\.\ prefixes on drive-letter paths */
        if (!wcsncmp(utf16.c_str(), L"\\\\?\\", 4) && utf16[5] == L':')
            utf16.erase(0,4);
        else if (!wcsncmp(utf16.c_str(), L"\\\\.\\", 4) && utf16[5] == L':')
            utf16.erase(0,4);
    }

    if (utf16.length() >= 11)  // we inspect utf16[10] below
    {
        /** Get rid of \\?\UNC on drive-letter and UNC paths */
        if (!wcsncmp(utf16.c_str(), L"\\\\?\\UNC\\", 8))
        {
            if (utf16[9] == L':' && utf16[10] == L'\\')
                utf16.erase(0,8);
            else
            {
                utf16.erase(0,7);
                utf16 = L"\\" + utf16;
            }
        }
    }

    /** Anything other than UNC and drive-letter is something we don't
     * understand
     */
    if (utf16[0] == L'\\' && utf16[1] == L'\\')
    {
        if (utf16[2] == '?' || utf16[2] == '.')
            return path; // Not understood

        /** OK -- UNC */
    }
    else if (((utf16[0] >= 'A' && utf16[0] <= 'Z')
              || (utf16[0] >= 'a' && utf16[0] <= 'z'))
             && utf16[1] == ':')
    {
        /** OK -- drive letter -- unwind subst'ing */
        for (;;)
        {
            wchar_t drive[3];
            drive[0] = (wchar_t)toupper(utf16[0]);
            drive[1] = L':';
            drive[2] = L'\0';
            canon[0] = L'\0';
            rc = ::QueryDosDeviceW(drive, canon, MAX_PATH);
            if (!rc)
                break;
            if (!wcsncmp(canon, L"\\??\\", 4))
            {
                utf16 = std::wstring(canon+4) + std::wstring(utf16, 2);
            }
            else // Not subst'd
                break;
        }

        wchar_t drive[4];
        drive[0] = (wchar_t)toupper(utf16[0]);
        drive[1] = ':';
        drive[2] = '\\';
        drive[3] = '\0';

        rc = ::GetDriveTypeW(drive);

        if (rc == DRIVE_REMOTE)
        {
            DWORD bufsize = MAX_PATH;

            /* QueryDosDevice and WNetGetConnection FORBID the
             * trailing slash; GetDriveType REQUIRES it.
             */
            drive[2] = '\0';

            rc = ::WNetGetConnectionW(drive, canon, &bufsize);
            if (rc == NO_ERROR)
                utf16 = std::wstring(canon) + std::wstring(utf16, 2);
        }
    }
    else
    {
        // Not understood
        return path;
    }

    /** Canonicalise case and 8.3-ness */
    rc = ::GetLongPathNameW(utf16.c_str(), canon, MAX_PATH);
    if (!rc)
        return path;

    std::string utf8 = UTF16ToUTF8(canon);
    std::replace(utf8.begin(), utf8.end(), '\\', '/');
    return utf8;
}

There are still ways to fool this function: for instance, by exporting a directory as \\server\share1 and a subdirectory of it as \\server\share2 — the client has no way of matching them up. But that’s a pretty pathological case, and it could be easily argued that it’s something you’d never do unless you wanted the shares to appear to be distinct. More seriously, the server for a network drive can be specified by WINS name, by FQDN or by IP address; neither canonicalise-to-IP nor canonicalise-to-FQDN is the Right Thing in all cases. For now I’m sweeping that issue under the carpet.

The one remaining wrinkle is that, unlike earlier versions of Windows, Vista allows “genuine” Unix-like symbolic links. Without doing real testing on Vista, it’s hard to make out from the documentation how the APIs used above behave when faced with such symbolic links. It’s even possible that the new-in-Vista GetFinalPathNameByHandle() call is the answer to all these problems; in which case, this code gets demoted to merely the way to do it on pre-Vista versions.

About Me

Cambridge, United Kingdom
Waits for audience applause ... not a sossinge.
CC0 To the extent possible under law, the author of this work has waived all copyright and related or neighboring rights to this work.