Not in fact any relation to the famous large Greek meal of the same name.

Thursday 16 July 2009

Little Libraries

How small can you make a library?

Of course, one way to make it smaller is to make each individual file smaller: by using an optimising compiler, by turning off exceptions, by removing debug information (in release builds). But even once you’ve done that, there’s extra overhead involved in just being a library: how can that be minimised?

One choice that affects the answer is whether you want a shared or a static library. Sometimes other considerations force the answer, but if you have the luxury of being able to choose either type, which do you choose for best compactness?

Both types have their advantages. Static libraries are useful when the client, or most clients, use only a fraction of the library’s facilities: unused objects from the archive simply aren’t linked. By contrast, client code that uses any one facility from a shared library must link all of it. And the position-independent code (PIC) techniques needed to build a shared library may be more expensive than normal code on some architectures. On the other hand, shared libraries offer control over symbol visibility, so an internally-complex library with a simple interface can end up simple rather than complex at link-time. And, because a shared library is all-or-nothing anyway, it can be built as a single translation unit with no loss of generality, which enables better compiler optimisation.

So I tried it. I took libupnpd from Chorale, which is a small-to-medium sized library although with very few entry points, and tried to make it as small as possible while retaining its identity as a separate library. Here are the sizes it came out as, in the various attempts, as reported by size(1):

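amd64-linux

| Build | text | data | bss | total | relative to static | relative to shared |
|---|---:|---:|---:|---:|---:|---:|
| Static library | 161443 | 8 | 94 | 161545 | 0.0% | -23.9% |
| Shared library | 205846 | 6216 | 112 | 212174 | +31.3% | 0.0% |
| Shared library, sec | 205844 | 6216 | 80 | 212140 | +31.3% | 0.0% |
| Shared library, vis | 144726 | 5328 | 112 | 150166 | -7.0% | -29.2% |
| Shared library, sec vis | 144078 | 5328 | 80 | 149486 | -7.5% | -29.5% |
| Shared library, whole sec vis | 133582 | 5312 | 80 | 138974 | -14.0% | -34.5% |
| Single object, whole | 101011 | 3112 | 94 | 104217 | -35.5% | -50.9% |
| Single object, whole sec | 100638 | 3112 | 94 | 103844 | -35.7% | -51.1% |
| Single object, whole sec vis | 99775 | 3112 | 94 | 102981 | -36.2% | -51.5% |
| Single object, whole sec vis wp | 99287 | 3112 | 94 | 102493 | -36.6% | -51.7% |

arm-linux

| Build | text | data | bss | total | relative to static | relative to shared |
|---|---:|---:|---:|---:|---:|---:|
| Static library | 140176 | 4 | 872 | 141052 | 0.0% | -24.1% |
| Shared library | 177469 | 7968 | 412 | 185849 | +31.8% | 0.0% |
| Shared library, sec | 177421 | 7960 | 248 | 185629 | +31.6% | -0.1% |
| Shared library, vis | 120012 | 7472 | 412 | 127896 | -9.3% | -31.2% |
| Shared library, sec vis | 119452 | 7472 | 248 | 127172 | -9.8% | -31.6% |
| Shared library, whole sec vis | 107623 | 7468 | 412 | 115503 | -18.1% | -37.9% |
| Single object, whole | 85112 | 6688 | 380 | 92180 | -34.6% | -50.4% |
| Single object, whole sec | 85112 | 6688 | 380 | 92180 | -34.6% | -50.4% |
| Single object, whole sec vis | 85120 | 6688 | 380 | 92188 | -34.6% | -50.4% |
| Single object, whole sec vis wp | 85028 | 6688 | 380 | 92096 | -34.7% | -50.5% |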
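For anyone wanting to check the arithmetic, the two “relative” figures in each row appear to be the row’s total compared against the plain static-library total and the plain shared-library total respectively. Here is a minimal sketch of that calculation using the amd64-linux totals; the names relative_percent and print_relative are mine, purely for illustration.

```cpp
#include <cstdio>

// Totals as reported by size(1) on amd64-linux, copied from the table.
constexpr double kStaticTotal = 161545;  // plain static library
constexpr double kSharedTotal = 212174;  // plain shared library

// Percentage change of `total` relative to `baseline`;
// a negative result means "smaller than the baseline".
constexpr double relative_percent(double total, double baseline)
{
    return (total - baseline) / baseline * 100.0;
}

// Hypothetical helper: print one row's two "relative" figures,
// first against the static library, then against the shared one.
void print_relative(const char *label, double total)
{
    std::printf("%-32s %+6.1f%% %+6.1f%%\n", label,
                relative_percent(total, kStaticTotal),
                relative_percent(total, kSharedTotal));
}
```

Calling print_relative("Shared library, vis", 150166) should give figures that round to the ones quoted in that row.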

Some explanation is, of course, in order.

1. Static library
Each file compiled separately, archived with ar.
2. Shared library
Standard ELF shared library: each file compiled separately with -fPIC, linked with gcc -shared. (Or, in fact, using libtool for simplicity. But that’s what libtool was doing behind the scenes.)
3. Shared library, sec
As (2), but compiled also with -ffunction-sections -fdata-sections, and linked with --gc-sections. This made approximately one gnat’s crotchet of difference.
4. Shared library, vis
As (2), but compiled also with -fvisibility=hidden -fvisibility-inlines-hidden, and with the small number of entry points labelled __attribute__((visibility("default"))). This should indeed have been much better than (2) or (3), but it came as a surprise to me that it’s also better than (1).
5. Shared library, sec vis
As (2) but with both the (3) and (4) optimisations applied. Again, --gc-sections makes a non-zero but insignificant difference.
6. Shared library, whole sec vis

OK, now we’re getting somewhere. This is as (5), but with the whole library built as a single translation unit. It’s the moral equivalent of GCC’s -combine, but as that only applies to C and this is C++, it’s done by writing a small .cpp file that does nothing but #include all the other .cpp files. (Doing this requires a certain discipline in the files in question, so as not to queer the pitch for subsequent code. But it’s certainly not an intolerable imposition.)

In the interests of scrupulous accuracy, I should point out that in fact not the whole of the library is compiled in the one file. One of the component parts needs special compiler options, so that one’s still compiled separately.

These settings correspond to what you get if you configure and build KDE3 with --enable-final.

7. Single object, whole
This is like the non-shared-library version of (6); the whole library is compiled in a single translation unit, but this time into an ordinary, non-PIC, object file, which is then put in a library by itself.
8. Single object, whole sec
As (7), but in little sections as per (3). This doesn’t make much difference on amd64-linux, and none at all on arm-linux.
9. Single object, whole sec vis
As (8) but with the visibility settings too. As those settings are only meant to apply to shared libraries, this shouldn’t have made a difference. But for some reason, it did, albeit a tiny one.
10. Single object, whole sec vis wp

As (9) but with the single library translation unit compiled with -fwhole-program, and the exported functions labelled __attribute__((externally_visible)).

Semantically and philosophically, this is saying much the same thing as the visibility settings. But there must be some extra little bit of optimisation the compiler can do in this situation. There isn’t a corresponding statistic for a shared library with -fwhole-program, as using that option disables PIC.
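To make the labelling in (4) and (10) concrete, here is a rough sketch of how the entry points might be marked. It is not lifted from Chorale: the macro names UPNPD_EXPORT and UPNPD_WHOLE_PROGRAM, the namespace, and both functions are invented for the example. Only the two attributes and the compiler flags mentioned in the comments come from the descriptions above.

```cpp
// Hypothetical export macro (names are assumptions, not Chorale's).
// For the "vis" builds, compile with:
//   -fvisibility=hidden -fvisibility-inlines-hidden
// For the "wp" build, compile the single translation unit with:
//   -fwhole-program -DUPNPD_WHOLE_PROGRAM   (the -D is this sketch's own switch)
#if defined(UPNPD_WHOLE_PROGRAM)
#  define UPNPD_EXPORT __attribute__((externally_visible))
#else
#  define UPNPD_EXPORT __attribute__((visibility("default")))
#endif

namespace upnpd_sketch {

// Internal helper, deliberately unlabelled: under -fvisibility=hidden it is
// not exported, so the compiler and linker are freer to inline or drop it.
// Returns the port number, or -1 if the text is not a valid decimal port.
int parse_port(const char *text)
{
    if (*text == '\0')
        return -1;
    int port = 0;
    for (; *text != '\0'; ++text) {
        if (*text < '0' || *text > '9')
            return -1;
        port = port * 10 + (*text - '0');
        if (port > 65535)
            return -1;
    }
    return port;
}

// One of the "small number of entry points": explicitly labelled so that it
// stays visible to clients whichever way the library is built.
UPNPD_EXPORT bool is_valid_port(const char *text)
{
    return parse_port(text) > 0;
}

} // namespace upnpd_sketch
```

The point of routing both attributes through one macro is that the source stays the same across the shared-library and whole-program builds; only the compiler flags change.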

The effects of these various optimisations are remarkably similar, in relative terms, across both architectures.
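As for the “certain discipline” that the single-translation-unit builds in (6) and (7) demand: the main thing, in my experience, is that file-local names from different .cpp files must not collide once everything is pasted into one file. The sketch below imitates that by putting the contents of two imaginary source files (ssdp.cpp and soap.cpp, names invented for the example) one after the other, which is what the small all-including .cpp file effectively does.

```cpp
#include <string>

// In the real thing, the combined file would contain nothing but lines like
//   #include "ssdp.cpp"
//   #include "soap.cpp"
// Here the two (hypothetical) files are written out inline instead.

// ---- what ssdp.cpp might contain ----
namespace ssdp {
namespace {
// File-local constant. Nesting the anonymous namespace inside a named one
// keeps it distinct from the same name in other files after combining.
const char *const kProtocolName = "ssdp";
}

std::string describe()
{
    return std::string("service:") + kProtocolName;
}
} // namespace ssdp

// ---- what soap.cpp might contain ----
namespace soap {
namespace {
// Same identifier as in ssdp.cpp; harmless here, but it would be a
// redefinition error if both lived in a global anonymous namespace.
const char *const kProtocolName = "soap";
}

std::string describe()
{
    return std::string("service:") + kProtocolName;
}
} // namespace soap
```

The same care applies to macros, using-directives and static helper functions: anything that was private to one file by virtue of being in a separate translation unit is no longer private once the files are combined.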

Why size(1) Isn’t The Full Story

Especially considering the library is C++, some of the numbers above need to be taken with a slight pinch of salt. Most C++ programs have a lot of functions declared in header files; functions which are semantically inline but which, for various reasons (such as being listed in a vtable) aren’t actually purely inlined in practice. Such functions are emitted in every object file, in “link-once” sections, which are deduplicated by the final linker and emitted only once each.
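A small example of the sort of function in question, with class names invented for illustration. The function bodies are written in the class definition, so they are implicitly inline, but because they are virtual their addresses are needed for the vtable and the compiler will usually emit real out-of-line copies in each object file that needs them.

```cpp
// Imagine this class definition living in a header that many .cpp files include.
class Handler {
public:
    virtual ~Handler() {}

    // Implicitly inline (defined in-class), yet virtual: its address goes
    // in the vtable, so an out-of-line copy typically gets emitted anyway.
    virtual int priority() const { return 0; }
};

class UrgentHandler : public Handler {
public:
    int priority() const override { return 10; }
};

// A call through a base-class reference generally has to go via the vtable,
// since the compiler cannot in general know the dynamic type here.
int priority_of(const Handler &handler)
{
    return handler.priority();
}
```

On ELF targets with GCC, I would expect such functions to show up as weak symbols when the object file is inspected with nm, though that is worth checking on your own toolchain rather than taking on trust.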

The size(1) command counts these link-once sections under “text”, and doesn’t do any deduplication. So a static library made of many object files will often have a larger size (as quoted by size(1)) than it will actually add to the final binary — because of the multiple copies of all the link-once sections. So row (1) of the table is somewhat inflated — and in fact the “single object” rows are very slightly inflated, too, as although they only contain one each of the “link-once” sections, those will still get deduplicated again against any that are also used elsewhere in the final binary. In shared libraries, on the other hand, link-once sections are deduplicated when the shared library itself is linked, but can’t be deduplicated again at dynamic link time. So those numbers really do represent the amount that gets added to the final application footprint.

Interestingly (and fortunately, seeing as libupnpd isn’t actually the whole program), using -fwhole-program still emits the link-once sections, so they do get deduplicated against the final binary.

Why, If You Really Want Small, You Should Ignore All This

As the massive leap in density between the shared-library and single-object numbers in the table shows, the compiler can do a lot better the more of your program it sees at once. So for the very smallest binaries, take this principle and turn it up to eleven: abandon (for release builds) the idea of using separate libraries at all, and compile your entire program and all its “library” code as a single translation unit.

Not that that’s a good idea if the library code itself is part of the product: if you don’t install the library, no other clients can link against it, and if you install the library but your program doesn’t use it, that’s wasted RAM when other processes load the library’s copies of all that code. But for “libraries” tightly wedded to a single binary, and especially for embedded systems (where typically there are no other processes), it does get you better code size than even the best options above.

CC0 To the extent possible under law, the author of this work has waived all copyright and related or neighboring rights to this work.