How small can you make a library?
Of course, one way to make it smaller is to make each individual file smaller: by using an optimising compiler, by turning off exceptions, by removing debug information (in release builds). But even once you’ve done that, there’s extra overhead involved in just being a library: how can that be minimised?
One choice that affects the answer is whether you want a shared or a static library. Sometimes other considerations force the answer, but if you have the luxury of being able to choose either type, which should you choose for the most compact result?
Both types have their advantages. Static libraries are useful when the client, or most clients, use only a fraction of the library's facilities: unused objects from the archive simply aren't linked. By contrast, client code that uses any one facility from a shared library must link all of it. And the position-independent code (PIC) techniques needed to build a shared library may be more expensive than normal code on some architectures. On the other hand, shared libraries offer control over symbol visibility, so an internally-complex library with a simple interface can end up simple rather than complex at link-time. And, because a shared library is all-or-nothing anyway, it can be built as a single translation unit with no loss of generality, which enables better compiler optimisation.
So I tried it. I took libupnpd from Chorale, which is a small-to-medium sized library although with very few entry points, and tried to make it as small as possible while retaining its identity as a separate library. Here are the sizes it came out as, in the various attempts, as reported by size(1):
Build | amd64-linux text | amd64-linux data | amd64-linux bss | amd64-linux total | amd64-linux vs. static | amd64-linux vs. shared | arm-linux text | arm-linux data | arm-linux bss | arm-linux total | arm-linux vs. static | arm-linux vs. shared
---|---|---|---|---|---|---|---|---|---|---|---|---
Static library | 161443 | 8 | 94 | 161545 | 0.0% | -23.9% | 140176 | 4 | 872 | 141052 | 0.0% | -24.1%
Shared library | 205846 | 6216 | 112 | 212174 | +31.3% | 0.0% | 177469 | 7968 | 412 | 185849 | +31.8% | 0.0%
Shared library, sec | 205844 | 6216 | 80 | 212140 | +31.3% | 0.0% | 177421 | 7960 | 248 | 185629 | +31.6% | -0.1%
Shared library, vis | 144726 | 5328 | 112 | 150166 | -7.0% | -29.2% | 120012 | 7472 | 412 | 127896 | -9.3% | -31.2%
Shared library, sec vis | 144078 | 5328 | 80 | 149486 | -7.5% | -29.5% | 119452 | 7472 | 248 | 127172 | -9.8% | -31.6%
Shared library, whole sec vis | 133582 | 5312 | 80 | 138974 | -14.0% | -34.5% | 107623 | 7468 | 412 | 115503 | -18.1% | -37.9%
Single object, whole | 101011 | 3112 | 94 | 104217 | -35.5% | -50.9% | 85112 | 6688 | 380 | 92180 | -34.6% | -50.4%
Single object, whole sec | 100638 | 3112 | 94 | 103844 | -35.7% | -51.1% | 85112 | 6688 | 380 | 92180 | -34.6% | -50.4%
Single object, whole sec vis | 99775 | 3112 | 94 | 102981 | -36.2% | -51.5% | 85120 | 6688 | 380 | 92188 | -34.6% | -50.4%
Single object, whole sec vis wp | 99287 | 3112 | 94 | 102493 | -36.6% | -51.7% | 85028 | 6688 | 380 | 92096 | -34.7% | -50.5%
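Each architecture has two percentage columns. Judging by the arithmetic, the first compares the row's total with the static-library total and the second with the plain shared-library total. A minimal sketch of that calculation follows; the function names are made up for illustration, and only the totals come from the table.

```cpp
#include <cassert>
#include <cmath>

// Percentage by which `total` differs from `baseline`.
// Negative means smaller than the baseline. (Hypothetical helper name.)
double RelativePercent(long total, long baseline) {
    assert(baseline > 0);
    return 100.0 * (static_cast<double>(total) - baseline) / baseline;
}

// Round to one decimal place, as the table does. (Hypothetical helper name.)
double RoundedToTenth(double percent) {
    return std::round(percent * 10.0) / 10.0;
}

// amd64-linux totals copied from the table above.
const long kStaticTotal = 161545;        // Static library
const long kSharedTotal = 212174;        // Shared library
const long kSingleObjectTotal = 104217;  // Single object, whole
```

With these totals, `RelativePercent(kSharedTotal, kStaticTotal)` comes out at roughly +31.3 and `RelativePercent(kStaticTotal, kSharedTotal)` at roughly -23.9, which matches the first two rows.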
Some explanation is, of course, in order.
- 1. Static library
- Each file compiled separately, archived with ar.
- 2. Shared library
- Standard ELF shared library: each file compiled separately with -fPIC, linked with gcc -shared. (Or, in fact, using libtool for simplicity. But that’s what libtool was doing behind the scenes.)
- 3. Shared library, sec
- As (2), but compiled also with -ffunction-sections -fdata-sections, and linked with --gc-sections. This made approximately one gnat’s crotchet of difference.
- 4. Shared library, vis
- As (2), but compiled also with -fvisibility=hidden -fvisibility-inlines-hidden, and with the small number of entry points labelled __attribute__((visibility("default"))). This should indeed have been much better than (2) or (3), but it came as a surprise to me that it’s also better than (1).
- 5. Shared library, sec vis
- As (2) but with both the (3) and (4) optimisations applied. Again, --gc-sections makes a non-zero but insignificant difference.
- 6. Shared library, whole sec vis
OK, now we’re getting somewhere. This is as (5), but with the whole library built as a single translation unit. It’s the moral equivalent of GCC’s -combine, but as that only applies to C and this is C++, it’s done by writing a small .cpp file that does nothing but #include all the other .cpp files. (Doing this requires a certain discipline in the files in question, so as not to queer the pitch for subsequent code. But it’s certainly not an intolerable imposition.)
In the interests of scrupulous accuracy, I should point out that in fact not the whole of the library is compiled in the one file. One of the component parts needs special compiler options, so that one’s still compiled separately.
These settings correspond to what you get if you configure and build KDE3 with --enable-final.
- 7. Single object, whole
- This is like the non-shared-library version of (6); the whole library is compiled in a single translation unit, but this time into an ordinary, non-PIC, object file, which is then put in a library by itself.
- 8. Single object, whole sec
- As (7), but in little sections as per (3). This doesn’t make much difference on amd64-linux, and none at all on arm-linux.
- 9. Single object, whole sec vis
- As (8) but with the visibility settings too. As those settings are only meant to apply to shared libraries, this shouldn’t have made a difference. But for some reason, it did, albeit a tiny one.
- 10. Single object, whole sec vis wp
As (9) but with the single library translation unit compiled with -fwhole-program, and the exported functions labelled __attribute__((externally_visible)).
Semantically and philosophically, this is saying much the same thing as the visibility settings. But there must be some extra little bit of optimisation the compiler can do in this situation. There isn’t a corresponding statistic for a shared library with -fwhole-program, as using that option disables PIC.
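For the "vis" rows, labelling the entry points might look something like the sketch below. It assumes GCC or Clang. The macro name `UPNPD_API` and both functions are invented for illustration and are not taken from libupnpd.

```cpp
// Intended to be compiled with something like:
//   g++ -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -shared ...
// (the flags are the ones named in the text; the rest of the command is an assumption).

#if defined(__GNUC__)
#  define UPNPD_API __attribute__((visibility("default")))  // hypothetical macro name
#else
#  define UPNPD_API  // other compilers: no-op in this sketch
#endif

namespace detail {

// Internal helper. Under -fvisibility=hidden it gets hidden visibility, so it
// stays out of the shared library's dynamic symbol table and the compiler is
// freer to inline or drop it.
int MixByte(int acc, unsigned char byte) {
    return (acc * 31 + byte) & 0xffff;
}

}  // namespace detail

// One of the "small number of entry points": explicitly marked as exported.
UPNPD_API int upnpd_checksum(const unsigned char* data, int len) {
    int acc = 0;
    for (int i = 0; i < len; ++i) {
        acc = detail::MixByte(acc, data[i]);
    }
    return acc;
}
```

Without the compiler flags the attribute changes nothing, because default visibility is what every symbol gets anyway. The saving comes from everything that is not marked.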
Why size(1) Isn’t The Full Story
Especially considering the library is C++, some of the numbers above need to be taken with a slight pinch of salt. Most C++ programs have a lot of functions declared in header files; functions which are semantically inline but which, for various reasons (such as being listed in a vtable) aren’t actually purely inlined in practice. Such functions are emitted in every object file, in “link-once” sections, which are deduplicated by the final linker and emitted only once each.
The size(1) command counts these link-once sections under “text”, and doesn’t do any deduplication. So a static library made of many object files will often have a larger size (as quoted by size(1)) than it will actually add to the final binary — because of the multiple copies of all the link-once sections. So row (1) of the table is somewhat inflated — and in fact the “single object” rows are very slightly inflated, too, as although they only contain one each of the “link-once” sections, those will still get deduplicated again against any that are also used elsewhere in the final binary. In shared libraries, on the other hand, link-once sections are deduplicated when the shared library itself is linked, but can’t be deduplicated again at dynamic link time. So those numbers really do represent the amount that gets added to the final application footprint.
Interestingly (and fortunately, seeing as libupnpd isn’t actually the whole program), using -fwhole-program still emits the link-once sections, so they do get deduplicated against the final binary.
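A minimal sketch of the kind of function this section is about, with invented class names. The virtual member functions are defined inside the class, so they are implicitly inline. Because their addresses are needed for the vtable, GCC and Clang on ELF targets typically emit an out-of-line copy of each as a weak symbol in its own link-once (COMDAT) section, in every object file that emits the vtable.

```cpp
#include <string>

// Imagine this part living in a header that many .cpp files include.
struct Service {
    virtual ~Service() {}
    // Semantically inline, but listed in the vtable, so an out-of-line copy
    // is still emitted.
    virtual std::string Name() const { return "service"; }
};

struct ContentDirectory : Service {
    std::string Name() const override { return "ContentDirectory"; }
};

// Calling through a base-class reference forces dispatch via the vtable.
std::string Describe(const Service& service) {
    return service.Name();
}
```

Compiling a file like this to an object file and running `nm -C` on it would typically show the in-class functions marked `W` (weak). Those are the symbols the final link deduplicates.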
Why, If You Really Want Small, You Should Ignore All This
As the massive leap in density between the shared-library and single-object numbers above shows, the compiler can do a lot better the more of your program it sees at once. So for the very smallest binaries, take this principle and turn it up to eleven: abandon (for release builds) the idea of using separate libraries at all, and compile your entire program and all its “library” code as a single translation unit.
Not that that’s a good idea if the library code itself is part of the product: if you don’t install the library, no other clients can link against it, and if you install the library but your own program doesn’t use it, RAM is wasted when other processes load the library’s separate copies of all that code. But for “libraries” tightly wedded to a single binary, and especially for embedded systems (where typically there are no other processes), it does get you better code size than even the best options above.