Whitebait, Kleftiko, Ekmek Special: July 2009

How small can you make a library?

Of course, one way to make it smaller is to make each individual file smaller: by using an optimising compiler, by turning off exceptions, by removing debug information (in release builds). But even once you’ve done that, there’s extra overhead involved in just being a library: how can that be minimised?

One choice that affects the answer is whether you want a shared or a static library. Sometimes other considerations force the decision, but if you have the luxury of being able to choose either type, which gives the most compact result?

Both types have their advantages. Static libraries are useful when the client, or most clients, use only a fraction of the library's facilities: unused objects from the archive simply aren’t linked. By contrast, client code that uses any one facility from a shared library must link all of it. And the position-independent code (PIC) techniques needed to build a shared library may be more expensive than normal code on some architectures. On the other hand, shared libraries offer control over symbol visibility, so an internally complex library with a simple interface can end up simple rather than complex at link time. And, because a shared library is all-or-nothing anyway, it can be built as a single translation unit with no loss of generality, which enables better compiler optimisation.

So I tried it. I took libupnpd from Chorale, a small-to-medium-sized library with very few entry points, and tried to make it as small as possible while retaining its identity as a separate library. Here are the sizes it came out as in the various attempts, as reported by size(1):

amd64-linux
	text	data	bss	total	vs static (1)	vs shared (2)
1. Static library	161443	8	94	161545	0.0%	-23.9%
2. Shared library	205846	6216	112	212174	+31.3%	0.0%
3. Shared library, sec	205844	6216	80	212140	+31.3%	0.0%
4. Shared library, vis	144726	5328	112	150166	-7.0%	-29.2%
5. Shared library, sec vis	144078	5328	80	149486	-7.5%	-29.5%
6. Shared library, whole sec vis	133582	5312	80	138974	-14.0%	-34.5%
7. Single object, whole	101011	3112	94	104217	-35.5%	-50.9%
8. Single object, whole sec	100638	3112	94	103844	-35.7%	-51.1%
9. Single object, whole sec vis	99775	3112	94	102981	-36.2%	-51.5%
10. Single object, whole sec vis wp	99287	3112	94	102493	-36.6%	-51.7%

arm-linux
	text	data	bss	total	vs static (1)	vs shared (2)
1. Static library	140176	4	872	141052	0.0%	-24.1%
2. Shared library	177469	7968	412	185849	+31.8%	0.0%
3. Shared library, sec	177421	7960	248	185629	+31.6%	-0.1%
4. Shared library, vis	120012	7472	412	127896	-9.3%	-31.2%
5. Shared library, sec vis	119452	7472	248	127172	-9.8%	-31.6%
6. Shared library, whole sec vis	107623	7468	412	115503	-18.1%	-37.9%
7. Single object, whole	85112	6688	380	92180	-34.6%	-50.4%
8. Single object, whole sec	85112	6688	380	92180	-34.6%	-50.4%
9. Single object, whole sec vis	85120	6688	380	92188	-34.6%	-50.4%
10. Single object, whole sec vis wp	85028	6688	380	92096	-34.7%	-50.5%
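
The two percentage columns are plain ratios of the “total” figures: the first is taken against the static library, the second against the plain shared library. A minimal sketch of that arithmetic follows; the function name relative_percent is only an illustration and has nothing to do with Chorale.

```cpp
#include <cmath>

// Percentage by which `total` differs from `baseline`, rounded to one
// decimal place as in the table. Negative means smaller than baseline.
// (Hypothetical helper, for illustration only.)
double relative_percent(unsigned long total, unsigned long baseline)
{
    double ratio = static_cast<double>(total) / static_cast<double>(baseline);
    return std::round((ratio - 1.0) * 1000.0) / 10.0;
}
```

For the amd64-linux figures, relative_percent(212174, 161545) gives the +31.3% of the plain shared library against the static one, and swapping the arguments gives the -23.9% in the other direction.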

Some explanation is, of course, in order.

1. Static library

Each file compiled separately, archived with ar.

2. Shared library

Standard ELF shared library: each file compiled separately with -fPIC, linked with gcc -shared. (Or, in fact, using libtool for simplicity. But that’s what libtool was doing behind the scenes.)

3. Shared library, sec

As (2), but compiled also with -ffunction-sections -fdata-sections, and linked with --gc-sections. This made approximately one gnat’s crotchet of difference.

4. Shared library, vis

As (2), but compiled also with -fvisibility=hidden -fvisibility-inlines-hidden, and with the small number of entry points labelled __attribute__((visibility("default"))). This should indeed have been much better than (2) or (3), but it came as a surprise to me that it’s also better than (1).
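
As a sketch of what that labelling can look like, assuming GCC or a compatible compiler: the macro name UPNPD_EXPORT and both functions below are made up for the example and aren’t taken from libupnpd.

```cpp
// Intended build flags: -fPIC -fvisibility=hidden -fvisibility-inlines-hidden
// UPNPD_EXPORT is a hypothetical macro name, not the one Chorale uses.
#if defined(__GNUC__)
#define UPNPD_EXPORT __attribute__((visibility("default")))
#else
#define UPNPD_EXPORT
#endif

// Internal helper: with -fvisibility=hidden this symbol stays out of the
// shared library's dynamic symbol table, so calls to it can be bound
// directly or inlined instead of going through the PLT.
int parse_port(const char *text)
{
    int port = 0;
    while (*text >= '0' && *text <= '9')
        port = port * 10 + (*text++ - '0');
    return port;
}

// Entry point: explicitly marked, so it remains visible to clients.
UPNPD_EXPORT int upnpd_default_port()
{
    return parse_port("8080");
}
```

Only the functions carrying the macro end up exported; everything else becomes an implementation detail that the toolchain is free to shrink.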

5. Shared library, sec vis

As (2) but with both the (3) and (4) optimisations applied. Again, --gc-sections makes a non-zero but insignificant difference.

6. Shared library, whole sec vis

OK, now we’re getting somewhere. This is as (5), but with the whole library built as a single translation unit. It’s the moral equivalent of GCC’s -combine, but as that only applies to C and this is C++, it’s done by writing a small .cpp file that does nothing but #include all the other .cpp files. (Doing this requires a certain discipline in the files in question, so as not to queer the pitch for subsequent code. But it’s certainly not an intolerable imposition.)

In the interests of scrupulous accuracy, I should point out that in fact not the whole of the library is compiled in the one file. One of the component parts needs special compiler options, so that one’s still compiled separately.
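
The discipline mostly comes down to file-scope names: once every .cpp file lands in the same translation unit, two files can no longer each define a static (or anonymous-namespace) helper with the same name and signature. Here is a sketch of one way round that, with two hypothetical source files simulated inline. In the real thing each marked section would be its own file, pulled in by #include from the umbrella .cpp; the file and function names are invented.

```cpp
#include <string>

// --- would be device.cpp (hypothetical file name) ---
namespace device_detail {
// Helper kept in a file-specific namespace rather than as a bare
// `static` function, so it cannot clash with another file's helper.
std::string describe(int id) { return "device " + std::to_string(id); }
}

std::string device_name(int id) { return device_detail::describe(id); }

// --- would be service.cpp (hypothetical file name) ---
namespace service_detail {
// Same helper name and signature as above; fine, because the namespaces
// differ. Two plain `static std::string describe(int)` definitions would
// be a redefinition error once both files share a translation unit.
std::string describe(int id) { return "service " + std::to_string(id); }
}

std::string service_name(int id) { return service_detail::describe(id); }
```

The same caution applies to using-directives and macros, which also leak from one included file into all the later ones.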

These settings correspond to what you get if you configure and build KDE3 with --enable-final.

7. Single object, whole

This is like the non-shared-library version of (6); the whole library is compiled in a single translation unit, but this time into an ordinary, non-PIC, object file, which is then put in a library by itself.

8. Single object, whole sec

As (7), but in little sections as per (3). This doesn’t make much difference on amd64-linux, and none at all on arm-linux.

9. Single object, whole sec vis

As (8) but with the visibility settings too. As those settings are only meant to apply to shared libraries, this shouldn’t have made a difference. But for some reason, it did, albeit a tiny one.

10. Single object, whole sec vis wp

As (9) but with the single library translation unit compiled with -fwhole-program, and the exported functions labelled __attribute__((externally_visible)).
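
A sketch of that labelling, assuming GCC (the attribute is GCC-specific, so it is wrapped in a macro here). UPNPD_ENTRY and both function names are hypothetical, not libupnpd’s real entry points.

```cpp
// Intended build flags (GCC): -O2 -fwhole-program
#if defined(__GNUC__) && !defined(__clang__)
#define UPNPD_ENTRY __attribute__((externally_visible))
#else
#define UPNPD_ENTRY
#endif

// With -fwhole-program, GCC treats every function in the translation unit
// as if it were static unless told otherwise, so this helper can be
// inlined or discarded freely.
int checksum(const unsigned char *data, int length)
{
    int sum = 0;
    for (int i = 0; i < length; ++i)
        sum += data[i];
    return sum;
}

// Entry point called from outside the translation unit: it must be marked,
// or -fwhole-program would be entitled to drop it.
UPNPD_ENTRY int upnpd_checksum(const char *text)
{
    int length = 0;
    while (text[length] != '\0')
        ++length;
    return checksum(reinterpret_cast<const unsigned char *>(text), length);
}
```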

Semantically and philosophically, this is saying much the same thing as the visibility settings. But there must be some extra little bit of optimisation the compiler can do in this situation. There isn’t a corresponding statistic for a shared library with -fwhole-program, as using that option disables PIC.

The effects of these various optimisations are remarkably similar, in relative terms, across both architectures.

Why size(1) Isn’t The Full Story

Especially considering the library is C++, some of the numbers above need to be taken with a slight pinch of salt. Most C++ programs have a lot of functions declared in header files: functions which are semantically inline but which, for various reasons (such as being listed in a vtable), aren’t actually purely inlined in practice. Such functions are emitted in every object file that uses them, in “link-once” sections, which are deduplicated by the final linker and emitted only once each.
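
As an illustration of such a function, here is a sketch of the kind of header content that typically ends up in link-once sections. The class names are invented for the example.

```cpp
// --- would live in a header included by many .cpp files ---
class Service
{
public:
    virtual ~Service() {}

    // Defined in the class body, so semantically inline; but because it is
    // virtual, its address goes in the vtable, and the compiler typically
    // has to emit an out-of-line copy in a link-once section of any
    // object file that emits that vtable.
    virtual int port() const { return 80; }
};

class SecureService : public Service
{
public:
    int port() const override { return 443; }
};

// Called through a base reference, so the call generally goes via the
// vtable rather than being inlined.
int port_of(const Service &service)
{
    return service.port();
}
```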

The size(1) command counts these link-once sections under “text”, and doesn’t do any deduplication. So a static library made of many object files will often report a larger size under size(1) than it actually adds to the final binary, because of the multiple copies of all the link-once sections. Row (1) of the table is therefore somewhat inflated. In fact the “single object” rows are very slightly inflated too: although they contain only one copy of each link-once section, those will still get deduplicated again against any that are also used elsewhere in the final binary. In shared libraries, on the other hand, link-once sections are deduplicated when the shared library itself is linked, but can’t be deduplicated again at dynamic link time. So those numbers really do represent the amount that gets added to the final application footprint.

Interestingly (and fortunately, seeing as libupnpd isn’t actually the whole program), using -fwhole-program still emits the link-once sections, so they do get deduplicated against the final binary.

Why, If You Really Want Small, You Should Ignore All This

As the massive leap in density between the shared-library and single-object numbers in the table shows, the compiler can do a lot better the more of your program it sees at once. So for the very smallest binaries, take this principle and turn it up to eleven: abandon (for release builds) the idea of using separate libraries at all, and compile your entire program and all its “library” code as a single translation unit.

Not that that’s a good idea if the library code itself is part of the product: if you don’t install the library, no other clients can link against it, and if you install the library but your own program doesn’t use it, RAM is wasted whenever other processes load the library’s copies of code that your program already carries. But for “libraries” tightly wedded to a single binary, and especially for embedded systems (where typically there are no other processes), it does get you better code size than even the best options above.

Thursday 16 July 2009

Little Libraries
