Not in fact any relation to the famous large Greek meal of the same name.

Wednesday, 6 May 2009

Unicode Is The One True God, And UTF-8 Is His Prophet

One common anti-pattern in the Win32 API is the one whereby something is done wrong the first time, then later re-implemented properly — but, for compatibility reasons, the older, broken implementation can never actually be removed or fixed. (The uncharitable might say that another common anti-pattern is the same thing, but where the re-implementation doesn’t work either.)

An example of one or both of these anti-patterns is encoding support. The Win32 API’s Unicode support was invented before UTF-8 was, so Microsoft can’t exactly be blamed for not using UTF-8. Instead they used UCS-2 (later retconned to UTF-16), but, with 16-bit characters, this was grievously incompatible with Win16 — and, indeed, with all other C source in the world at that time. So they kept an 8-bit-character “ANSI” API around as well, compatible with the 8-bit code-pages of Win16.

Without the same compatibility millstone around its neck, Linux (eventually) embraced UTF-8 into its locale system, to the extent that nowadays it’s reasonable to describe all non-UTF-8 locales as “legacy”. When writing portable software (location-portable as well as system-portable) for Linux, it’s quite the natural thing to do to store all text in UTF-8, and, if happening to find oneself running in a legacy locale, to translate to UTF-8 on the way in and from UTF-8 on the way out. That way, almost all the code can just deal with normal char* and assume it’s encoded in UTF-8.
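To make the "translate on the way in" step concrete, here is a minimal sketch for one particular legacy encoding, ISO-8859-1, chosen only because its byte values coincide with Unicode code points. The function name latin1_to_utf8 is made up for illustration; real software would more likely hand this job to iconv or a similar library, which also copes with the other legacy encodings.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: re-encode ISO-8859-1 text as UTF-8.
// Bytes below 0x80 are ASCII and pass through unchanged; every other byte
// is a code point in U+0080..U+00FF and needs a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation
        }
    }
    assert(out.size() >= in.size());  // UTF-8 never shrinks Latin-1 text
    return out;
}
```

After a call such as latin1_to_utf8(argv[1]) at the program's edge, everything downstream can treat the result as an ordinary UTF-8 char* string.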

And then eventually you’re going to want to port that software to Windows — whereupon you’ve got a problem.

Indeed, you have a whole little family of problems, and they’ve got names such as fopen, rename, unlink, and more. All of these calls (and those are just the ISO C ones; POSIX and C++ add more) take 8-bit strings as filename arguments, and, in Windows, there are some files that these calls just won’t let you see. Files in Windows filesystems have UTF-16 names, which, when you use 8-bit calls, are mapped down into the characters available in the current ANSI code-page. (Strictly speaking, the code-page set as the “Code-page for non-Unicode applications” in Control Panel.)

So suppose you have your code-page for non-Unicode applications set to CP1252 for generic Western European, and your Greek friend, who has his code-page for non-Unicode applications set to CP1253 for Greek, writes two files with Greek names onto a USB stick and hands it to you. The filenames will be stored as UTF-16, featuring characters which do not exist in your code-page. And if you have two files whose names are only distinguishable by characters not in your code-page, you can’t distinguish them: they’ll both come back as, say, “????.txt”, and, while there are enough grody hacks in Windows that fopen will actually open “????.txt”, that still only gets you one of the two files. There’s no way of opening the other one using 8-bit calls.
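The collision is easy to model in portable code. The sketch below is not the real Windows conversion: to_ansi_model is a made-up stand-in that treats every code unit up to U+00FF as representable and turns everything else into '?', which is roughly (not exactly) what a CP1252 user would see. The two Greek filenames are likewise invented for the demonstration.

```cpp
#include <cassert>
#include <string>

// Simplified model of the wide-to-ANSI mapping. Assumption: code units up
// to U+00FF survive, everything else becomes '?'. Real CP1252 differs in
// the 0x80..0x9F range, so this is an illustration, not a conversion table.
std::string to_ansi_model(const std::u16string& wide)
{
    std::string out;
    for (char16_t unit : wide) {
        out.push_back(unit <= 0xFF ? static_cast<char>(unit) : '?');
    }
    return out;
}

// Two different Greek names become the same 8-bit name.
void demo_collision()
{
    const std::u16string first  = u"\u03B1\u03B2\u03B3\u03B4.txt";
    const std::u16string second = u"\u03B5\u03B6\u03B7\u03B8.txt";
    assert(first != second);                                // distinct on disk
    assert(to_ansi_model(first) == to_ansi_model(second));  // identical to 8-bit code
}
```

Once both names have been flattened to the same string, no 8-bit call can say which file was meant.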

There are only a few things you can do about this. The easiest, certainly, is to ignore it completely; after all, most people only create files named in their own language. This is the position taken by Cygwin (before 1.7, anyway) and thus, implicitly, by all Unix software which has been ported to Windows by simply recompiling it against Cygwin (mentioning no names). It is, of course, very unsatisfactory: for instance, among non-Czechs partial to MP3s of music composed by Antonín Dvořák.

Alternatively, you could go through all your code replacing all the strings used to keep filenames in with wide-strings, all the calls to fopen with _wfopen, and so on — or use all the ugly macros such as _tfopen (which was the Microsoft advice even for Windows-only programs, back in the day when Windows 95 was Win32 but not Unicode, so it was common to make two binary builds of the same sources). But this is a lot of work — and error-prone, particularly if you rely on other libraries that have been ported to Windows using the ignore-the-problem technique, and you thus have to go and update them too.

Fortunately, there’s another way of solving the problem, at least if you’re targeting Windows using MinGW and are thus using the GNU binutils linker. If you add

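-Wl,--wrap,fopen -Wl,--wrap,rename -Wl,--wrap,unlink … etc.

to the linker command-line, all calls to (for instance) fopen in the objects and static libraries being linked will be converted into calls to __wrap_fopen, which you can implement in your own code. In particular, you can make it do something like this:

#include <stdio.h>   // FILE; _wfopen and _wrename are declared here or in <wchar.h>
#include <string>    // std::wstring

// util::UTF8ToWide is the project's own helper (its header is not shown):
// it converts a UTF-8 char* string into a UTF-16 std::wstring.

// Called in place of fopen: treats path as UTF-8 and opens via the wide API.
extern "C" FILE* __cdecl __wrap_fopen(const char *path, const char *mode)
{
    std::wstring wpath, wmode;
    util::UTF8ToWide(path, &wpath);
    util::UTF8ToWide(mode, &wmode);   // _wfopen takes a wide mode string too
    return _wfopen(wpath.c_str(), wmode.c_str());
}

// Called in place of rename: both names arrive as UTF-8.
extern "C" int __cdecl __wrap_rename(const char *oldname, const char *newname)
{
    std::wstring woldname, wnewname;
    util::UTF8ToWide(oldname, &woldname);
    util::UTF8ToWide(newname, &wnewname);
    return _wrename(woldname.c_str(), wnewname.c_str());
}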

(You don’t, of course, get prototypes for __wrap_fopen or __wrap_rename, so you must take care to give them the same signatures as the genuine functions.) This way, all your code can carry on using UTF-8 filenames everywhere, passing them to system calls such as fopen, and even passing them to external libraries, so long as you’ve wrapped the calls the libraries themselves make. (Which you can check by asking the linker for a map file: -Wl,-Map,map.txt.)
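The wrappers lean entirely on the UTF-8 to UTF-16 conversion, so it is worth sketching what such a helper has to do. This is not the util::UTF8ToWide used in the wrappers (its source isn't shown here); utf8_to_utf16 is a hypothetical portable equivalent that returns std::u16string so it behaves the same where wchar_t is 32 bits wide. Replacing malformed input with U+FFFD is one policy among several and is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-16 code units.
// Each malformed byte is replaced by U+FFFD instead of aborting.
std::u16string utf8_to_utf16(const std::string& in)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        bool ok = in.size() - i > extra;  // enough bytes left?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject truncated, overlong, surrogate and out-of-range sequences.
        if (!ok || cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        i += extra + 1;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            assert(cp <= 0xFFFFF);
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows itself the same job is usually delegated to MultiByteToWideChar with CP_UTF8; the hand-rolled version just shows that nothing magic is happening.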

Naturally, there’s no point using ANSI Win32 APIs in Windows-specific parts of your code, only to then wrap them — it only makes sense there to go straight for the UTF-16 ones (with the W on the end, such as RegQueryValueExW) to start with. This wrapping technique is a way of avoiding rewriting the existing portable parts of your code.
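Going straight to the W functions does mean that any strings they hand back arrive as UTF-16 and have to be brought into the UTF-8 world before the portable code sees them. A minimal sketch of that return trip follows; utf16_to_utf8 is a hypothetical name, and turning unpaired surrogates into U+FFFD is an assumption of this sketch, not documented Windows behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode UTF-16 code units as UTF-8.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // High surrogate followed by low surrogate: combine the pair.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

With a helper like this at the boundary, a name read through a W call can be passed on to the portable code as a normal UTF-8 string.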

It’s like having an ANSI code-page of UTF-8 all to your very own.

CC0 To the extent possible under law, the author of this work has waived all copyright and related or neighboring rights to this work.