Whitebait, Kleftiko, Ekmek Special: May 2009

Saturday, 30 May 2009

How Not To Develop On Windows

This is a HOW-TO on writing Windows software, including GUI software, without using a Windows box at all except for final testing. Among other benefits, this lets you develop in the simple, testable, scriptable Linux environment; it means you don’t have to forever check things in and out of source-control to test whether you’ve broken the other platform’s builds; and it’s also handy if you’ve got a powerful Linux box but only a feeble Windows box.

Well, actually, it isn’t really a HOW-TO, in the grand tradition; it’s more of a “WHETHER-TO”. That’s because it doesn’t always go into enough detail to let you reconstruct everything described, but it at least tells you that it’s possible — so that if you’re wondering whether-to try and develop Windows software in this fashion, you can be reassured, before setting off in that direction, that the destination is attainable.

There are a lot of bits and pieces to set up a complete cross-development and cross-test environment. We should all be hugely grateful for the vast amount of development effort put in by the GCC, binutils, Mingw, Wine, Qt and other projects to enable the setup described in this one small blog post.

1. A cross-compiler toolchain targeting Mingw32

apt-get install mingw32-binutils mingw32-runtime mingw32

Alternatively, configure and install GNU binutils for target “i586-mingw32”; install the headers from the mingw32-runtime package (which you can’t build yet); configure and install a Stage 1 cross-GCC with

--target=i586-mingw32 --enable-languages="c" --enable-threads=win32 --disable-libmudflap --disable-libssp --enable-__cxa_atexit --enable-sjlj-exceptions --disable-win32-registry --with-gnu-as --with-gnu-ld

as extra configure arguments; configure and install the w32api package; configure and install the mingw32-runtime package; and finally configure and install a Stage 2 fully-working cross-GCC with

--target=i586-mingw32 --enable-languages="c,c++" --enable-threads=win32 --disable-libmudflap --disable-libssp --enable-__cxa_atexit --enable-sjlj-exceptions --disable-win32-registry --with-gnu-as --with-gnu-ld --disable-libstdcxx-pch --enable-libstdcxx-allocator=new

as configure arguments. Note that, supposing you want everything to live in /usr/local/i586-mingw32, you need to give GCC and binutils “--prefix=/usr/local”, and everything else “--prefix=/usr/local/i586-mingw32”.

Except for using w32api and mingw-runtime instead of glibc, this isn’t that different from how to build a cross-compiler for any other target.

If you want to use the exact same versions of everything I did, it’s binutils 2.19.51.0.2, mingw-runtime 3.14, w32api 3.11, and GCC 4.3.3.

A recently-invented third alternative, which I haven’t tried, is the “official” cross-hosted Mingw build tool scripts, which are available on the Mingw Sourceforge page.

2. A native pkg-config for your cross-compiled libraries

Cross-compiled libraries for Mingw32 will put their pkg-config “.pc” files in /usr/local/i586-mingw32/lib/pkgconfig. In order for configure scripts targeting Mingw32 to find them, you’ll need a “cross-pkg-config”: one which, like a cross-compiler, is built for the build platform, not the target platform. If it’s named using the target prefix, as if it were part of the cross-compiler (in our case, i586-mingw32-pkg-config), configure scripts will use it to determine which cross-compiled libraries are present.

Configure pkg-config 0.23 with:

--with-pc-path=/usr/local/i586-mingw32/lib/pkgconfig --program-prefix=i586-mingw32-

(yes, that ends with a hyphen).

3. Cross-compiled versions of all the libraries you use

How hard it is to arrange for these depends a lot on each individual library. In theory all you should need to do is configure the library with “--prefix=/usr/local/i586-mingw32 --host=i586-mingw32”, but in practice very few libraries do the Right Thing with that alone. (Honourable mentions here go to libxml2 and taglib.)

Other things you might have to do to more recalcitrant libraries include: setting CC=i586-mingw32-gcc (and sometimes CXX, AR and/or RANLIB similarly); disabling parts of the library (for libcdio use

--disable-joliet --disable-example-progs --without-iso-info --without-iso-read --without-cd-info --without-cd-drive --without-cd-read

to disable all the example programs) — or, if the worst comes to the worst, actually patching out bits of the library. I had to do that to make taglib compile as a static library.

Boost, as usual, presents the most challenging fight you’ll have with a build system. Here, without further commentary, is the Makefile snippet needed to cross-compile Boost for Mingw32; $(BUILD) is the build directory and $(PREFIX) is where to install the result — /usr/local would match a toolchain built as described above:

cross-mingw32-boost:
        mkdir -p $(BUILD)/cross-mingw32-boost
        tar xjf boost-*bz2 -C $(BUILD)/cross-mingw32-boost
        cd $(BUILD)/cross-mingw32-boost/* \
                && ./bootstrap.sh --prefix=$(PREFIX)/i586-mingw32 \
                        --libdir=$(PREFIX)/i586-mingw32/lib \
                        --includedir=$(PREFIX)/i586-mingw32/include \
                && echo \
"using gcc : : i586-mingw32-g++ : <compileflags>-mthreads <linkflags>-mthreads ;" \
              > jtl-config.jam \
                && ./bjam -q --layout=system variant=release \
                        link=static threading=multi --without-iostreams \
                        -sGXX=i586-mingw32-g++ --without-python \
                        threadapi=win32 --user-config=jtl-config.jam \
                && sudo rm -rf $(PREFIX)/i586-mingw32/include/boost \
                && sudo ./bjam -q --layout=system variant=release \
                        link=static threading=multi --without-iostreams \
                        -sGXX=i586-mingw32-g++ --without-python \
                        threadapi=win32 --user-config=jtl-config.jam install
        for i in $(PREFIX)/i586-mingw32/lib/libboost_*.a ; do \
                sudo i586-mingw32-ranlib $$i ; \
        done
        rm -rf $(BUILD)/cross-mingw32-boost

Again, if you’d like a checklist of successes I’ve had here, then with greater or lesser effort it’s proved possible to make cross-compiled versions of zlib 1.2.3, Boost 1.39.0, libcdio 0.80, taglib 1.5, and libxml2 2.6.30.

4. Wine Is Not an Emulator

Configure and install Wine 1.0.1. Admirably, this just works out of the box, though if your Linux box is 64-bit, you’ll need the 32-bit versions of its dependent libraries installed.

Having got this far, you should have all you need to compile, link, and test Windows programs. Of course, they do have to be Windows programs; Mingw is not Cygwin, and your program needs to be compatible with real, proper Windows including <windows.h>, WSAEventSelect, CreateWindowEx and all that jazz — plus, of course, the Windows text-encoding and file-naming rules.

Indeed, depending on how your project’s unit-tests are set up, you can probably run most of them, too, under Wine. Just arrange for them to be invoked using Wine: instead of “run-test various-args”, execute “wine run-test various-args”. In some situations, this alone would justify the effort of setting up a cross-development environment: the ability to know, before checking in code on Linux, that it passes all its tests both on Linux and Windows.

5. Qt

Trolltech’s, now Nokia’s, Qt framework has for a while been a really good way of writing open-source Linux GUI applications without getting bogged down in X11 or other unhelpful toolkits. Originally Qt was only available for free on the X11 platform, but subsequently the Windows (and even MacOS) versions were also made available to the free software community, and more recently still have been relicensed under the GNU LGPL. This makes it not only a good way of writing open-source Linux applications, but both open-source and proprietary applications on Linux and Windows (and, again, MacOS too).

So it would be handy if a cross-compiled version of Qt could be used to write and test Windows versions of Linux Qt applications using Wine. The problem is, Qt’s sources are huge — twice the size of KOffice, three times the size of Firefox, five times the size of GLib+GTK put together — which is enough to put a fellow off trying to cross-compile it.

But fortunately, Trolltech supply a binary installer for the Windows development libraries, and it’s an installer which works under Wine. So, download qt-win-opensource-x.y.z.exe and run it under Wine. Pick an installation directory (for instance, /usr/local/i586-mingw32/qt, which in Wine-speak is Z:\usr\local\i586-mingw32\qt), and let it install. When it asks for a Mingw installation, give it your cross-compiler’s prefix directory (e.g. /usr/local/i586-mingw32); it’ll moan, but let you ignore the moaning and install anyway (do so).

You then need to arrange for Qt’s pkgconfig files to be available to the cross-compiler. The Win32 installation of Qt doesn’t have pkgconfig files, but you can modify the ones from a native Linux installation of the same version of Qt. To do this, issue the following commands (as root):

# cd /usr/lib/pkgconfig     (Or wherever your existing QtCore.pc is)
# for i in Qt*.pc ; do sed \
   -e 's,^prefix=.*,prefix=/usr/local/i586-mingw32/qt,' \
   -e 's,-I.*/include,-I/usr/local/i586-mingw32/qt/include,' \
   -e 's,-l[^ ]*,&4,' \
 < $i > /usr/local/i586-mingw32/lib/pkgconfig/$i ; done

The three sed replacements fix up the prefix= lines in the .pc files, then fix up stray -I directives in the CFLAGS lines that don’t use the defined prefix, then finally take account of the extra versioning present in the Windows filenames (instead of QtCore.lib, Windows has QtCore4.lib, and similarly across the whole framework).

(It was getting a development version of Chorale’s Qt GUI more-or-less up under Wine, and thus bringing Win32 into its sights for the first time, that prompted the writing of this blog post.)

6. The Promised Land

So there you (hopefully, by now) have it. A well-behaved program, such as Chorale, should mostly configure, build, unit-test, and run its Windows version straightforwardly on a Linux box.

Naturally, you still need to use a real Windows box for final testing — not everything that can happen on a real Windows box can be modelled inside Wine, and nor is everything necessarily as compatible between different Windows versions as you’d hope. But by marginalising Windows out of its own development process until as late as possible, the rest of the development can be eased, indeed accelerated, by only having to develop on a single platform.

Tuesday, 26 May 2009

PathCanonicalize Versus What It Says On The Tin

This post contains MathML mark-up, which may not display correctly on some browsers. (And if you’re wondering how to do MathML in Blogger, see here.)

Here’s what canonicalisation is. You’ve got a set of items, and you can test those items for equality, $x = y$ , but what you actually want to do is test for equivalence, $x ≍ y$ — where equality implies equivalence, but equivalence doesn’t imply equality. What’s needed is a canonicalisation function which maps the equivalence relation onto the equality relation, by mapping all (of each subset of) equivalent items onto a single representative item; a function $f$ such that $f (x) = f (y) \Leftrightarrow x ≍ y$ (plus you want canonicalisation to be idempotent: $f (f (x)) \equiv f (x)$ ).

More concretely, suppose you’re keeping a database of disk files, perhaps to enable searching or browsing. The question is, when do two filenames refer to the same file? You can’t just test them for string equality, as two distinct names might refer to the same file: on Unix, /home/peter/foo and /home/peter/src/../foo are the same file. Plus, symbolic links can be used to make even unrelated-looking names refer to the same file. If your database lists the file by one name, and someone does a query looking for another name, it won’t be found — and worse, the file could get into the database several times under different names, perhaps with conflicting or stale information.

But fortunately, “..”, “.”, and symlinks between them are about the size of it for Unix ways of obscuring the naming of a file, and the standard library comes with a suitable canonicalisation function that will reduce elaborated forms into the single unambiguous original pathname. (In fact, the GNU C library comes with two such functions, as the more-portable one, realpath(), needs a little care in use in order to avoid buffer-overrun attacks; the GNU one, canonicalize_file_name(), does not.) So you canonicalise all filenames as you store them into your database, and make sure to canonicalise all filenames in queries before you look them up, and you’ll get the right matches.
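As a minimal sketch of the Unix side (CanonicalisePosix is a name invented here, not anything from the standard library), realpath() can be wrapped so that the library allocates the buffer itself, which POSIX.1-2008 permits when the second argument is NULL. The self-check assumes /usr exists on the machine running it.

```cpp
#include <cassert>
#include <stdlib.h>  // realpath (POSIX), free
#include <string>

// Canonicalise a Unix filename. Returns the input unchanged if the
// file does not exist or cannot be resolved. (Hypothetical helper.)
std::string CanonicalisePosix(const std::string& path)
{
    // NULL asks realpath() to malloc() a big-enough buffer, which
    // avoids the fixed-size-buffer overrun risk.
    char *resolved = ::realpath(path.c_str(), NULL);
    if (!resolved)
        return path;
    std::string result(resolved);
    ::free(resolved);
    return result;
}

// Quick self-check of the two properties wanted from a canonicaliser.
void CheckCanonicalisePosix()
{
    // Equivalent names map to the same representative...
    assert(CanonicalisePosix("/usr/../usr/.") == CanonicalisePosix("/usr"));
    // ...and canonicalising twice changes nothing.
    const std::string once = CanonicalisePosix("/usr/./");
    assert(CanonicalisePosix(once) == once);
}
```

Returning the input on failure is a design choice for this sketch: a database keyed on canonical names may still want to record a name whose file has since vanished.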

And then eventually you’re going to want to port that software to Windows — whereupon you’ve got a problem.

Indeed, you’ve got a whole little family of problems, because even once you’ve navigated the treacherous waters of the textual encoding of filenames under Win32, there still remain a bewildering variety of ways to refer to the same file:

`D:\Music\Fools Gold.flac`	— Probably canonical
`D:/Music/Fools Gold.flac`	— Slash versus backslash
`D:\MUSIC\Fools Gold.flac`	— Case-insensitive per locale
`D:\Music\FOOLSG~1.FLA`	— MS-DOS 8.3
`M:\Fools Gold.flac`	— After “subst M: D:\Music”
`\Device\HarddiskVolume2\Music\Fools Gold.flac`	— If D: is local
`\\server\share\Music\Fools Gold.flac`	— If D: is a network drive
`\\?\UNC\server\share\Music\Fools Gold.flac`	— Or like this
`\\?\D:\Music\Fools Gold.flac`	— Ultra-long-filenames mode
`\\.\D:\Music\Fools Gold.flac`	— Device namespace
`\\?\UNC\D:\Music\Fools Gold.flac`	— Allegedly
`\\?\Volume{GUID}\Music\Fools Gold.flac`	— Crikey

This whole dismal farrago really calls for a path canonicalisation function. Which is why it’s unfortunate that there isn’t one, and doubly unfortunate that there’s a function called PathCanonicalize() that particularly isn’t one, and not just because it’s spelled with a “Z”. All that PathCanonicalize() does is remove “/../” and “/./” substrings — it’s a purely textual transformation and doesn’t even touch the filesystem. It certainly doesn’t satisfy the “canonicaliser condition”:

$f (x) = f (y) \Leftrightarrow x$ and $y$ are the same file
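To make the shortfall concrete, here is a portable sketch of a purely textual dot-collapser. It is a stand-in written for illustration, not PathCanonicalize() itself; CollapseDots is an invented name, and it takes forward-slash paths only to keep the code short.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Collapse "." and ".." components textually, never touching the
// filesystem. (Hypothetical helper, forward slashes only.)
std::string CollapseDots(const std::string& path)
{
    std::vector<std::string> kept;
    std::string::size_type pos = 0;
    while (pos <= path.size())
    {
        std::string::size_type slash = path.find('/', pos);
        if (slash == std::string::npos)
            slash = path.size();
        const std::string part = path.substr(pos, slash - pos);
        pos = slash + 1;

        if (part == "." || (part.empty() && !kept.empty()))
            continue;              // "/./", a doubled or a trailing slash
        if (part == ".." && kept.size() > 1)
            kept.pop_back();       // "/../" cancels the previous component
        else if (part != "..")
            kept.push_back(part);  // "D:", "" (leading slash), or a name
    }

    if (kept.size() == 1 && kept[0].empty())
        return "/";                // the root directory itself

    std::string result;
    for (std::vector<std::string>::size_type i = 0; i < kept.size(); ++i)
        result += (i ? "/" : "") + kept[i];
    return result;
}

void CheckCollapseDots()
{
    // The dots go away...
    assert(CollapseDots("D:/Music/../Music/./Fools Gold.flac")
           == "D:/Music/Fools Gold.flac");
    // ...but two spellings of the same file still compare unequal.
    assert(CollapseDots("D:/MUSIC/Fools Gold.flac")
           != CollapseDots("D:/Music/Fools Gold.flac"));
}
```

The second assertion is the point: a textual transformation cannot know that MUSIC and Music name the same directory, so the condition above fails.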

No, there’s no shortcut: it has to be done laboriously and textually (with lots of unit tests to cover all those ridiculous cases). The plan is: use GetFullPathName() to turn relative paths into absolute; repeatedly call QueryDosDevice() to unwind subst’d drive letters; if GetDriveType() says the drive is remote, convert the remaining drive letter into a UNC path (WNetGetUniversalName() would do; the code below uses WNetGetConnection()); and finally call GetLongPathName() to get rid of 8.3-ness and canonicalise case.

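// Link against mpr (-lmpr) for WNetGetConnectionW.
// UTF8ToUTF16() and UTF16ToUTF8() are the project's own helpers (not shown).
#include <windows.h>
#include <algorithm>
#include <string>
#include <wchar.h>
#include <ctype.h>

std::string Canonicalise(const std::string& path)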
{
    std::wstring utf16 = UTF8ToUTF16(path);

    wchar_t canon[MAX_PATH];

    /** Note that PathCanonicalize does NOT do what we want here -- it's a
     * purely textual operation that eliminates /./ and /../ only.
     */
    DWORD rc = ::GetFullPathNameW(utf16.c_str(), MAX_PATH, canon, NULL);
    if (!rc || rc >= MAX_PATH)  // failed, or result won't fit in canon
        return path;

    utf16 = canon;

    if (utf16.length() >= 6)
    {
        /** Get rid of \\?\ and \\.\ prefixes on drive-letter paths */
        if (!wcsncmp(utf16.c_str(), L"\\\\?\\", 4) && utf16[5] == L':')
            utf16.erase(0,4);
        else if (!wcsncmp(utf16.c_str(), L"\\\\.\\", 4) && utf16[5] == L':')
            utf16.erase(0,4);
    }

    if (utf16.length() >= 10)
    {
        /** Get rid of \\?\UNC on drive-letter and UNC paths */
        if (!wcsncmp(utf16.c_str(), L"\\\\?\\UNC\\", 8))
        {
            if (utf16[9] == L':' && utf16[10] == L'\\')
                utf16.erase(0,8);
            else
            {
                utf16.erase(0,7);
                utf16 = L"\\" + utf16;
            }
        }
    }

    /** Anything other than UNC and drive-letter is something we don't
     * understand
     */
    if (utf16[0] == L'\\' && utf16[1] == L'\\')
    {
        if (utf16[2] == '?' || utf16[2] == '.')
            return path; // Not understood

        /** OK -- UNC */
    }
    else if (((utf16[0] >= 'A' && utf16[0] <= 'Z')
              || (utf16[0] >= 'a' && utf16[0] <= 'z'))
             && utf16[1] == ':')
    {
        /** OK -- drive letter -- unwind subst'ing */
        for (;;)
        {
            wchar_t drive[3];
            drive[0] = (wchar_t)toupper(utf16[0]);
            drive[1] = L':';
            drive[2] = L'\0';
            canon[0] = L'\0';
            rc = ::QueryDosDeviceW(drive, canon, MAX_PATH);
            if (!rc)
                break;
            if (!wcsncmp(canon, L"\\??\\", 4))
            {
                utf16 = std::wstring(canon+4) + std::wstring(utf16, 2);
            }
            else // Not subst'd
                break;
        }

        wchar_t drive[4];
        drive[0] = (wchar_t)toupper(utf16[0]);
        drive[1] = ':';
        drive[2] = '\\';
        drive[3] = '\0';

        rc = ::GetDriveTypeW(drive);

        if (rc == DRIVE_REMOTE)
        {
            DWORD bufsize = MAX_PATH;

            /* QueryDosDevice and WNetGetConnection FORBID the
             * trailing slash; GetDriveType REQUIRES it.
             */
            drive[2] = '\0';

            rc = ::WNetGetConnectionW(drive, canon, &bufsize);
            if (rc == NO_ERROR)
                utf16 = std::wstring(canon) + std::wstring(utf16, 2);
        }
    }
    else
    {
        // Not understood
        return path;
    }

    /** Canonicalise case and 8.3-ness */
    rc = ::GetLongPathNameW(utf16.c_str(), canon, MAX_PATH);
    if (!rc || rc >= MAX_PATH)  // failed, or result won't fit in canon
        return path;

    std::string utf8 = UTF16ToUTF8(canon);
    std::replace(utf8.begin(), utf8.end(), '\\', '/');
    return utf8;
}

There are still ways to fool this function: for instance, by exporting a directory as \\server\share1 and a subdirectory of it as \\server\share2 — the client has no way of matching them up. But that’s a pretty pathological case, and it could be easily argued that it’s something you’d never do unless you wanted the shares to appear to be distinct. More seriously, the server for a network drive can be specified by WINS name, by FQDN or by IP address; neither canonicalise-to-IP nor canonicalise-to-FQDN is the Right Thing in all cases. For now I’m sweeping that issue under the carpet.

The one remaining wrinkle is that, unlike earlier versions of Windows, Vista allows “genuine” Unix-like symbolic links. Without doing real testing on Vista, it’s hard to make out from the documentation how the APIs used above behave when faced with such symbolic links. It’s even possible that the new-in-Vista GetFinalPathNameByHandle() call is the answer to all these problems; in which case, this code gets demoted to merely the way to do it on pre-Vista versions.
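On the subject of those unit tests: the prefix-stripping near the top of Canonicalise() is purely textual, so it can be pulled out and tested on any platform, Wine or no Wine. A sketch, with StripPrefixes as an invented name and the same rules as the code above:

```cpp
#include <cassert>
#include <string>

// The textual prefix handling from Canonicalise(), as a free function.
// (Hypothetical refactoring; paths it doesn't understand pass through.)
std::wstring StripPrefixes(std::wstring p)
{
    const std::wstring unc = L"\\\\?\\UNC\\";

    if (p.compare(0, unc.size(), unc) == 0)
    {
        if (p.size() > 10 && p[9] == L':' && p[10] == L'\\')
            p.erase(0, 8);             // \\?\UNC\D:\x     -> D:\x
        else
            p.replace(0, 8, L"\\\\");  // \\?\UNC\server\x -> \\server\x
    }
    else if (p.size() >= 6 && p[5] == L':'
             && (p.compare(0, 4, L"\\\\?\\") == 0
                 || p.compare(0, 4, L"\\\\.\\") == 0))
    {
        p.erase(0, 4);                 // \\?\D:\x or \\.\D:\x -> D:\x
    }
    return p;
}

void CheckStripPrefixes()
{
    assert(StripPrefixes(L"\\\\?\\D:\\Music\\a.flac") == L"D:\\Music\\a.flac");
    assert(StripPrefixes(L"\\\\.\\D:\\Music\\a.flac") == L"D:\\Music\\a.flac");
    assert(StripPrefixes(L"\\\\?\\UNC\\server\\share\\a.flac")
           == L"\\\\server\\share\\a.flac");
    assert(StripPrefixes(L"\\\\?\\UNC\\D:\\Music\\a.flac")
           == L"D:\\Music\\a.flac");
    // Volume-GUID paths are left alone, to be rejected by the caller.
    assert(StripPrefixes(L"\\\\?\\Volume{GUID}\\Music")
           == L"\\\\?\\Volume{GUID}\\Music");
}
```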

Wednesday, 6 May 2009

Unicode Is The One True God, And UTF-8 Is His Prophet

One common anti-pattern in the Win32 API is the one whereby something is done wrong the first time, then later re-implemented properly — but, for compatibility reasons, the older, broken implementation can never actually be removed or fixed. (The uncharitable might say that another common anti-pattern is the same thing, but where the re-implementation doesn’t work either.)

An example of one or both of these anti-patterns is encoding support. The Win32 API’s Unicode support was invented before UTF-8 was, so Microsoft can’t exactly be blamed for not using UTF-8. Instead they used UCS-2 (later retconned to UTF-16), but, with 16-bit characters, this was grievously incompatible with Win16 — and, indeed, with all other C source in the world at that time. So they kept an 8-bit-character “ANSI” API around as well, compatible with the 8-bit code-pages of Win16.

Without the same compatibility millstone around its neck, Linux (eventually) embraced UTF-8 into its locale system, to the extent that nowadays it’s reasonable to describe all non-UTF-8 locales as “legacy”. When writing portable software (location-portable as well as system-portable) for Linux, it’s quite the natural thing to do to store all text in UTF-8, and, if happening to find oneself running in a legacy locale, to translate to UTF-8 on the way in and from UTF-8 on the way out. That way, almost all the code can just deal with normal char* and assume it’s encoded in UTF-8.

And then eventually you’re going to want to port that software to Windows — whereupon you’ve got a problem.

Indeed, you have a whole little family of problems, and they’ve got names such as fopen, rename, unlink, and more. All of these calls — and those are just the ISO C ones; Posix and C++ add more — take 8-bit strings as filename arguments, and, in Windows, there are some files that this just won’t let you see. Files in Windows filesystems have UTF-16 names, which, when you use 8-bit calls, are mapped down into the characters available in the current ANSI code-page. (Strictly speaking, the code-page set as the “Code-page for non-Unicode applications” in Control Panel.)

So suppose you have your code-page for non-Unicode applications set to CP1252 for generic Western European, and your Greek friend, who has his code-page for non-Unicode applications set to CP1253 for Greek, writes two files with Greek names onto a USB stick and hands it to you. The filenames will be stored as UTF-16, featuring characters which do not exist in your code-page. And if you have two files whose names are only distinguishable by characters not in your code-page, you can’t distinguish them — they’ll both come back as, say, “????.txt”, and, while there’s enough grody hacks in Windows that fopen will actually open “????.txt”, that still only gets you one of the two files. There’s no way of opening the other one using 8-bit calls.
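The collision is easy to model. The sketch below uses a toy code-page in which anything outside Latin-1 becomes “?”; that is an assumption made for brevity, and Windows’ real CP1252 mapping is more elaborate, but the loss of information has the same shape. ToToyCodePage is an invented name.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for the 8-bit view of a UTF-16 filename: any character
// outside Latin-1 is replaced by '?'. (Hypothetical, illustration only.)
std::string ToToyCodePage(const std::u16string& name)
{
    std::string out;
    for (char16_t c : name)
        out += (c < 0x100) ? static_cast<char>(c) : '?';
    return out;
}

void CheckCollision()
{
    const std::u16string a = u"\u03B1\u03B2\u03B3\u03B4.txt";  // αβγδ.txt
    const std::u16string b = u"\u03B5\u03B6\u03B7\u03B8.txt";  // εζηθ.txt
    assert(a != b);                           // two distinct files...
    assert(ToToyCodePage(a) == "????.txt");   // ...with one 8-bit name
    assert(ToToyCodePage(b) == "????.txt");
}
```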

There are only a few things you can do about this. The easiest, certainly, is to ignore it completely; after all, most people only create files named in their own language. This is the position taken by Cygwin (before 1.7, anyway) and thus, implicitly, by all Unix software which has been ported to Windows by simply recompiling it against Cygwin (mentioning no names). It is, of course, very unsatisfactory: for instance, among non-Czechs partial to MP3s of music composed by Antonín Dvořák.

Alternatively, you could go through all your code replacing all the strings used to keep filenames in with wide-strings, all the calls to fopen with _wfopen, and so on — or use all the ugly macros such as _tfopen (which was the Microsoft advice even for Windows-only programs, back in the day when Windows 95 was Win32 but not Unicode, so it was common to make two binary builds of the same sources). But this is a lot of work — and error-prone, particularly if you rely on other libraries that have been ported to Windows using the ignore-the-problem technique, and you thus have to go and update them too.

Fortunately, there’s another way of solving the problem, at least if you’re targeting Windows using Mingw and are thus using the GNU binutils linker. If you add

-Wl,--wrap,fopen -Wl,--wrap,rename -Wl,--wrap,unlink … etc.

to the linker command-line, all calls to (for instance) fopen will be converted into calls to __wrap_fopen, which you can implement in your own code. In particular, you can make it do something like this:

extern "C" FILE* __cdecl __wrap_fopen(const char *path, const char *mode)
{
    std::wstring wpath, wmode;
    util::UTF8ToWide(path, &wpath);
    util::UTF8ToWide(mode, &wmode);
    return _wfopen(wpath.c_str(), wmode.c_str());
}

extern "C" int __cdecl __wrap_rename(const char *oldname, const char *newname)
{
    std::wstring woldname, wnewname;
    util::UTF8ToWide(oldname, &woldname);
    util::UTF8ToWide(newname, &wnewname);
    return _wrename(woldname.c_str(), wnewname.c_str());
}

(You don’t, of course, get prototypes for __wrap_fopen or __wrap_rename, so you must take care to give them the same signatures as the genuine functions.) This way, all your code can carry on using UTF-8 filenames everywhere, passing them to system calls such as fopen, and even passing them to external libraries, so long as you’ve wrapped the calls the libraries themselves make. (Which you can check by asking the linker for a map file: -Wl,-Map,map.txt.)
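The wrappers above lean on a conversion helper, util::UTF8ToWide, which isn’t shown. On Windows itself MultiByteToWideChar() with CP_UTF8 would be the obvious way to write it; the following is instead a portable sketch of the same conversion, so that it can be tested on Linux. Utf8ToUtf16 is an invented name, and the decoder is deliberately simplified: malformed input becomes U+FFFD, but overlong encodings and UTF-8-encoded surrogates are not rejected.

```cpp
#include <cassert>
#include <string>

// Decode UTF-8 into UTF-16 code units. (Hypothetical, simplified.)
std::u16string Utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    for (std::string::size_type i = 0; i < in.size(); )
    {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        unsigned long cp;
        int extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else                            { cp = 0xFFFD;      extra = 0; }

        for (; extra > 0; --extra)
        {
            if (i >= in.size()
                || (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80)
            {
                cp = 0xFFFD;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        if (cp > 0x10FFFF)
            cp = 0xFFFD;

        if (cp < 0x10000)
            out += static_cast<char16_t>(cp);
        else
        {
            cp -= 0x10000;  // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

void CheckUtf8ToUtf16()
{
    assert(Utf8ToUtf16("Dvo\xC5\x99\xC3\xA1k") == u"Dvo\u0159\u00E1k");
    // U+1F600 lies outside the BMP, so it needs two UTF-16 units.
    const std::u16string pair = Utf8ToUtf16("\xF0\x9F\x98\x80");
    assert(pair.size() == 2 && pair[0] == 0xD83D && pair[1] == 0xDE00);
}
```

Since wchar_t is 16 bits wide on Windows, the result there could be copied into a std::wstring unit for unit before being handed to _wfopen.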

Naturally, there’s no point using ANSI Win32 APIs in Windows-specific parts of your code, only to then wrap them — it only makes sense there to go straight for the UTF-16 ones (with the W on the end, such as RegQueryValueExW) to start with. This wrapping technique is a way of avoiding rewriting the existing portable parts of your code.

It’s like having an ANSI code-page of UTF-8 all to your very own.
