# Whitebait, Kleftiko, Ekmek Special

Not in fact any relation to the famous large Greek meal of the same name.

## Tuesday, 26 May 2009

### PathCanonicalize Versus What It Says On The Tin

This post contains MathML mark-up, which may not display correctly on some browsers. (And if you’re wondering how to do MathML in Blogger, see here.)

Here’s what canonicalisation is. You’ve got a set of items, and you can test those items for equality, $x = y$, but what you actually want to do is test for equivalence, $x \asymp y$, where equality implies equivalence but equivalence doesn’t imply equality. What’s needed is a canonicalisation function which maps the equivalence relation onto the equality relation, by mapping all the items in each equivalence class onto a single representative item: a function $f$ such that $f(x) = f(y) \Leftrightarrow x \asymp y$. (You also want canonicalisation to be idempotent: $f(f(x)) = f(x)$.)
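As a toy illustration of those two properties, here is a minimal sketch in which the equivalence relation is “equal apart from ASCII letter case” and the canonical representative is the all-lower-case form. The names `CanonicaliseCase`, `Equivalent` and `CheckCanonicaliser` are made up for this example, and real filename equivalence is much messier than this.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Toy canonicaliser f: fold ASCII letters to lower case, so that every
// string in an equivalence class maps onto one representative.
std::string CanonicaliseCase(const std::string& s)
{
    std::string result(s);
    for (std::string::size_type i = 0; i < result.size(); ++i)
        result[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(result[i])));
    return result;
}

// Equivalence test built from plain equality plus the canonicaliser:
// x and y are equivalent exactly when f(x) == f(y).
bool Equivalent(const std::string& x, const std::string& y)
{
    return CanonicaliseCase(x) == CanonicaliseCase(y);
}

// Usage: check both properties on sample values.
void CheckCanonicaliser()
{
    const std::string x = "README.TXT";
    const std::string y = "readme.txt";
    assert(x != y);             // not equal...
    assert(Equivalent(x, y));   // ...but equivalent
    // Idempotent: f(f(x)) == f(x)
    assert(CanonicaliseCase(CanonicaliseCase(x)) == CanonicaliseCase(x));
}
```

Storing only `CanonicaliseCase(x)` as the database key is what lets an ordinary equality lookup stand in for the equivalence test.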

More concretely, suppose you’re keeping a database of disk files, perhaps to enable searching or browsing. The question is, when do two filenames refer to the same file? You can’t just test them for string equality, as two distinct names might refer to the same file: on Unix, /home/peter/foo and /home/peter/src/../foo are the same file. Plus, symbolic links can be used to make even unrelated-looking names refer to the same file. If your database lists the file by one name, and someone does a query looking for another name, it won’t be found — and worse, the file could get into the database several times under different names, perhaps with conflicting or stale information.

But fortunately, “..”, “.”, and symlinks between them are about the size of it for Unix ways of obscuring the naming of a file, and the standard library comes with a suitable canonicalisation function that will reduce elaborated forms into the single unambiguous original pathname. (In fact, the GNU C library comes with two such functions, as the more-portable one, realpath(), needs a little care in use in order to avoid buffer-overrun attacks; the GNU one, canonicalize_file_name(), does not.) So you canonicalise all filenames as you store them into your database, and make sure to canonicalise all filenames in queries before you look them up, and you’ll get the right matches.
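For the Unix side, a minimal sketch of a wrapper around realpath() might look like the following. It assumes a POSIX.1-2008 system, where passing NULL as the second argument asks realpath() to allocate a correctly sized buffer itself, which is one way of taking the care mentioned above. The wrapper name `RealPath` and its empty-string-on-failure convention are choices made for this example only.

```cpp
#include <cassert>
#include <stdlib.h>   // realpath(), free()
#include <string>

// Canonicalise a Unix pathname: resolves ".", ".." and symbolic links.
// Returns an empty string if the path cannot be resolved (for instance
// because a component does not exist).
std::string RealPath(const std::string& path)
{
    // NULL output buffer: realpath() mallocs one of the right size, so
    // there is no fixed-size buffer to overrun.
    char *resolved = ::realpath(path.c_str(), NULL);
    if (!resolved)
        return std::string();
    std::string result(resolved);
    ::free(resolved);
    return result;
}

// Usage: different spellings of the root directory canonicalise identically.
void CheckRealPath()
{
    assert(RealPath("/./../.") == "/");
    assert(RealPath("/..") == RealPath("/"));
}
```

Both the names stored in the database and the names in queries would go through the same wrapper before being compared.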

And then eventually you’re going to want to port that software to Windows — whereupon you’ve got a problem.

Indeed, you’ve got a whole little family of problems, because even once you’ve navigated the treacherous waters of the textual encoding of filenames under Win32, there still remain a bewildering variety of ways to refer to the same file:

| Path | Notes |
|---|---|
| `D:\Music\Fools Gold.flac` | Probably canonical |
| `D:/Music/Fools Gold.flac` | Slash versus backslash |
| `D:\MUSIC\Fools Gold.flac` | Case-insensitive per locale |
| `D:\Music\FOOLSG~1.FLA` | MS-DOS 8.3 |
| `M:\Fools Gold.flac` | After “subst M: D:\Music” |
| `\Device\HarddiskVolume2\Music\Fools Gold.flac` | If D: is local |
| `\\server\share\Music\Fools Gold.flac` | If D: is a network drive |
| `\\?\UNC\server\share\Music\Fools Gold.flac` | Or like this |
| `\\?\D:\Music\Fools Gold.flac` | Ultra-long-filenames mode |
| `\\.\D:\Music\Fools Gold.flac` | Device namespace |
| `\\?\UNC\D:\Music\Fools Gold.flac` | Allegedly |
| `\\?\Volume{GUID}\Music\Fools Gold.flac` | Crikey |

This whole dismal farrago really calls for a path canonicalisation function. Which is why it’s unfortunate that there isn’t one, and doubly unfortunate that there’s a function called PathCanonicalize() that particularly isn’t one, and not just because it’s spelled with a “Z”. All that PathCanonicalize() does is remove “/../” and “/./” substrings — it’s a purely textual transformation and doesn’t even touch the filesystem. It certainly doesn’t satisfy the “canonicaliser condition”:

$f(x) = f(y) \quad\Leftrightarrow\quad x$ and $y$ are the same file
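To make the failure concrete, here is a sketch of the general kind of purely textual dot-removal described above. It is not a reimplementation of PathCanonicalize(); the function name `CollapseDots` is made up, and it assumes a backslash-separated path that starts with a drive letter.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Remove "." and ".." components from a drive-letter path, purely textually.
// It never consults the filesystem, so it knows nothing about letter case,
// 8.3 names, subst'd drives or network shares.
std::string CollapseDots(const std::string& path)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (start <= path.size())
    {
        std::string::size_type end = path.find('\\', start);
        if (end == std::string::npos)
            end = path.size();
        const std::string part = path.substr(start, end - start);
        if (part == "..")
        {
            if (parts.size() > 1)   // never pop the leading "D:" component
                parts.pop_back();
        }
        else if (!part.empty() && part != ".")
            parts.push_back(part);
        start = end + 1;
    }

    std::string result;
    for (std::vector<std::string>::size_type i = 0; i < parts.size(); ++i)
    {
        if (i)
            result += '\\';
        result += parts[i];
    }
    if (parts.size() == 1)          // keep "D:\" rather than bare "D:"
        result += '\\';
    return result;
}

// Usage: dots are collapsed, but two names for the same file still differ.
void CheckCollapseDots()
{
    assert(CollapseDots("D:\\Music\\..\\Music\\.\\Fools Gold.flac")
           == "D:\\Music\\Fools Gold.flac");
    assert(CollapseDots("D:\\MUSIC\\Fools Gold.flac")
           != CollapseDots("D:\\Music\\Fools Gold.flac"));
}
```

The second assertion is the point: both inputs name the same file on a case-insensitive volume, yet the textual transformation leaves them unequal.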

No, there’s no shortcut: it has to be done laboriously and textually, with lots of unit tests to cover all those ridiculous cases. The plan is, use GetFullPathName() to turn relative paths into absolute, then repeatedly call QueryDosDevice() to unwind subst’d drive letters, then call GetLongPathName() to get rid of 8.3-ness and canonicalise case, and then finally, if GetDriveType() says it’s remote, use WNetGetUniversalName() to convert the remaining drive letter into a UNC path. (The code below actually does the last two steps in the opposite order, and uses WNetGetConnection() rather than WNetGetUniversalName() for the UNC conversion.)

```cpp
#include <windows.h>    // Link with mpr.lib for WNetGetConnectionW
#include <algorithm>
#include <cwchar>
#include <cwctype>
#include <string>

// UTF8ToUTF16() and UTF16ToUTF8() are conversion helpers defined elsewhere.

std::string Canonicalise(const std::string& path)
{
    std::wstring utf16 = UTF8ToUTF16(path);

    wchar_t canon[MAX_PATH];

    /** Note that PathCanonicalize does NOT do what we want here -- it's a
     * purely textual operation that eliminates /./ and /../ only.
     */
    DWORD rc = ::GetFullPathNameW(utf16.c_str(), MAX_PATH, canon, NULL);
    if (!rc || rc >= MAX_PATH) // Failed, or result too long for canon
        return path;

    utf16 = canon;

    if (utf16.length() >= 6)
    {
        /** Get rid of \\?\ and \\.\ prefixes on drive-letter paths */
        if (!wcsncmp(utf16.c_str(), L"\\\\?\\", 4) && utf16[5] == L':')
            utf16.erase(0,4);
        else if (!wcsncmp(utf16.c_str(), L"\\\\.\\", 4) && utf16[5] == L':')
            utf16.erase(0,4);
    }

    if (utf16.length() >= 10)
    {
        /** Get rid of \\?\UNC on drive-letter and UNC paths */
        if (!wcsncmp(utf16.c_str(), L"\\\\?\\UNC\\", 8))
        {
            /** Only look at index 10 if the string really is that long */
            if (utf16.length() >= 11
                && utf16[9] == L':' && utf16[10] == L'\\')
                utf16.erase(0,8);
            else
            {
                utf16.erase(0,7);
                utf16 = L"\\" + utf16;
            }
        }
    }

    /** Too short to be either a UNC or a drive-letter path */
    if (utf16.length() < 3)
        return path;

    /** Anything other than UNC and drive-letter is something we don't
     * understand
     */
    if (utf16[0] == L'\\' && utf16[1] == L'\\')
    {
        if (utf16[2] == L'?' || utf16[2] == L'.')
            return path; // Not understood

        /** OK -- UNC */
    }
    else if (((utf16[0] >= L'A' && utf16[0] <= L'Z')
              || (utf16[0] >= L'a' && utf16[0] <= L'z'))
             && utf16[1] == L':')
    {
        /** OK -- drive letter -- unwind subst'ing */
        for (;;)
        {
            wchar_t drive[3];
            drive[0] = (wchar_t)towupper(utf16[0]);
            drive[1] = L':';
            drive[2] = L'\0';
            canon[0] = L'\0';
            rc = ::QueryDosDeviceW(drive, canon, MAX_PATH);
            if (!rc)
                break;
            if (!wcsncmp(canon, L"\\??\\", 4))
            {
                /** Subst'd: splice in the target of the substitution */
                utf16 = std::wstring(canon+4) + std::wstring(utf16, 2);
            }
            else // Not subst'd
                break;
        }

        wchar_t drive[4];
        drive[0] = (wchar_t)towupper(utf16[0]);
        drive[1] = L':';
        drive[2] = L'\\';
        drive[3] = L'\0';

        rc = ::GetDriveTypeW(drive);

        if (rc == DRIVE_REMOTE)
        {
            DWORD bufsize = MAX_PATH;

            /* QueryDosDevice and WNetGetConnection FORBID the
             * trailing slash; GetDriveType REQUIRES it.
             */
            drive[2] = L'\0';

            rc = ::WNetGetConnectionW(drive, canon, &bufsize);
            if (rc == NO_ERROR)
                utf16 = std::wstring(canon) + std::wstring(utf16, 2);
        }
    }
    else
    {
        // Not understood
        return path;
    }

    /** Canonicalise case and 8.3-ness */
    rc = ::GetLongPathNameW(utf16.c_str(), canon, MAX_PATH);
    if (!rc || rc >= MAX_PATH) // Failed, or result too long for canon
        return path;

    std::string utf8 = UTF16ToUTF8(canon);
    std::replace(utf8.begin(), utf8.end(), '\\', '/');
    return utf8;
}
```
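Since the prefix-stripping part is pure string manipulation, it can be pulled out and unit-tested on any platform, with no Win32 calls involved. The following is only a sketch of that idea, not the code above: the names `StartsWith`, `HasDriveAt` and `StripWin32Prefix` are hypothetical, and unlike the listing it insists on a backslash after the drive colon. Every index is checked against the string length before it is used.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if s begins with prefix.
static bool StartsWith(const std::wstring& s, const std::wstring& prefix)
{
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: true if an ASCII drive letter followed by ":\" sits
// at position pos. The length is checked before any character is read.
static bool HasDriveAt(const std::wstring& s, std::wstring::size_type pos)
{
    if (s.size() < pos + 3)
        return false;
    const wchar_t c = s[pos];
    const bool letter = (c >= L'A' && c <= L'Z') || (c >= L'a' && c <= L'z');
    return letter && s[pos + 1] == L':' && s[pos + 2] == L'\\';
}

// Strip \\?\, \\.\ and \\?\UNC\ prefixes from drive-letter and UNC paths;
// anything else (such as \\?\Volume{GUID}\...) is returned unchanged.
std::wstring StripWin32Prefix(const std::wstring& path)
{
    const std::wstring unc(L"\\\\?\\UNC\\");    // \\?\UNC\ .
    const std::wstring longname(L"\\\\?\\");    // \\?\ .
    const std::wstring device(L"\\\\.\\");      // \\.\ .

    if (StartsWith(path, unc))
    {
        if (HasDriveAt(path, unc.size()))
            return path.substr(unc.size());         // -> D:\...
        return L"\\\\" + path.substr(unc.size());   // -> \\server\share\...
    }
    if ((StartsWith(path, longname) || StartsWith(path, device))
        && HasDriveAt(path, longname.size()))
        return path.substr(longname.size());        // -> D:\...
    return path;
}

// Usage: a few of the ridiculous cases, including a too-short input.
void CheckStripWin32Prefix()
{
    assert(StripWin32Prefix(L"\\\\?\\D:\\Music\\a.flac") == L"D:\\Music\\a.flac");
    assert(StripWin32Prefix(L"\\\\?\\UNC\\server\\share\\a.flac")
           == L"\\\\server\\share\\a.flac");
    assert(StripWin32Prefix(L"\\\\?\\UNC\\D:") == L"\\\\D:");
}
```

Keeping this step free of API calls means the awkward inputs can be fed through it in ordinary unit tests, leaving only the calls that genuinely touch the system to be tested on Windows.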

There are still ways to fool this function: for instance, by exporting a directory as \\server\share1 and a subdirectory of it as \\server\share2, in which case the client has no way of matching them up. But that’s a pretty pathological case, and it could easily be argued that it’s something you’d never do unless you wanted the shares to appear to be distinct. More seriously, the server for a network drive can be specified by WINS name, by FQDN or by IP address; neither canonicalise-to-IP nor canonicalise-to-FQDN is the Right Thing in all cases. For now I’m sweeping that issue under the carpet.

The one remaining wrinkle is that, unlike earlier versions of Windows, Vista allows “genuine” Unix-like symbolic links. Without doing real testing on Vista, it’s hard to make out from the documentation how the APIs used above behave when faced with such symbolic links. It’s even possible that the new-in-Vista GetFinalPathNameByHandle() call is the answer to all these problems; in which case, this code gets demoted to merely the way to do it on pre-Vista versions.

1. “…if GetDriveType() says it’s remote, use WNetGetUniversalName() to convert the remaining drive letter into a UNC path.”

It's difficult to know whether this is always the right thing to do. Suppose I keep my music collection on a machine called korma and map drive M: to point there. I later buy a new machine called dhal with oodles more disk space and move my music there. I keep korma around to use for something else. I can make this transition seamless by changing the drive letter mapping to point at dhal. Simplistic programs (along with my brain and fingers) don't notice the transition but paths canonicalised by your scheme have now all changed.

Perhaps the only downside to this is a lot of disk grinding while whatever is using these files notices that many files have disappeared and many have arrived. I suppose it depends ultimately whether waiting for the new files to be noticed is an inconvenience and whether the canonicalised paths have been used as file identifiers elsewhere.

2. Yes, and there are other situations too, like an itinerant laptop where M: is either a share on your home NAS or a share on your office desktop, depending on where the laptop currently is, but with the same content in each case. It's not guaranteed that canonicalising M: to UNC is always right. But there is a solution in those cases, one which people might already be used to for other services such as web and email: point M: at "\\music\share", and give "music" a DNS CNAME record pointing at korma (later to be changed to dhal without anyone noticing).

3. Ah, this reminds me of a section of the "Compatibility Death Matrix" I had on the big whiteboard at Akai - namely dealing with how the database understood paths using various combinations of OSX, Windows, and Linux OSes with FAT, HFS+ and NTFS file systems, and then additional evil for international characters in the paths, file names, ID3 tags (created by many different applications) etc, etc. At one point the VP of engineering was paying a visit, saw this portion of the matrix, and told me that I was completely insane. I then pointed out that there was not a single combination displayed on the whiteboard which had not at one point resulted in a showstopping bug. I hate my life.

4. Unfortunately, GetFinalPathNameByHandle isn't a solution either. It fails to resolve symlinks if a handle can't be obtained (for example, create a symlink to c:\pagefile.sys and then try to find where it resolves to).

5. Looking at the end of this function, I wonder about "/** Canonicalise case and 8.3-ness */": there are reports on the net that one should call GetShortPathNameW() and then GetLongPathNameW() in order to get proper case adjustment for subpaths longer than 8.3. Others suggest a call to FindFilesW. Do you have an opinion?

6. This code has a buffer overrun bug. The check for drive letters with \\?\UNC\ paths may go out of bounds if the string is 10 characters long. Specifically, the line "if (utf16[9] == L':' && utf16[10] == L'\\')" should not try to access index 10 since we did not first validate that the string has 11 characters. There may be more issues like this, I haven't looked very closely yet.