asar icon indicating copy to clipboard operation
asar copied to clipboard

Handling of utf-8 path names.

Open Horrowind opened this issue 7 years ago • 10 comments
trafficstars

Is this supported? Should this be supported? How would one do this on Windows?

I tested Asar in Wine and while it could open files which I specified on the command line, it did not want to open incsrc'ed utf-8 files and wrote wrong file names when an error occured (/tmp/Σ÷ⁿ.asm:1: error: Unknown command. [├ñ├╢├╝] instead of /tmp/äöü.asm:1: error: Unknown command. [äöü])

I need this one for the normalization routine we talked about in the other thread. I can not simply do operations on single characters if the string could be utf-8.

Horrowind avatar Apr 13 '18 22:04 Horrowind

Yeah, I assumed this problem would occur at some point (but forgot to make an issue for it). Have already had to fix this issue in games I've worked on in the past, where some users couldn't save their game data because their Windows user path contained Unicode characters.

I'm assuming this problem doesn't exist on Linux, since Linux uses UTF-8 as its native encoding, so as long as patches also contain UTF-8, everything should be fine. The native encoding of Windows, on the other hand, is UTF-16, but by default, it only uses ANSI-compatible functions. UTF-16 functions have to be used explicitly for Unicode-compatibility. fopen() is no such function on Windows (a quick Google search shows that you can somehow enable UTF-8 encoding for fopen(), but that only seems to be for opening files in text mode, not for handling file paths themselves). I assume this means we need to use _wfopen() on Windows in all places where fopen() is currently used. It also means we have to convert passes from UTF-8 to UTF-16 whenever opening files, but that shouldn't be too difficult.

I'm not sure if there's any other function to account for. I think the file_exists() function has a platform-specific implementation, which might not use a Unicode-supported function yet. Besides that, I'm not sure if anything else needs to be changed.

As for path normalization, I assume the easiest solution here would be to convert the UTF-8 string to a UTF-32 string, that way there would be no multi-byte sequences.

Or actually... you know what, I don't even think that's necessary, since . / and \ are still the same byte values in UTF-8 and can never appear inside a multi-byte sequence in a well-formed UTF-8 string. So in reality, for normalization, just treating the path like an ANSI path should be totally fine, I don't think it should break anything, even when we add proper Unicode support some time in the future (which I think should definitely be done).

RPGHacker avatar Apr 13 '18 23:04 RPGHacker

You seem to be right about '' -> '/', but my problem is really tolower. I am pretty sure there is some functionality in Windows for this, but I could not find it.

The thing I could find was path normalization if the path exists. There is a lot of craziness going on, see here: http://pdh11.blogspot.de/2009/05/pathcanonicalize-versus-what-it-says-on.html

Horrowind avatar Apr 14 '18 08:04 Horrowind

Oh, I get your point.

Maybe we just leave out the lowercase conversion then. Of course it would be useful to have it, but one could argue that if someone calls a file "äpfel.asm" and then puts "incsrc Äpfel.asm" in another file, it's an error, anyways (and thinking about it, that file would also fail to compile entirely on Linux). Thinking about the Linux-incompatibility now, I suppose some sanity checking (with a warning) would actually be even better than converting to lowercase silently, but I think even sanity-checking like that would require a lowercase conversion, so for the sake of keeping this simple, let's just say we don't do the lowercase conversion and hope people are smart enough to not mix different cases.

RPGHacker avatar Apr 14 '18 10:04 RPGHacker

Well we can also force the use of case sensitive paths with CreateFile and the FILE_FLAG_POSIX_SEMANTICS flag, see https://msdn.microsoft.com/en-us/library/aa363858(v=vs.85).aspx

Quoting MSDN here:

Use care when using this option, because files created with this flag may not be accessible by applications that are written for MS-DOS or 16-bit Windows.

:)

Horrowind avatar Apr 14 '18 10:04 Horrowind

Sounds like a good and reasonable idea. Noone cares about MS-DOS or 16-bit Windows, anyways. It means that _wfopen() won't be an option and that we'll have to refactor slightly more to make use of CreateFileW(), but it still shouldn't turn out too bad, I think.

RPGHacker avatar Apr 14 '18 11:04 RPGHacker

So should I take it that you are currently working on the path normalization stuff? Just asking so that I don't accidentally work on the same stuff in parallel.

RPGHacker avatar Apr 15 '18 13:04 RPGHacker

If you want to take a look, this is how I implemented normalization (in platform/file_helpers.cpp). This seemed to work okayish, sadly, my whole implementation of memory files is now in the fun state of failing tests randomly with the added spice of working in the debugger...

If it helps, you can use this routine for whatever you want to with it (include guards, I suppose?)

//TODO(Horrowind): This still needs to implement lookup of symlinks.
string normalize_path(const char* path, const char* base_path) {
    string combined;
    if(path_is_absolute(path))
    {
        combined = string(path);
    }
    else
    {
        if(!path_is_absolute(base_path)) return string{ };
        combined = create_combined_path(base_path, path);
    }
    
    char path_seperator = get_native_path_separator();

    int length = combined.truelen();

    char* begin = combined.str;
    char* end   = begin + length;
    char* in    = begin;
    char* out   = begin;

    while (in < end)
    {
        if (in + 1 == end && in[1] == '.')
        {
            break;
        }
        else if (in + 2 == end && in[1] == '.' && in[2] == '.')
        {
            // Move out backwards
            for (char* s = out - 1; s >= begin; s--)
            {
                if (*s == path_seperator || *s == '/')
                {
                    out = s;
                    break;
                }
            }
            break;
        }
        else if(in + 2 <= end && in[1] == '.' && in[2] == path_seperator)
        {
            in += 2;
        }
        else if(in + 3 <= end && in[1] == '.' && in[2] == '.' && in[3] == path_seperator)
        {
            // Move out backwards
            for(char* s = out - 1; s >= begin; s--)
            {
                if(*s == path_seperator || *s == '/')
                {
                    out = s;
                    break;
                }
            }
            in += 3;
        }
        else
        {
            do
            {
                *out++ = *in++;
            }
            while(in < end && *in != path_seperator);
        }
    }

    combined.str[out - begin] = '\0';
    return combined;
}

Horrowind avatar Apr 15 '18 19:04 Horrowind

One minor thing I'm noticing is that the final two else ifs only check for path_seperator, not for /, which means it will probably fail on Windows with a path that uses ../ instead of ..\, for example. That's why in my first comment, I suggested replacing all \ by / first (it makes the path folding part a bit easier, since you don't have to compare against the native path separator and can compare against / only).

Anyways, not seeing anything obviously broken in the code. If it fails outside of the debugger only, it might be a timing-related issue (although I don't know how that could be the case here, since we don't even use multi-threading). Another idea: maybe instead of char* out = begin;, try using string out; and write into an entirely new string. Not sure if that could be the case here, but in theory, I can imagine some stuff going wrong when input string and output string are identical, and since we don't really have to care about memory management in Asar, just creating a new string should be fine and (probably) safer. That's what I would try.

And yep, I need path normalization for include guards - that seems like the only somewhat reliable way of storing which files you've already opened.

RPGHacker avatar Apr 15 '18 20:04 RPGHacker

No, it is the other code I wrote to implement memory files that is buggy, this one seems to be fine. And you are right about \ vs /. The fixed version will be in the pull request.

Horrowind avatar Apr 15 '18 20:04 Horrowind

Oh, I see. So how are you testing them? I'm assuming you worked them into the test suite?

RPGHacker avatar Apr 15 '18 20:04 RPGHacker

You may now use poop emoji file names on Windows. Please enjoy this feature. Closing.

p4plus2 avatar Jan 21 '24 01:01 p4plus2