Processing modules at paths that contain extended unicode characters on windows fails
As far as I can determine, binaryen always fails to open files that contain non-ascii (or perhaps non-latin-1 or some other windowsy encoding) in their path on windows. This is probably because there is no explicit "wide char" support in the code, which is how windows represents unicode, as utf-16.
I have tried to process files via unicode paths with a self-built wasm-opt.exe within the developer command prompt, inside cygwin's mintty, and programmatically (from rust bindings):
cygwin:
$ bin/wasm-opt.exe ../..//hello_world❤️.wasm -o ../..//hello_world_opt.wasm
Failed opening '../..//hello_world??.wasm'
command prompt:
c:\cygwin64\home\Brian\binaryen\build>bin\wasm-opt.exe ..\..\hello_world❤️.wasm -o ..\..\hello_world_opt.wasm
Failed opening '..\..\hello_world??.wasm'
Fixing this seems like it would require all paths to be hidden behind some platform-specific typedefs that resolve to WCHAR on windows, but I don't have much experience here.
cc https://github.com/brson/wasm-opt-rs/issues/40
Looks like we could use std::filesystem::path since we're on C++17.
filesystem::path looks like the right way to go and should just work with the standard file i/o methods.
From googling I don't see a standardized abstraction for getting wide char arguments from main. Seems like dealing with that and then sorting out which arguments are strings and which are paths could be messy.
Is this something the maintainers would be open to accepting patches for? I could see it being fairly disruptive to the codebase. It might be done in two steps: adjusting the internals to be unicode-correct on windows, then adapting all the main functions.
I think this would be worth accepting patches for. I can imagine the changes would be disruptive in the parts of the code base that deal with file paths, but I think those parts are fairly small overall, so it should be ok.
I have made progress on this here: https://github.com/brson/binaryen/tree/unicode
The wasm-opt in that branch can input and output to unicode paths on windows. I still need to convert all the other bins, do some cleanup, write some tests. It'll be a while yet.
The changes necessary are not all that invasive. As structured now the bins have to do some unicode conversions for path types, though it might be possible to hide those conversions inside the command-line Options type.
There are two PRs open related to this issue:
- https://github.com/WebAssembly/binaryen/pull/5514 - this is my PR. It adopts filesystem::path, and changes the main function to reencode the main arguments on windows
- https://github.com/WebAssembly/binaryen/pull/5627 - this one just does the incremental step of using filesystem::path in the I/O APIs