wasmtime icon indicating copy to clipboard operation
wasmtime copied to clipboard

Invalid UTF-8 was detected in one or more arguments

Open quasis opened this issue 1 year ago • 11 comments

Running a WASM file with a binary argument, such as:

wasmtime run hmac_md5 --key `printf "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa"` "String to hash"

or

wasmtime run hmac_md5 -- --key `printf "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa"` "String to hash"

Aborts execution with the following error:

error: Invalid UTF-8 was detected in one or more arguments

I see three problems with this:

  1. There is no standard I'm aware of that requires the arguments to be textual
  2. There is no standard that requires them to be UTF-8 encoded
  3. Wasmtime is doing unneeded work while validating the arguments, and/or converting them to UTF-8

quasis avatar Sep 24 '22 21:09 quasis

Arguments are text, not binary data. For example on Windows it has to transcode from UTF-16 (as provided by the OS) to UTF-8 (as expected by wasi programs) Additionally null bytes are not allowed in arguments on Windows and Unix as they indicate the end of the argument (on Unix) or the end of the entire commandline (on Windows).

bjorn3 avatar Sep 24 '22 22:09 bjorn3

I believe we have to differentiate between two types of arguments here: arguments to wasmtime itself and arguments to the executed program. The program in its minimal variant will look like this:

int
main(int argc, char *argv[]) {
    // parse 2nd type of arguments here and convert them to whatever encoding necessary,
    // or fail if encoding doesn't fit the design
}

Right now wasmtime is forcing UTF-8 arguments on the program itself, which is IMHO quite limiting for the program (if done by design).

quasis avatar Sep 24 '22 22:09 quasis

If you want the program to run on Windows, UTF-16 to UTF-8 conversion has to be done one way or another as wasi programs don't know the host architecture.

bjorn3 avatar Sep 25 '22 06:09 bjorn3

Also, command-line arguments are already not suitable for passing arbitrary binary data, because they're passed to C programs as NUL-terminated strings, so they can't contain embedded 0 bytes. If you want to pass arbitrary binary data as arguments to programs today, I suggest using an escape or quoting system, or a format like BASE64, to ensure that all the bytes are preserved.

sunfishcode avatar Sep 25 '22 18:09 sunfishcode

I believe there's a bug here where the wasmtime executable shouldn't panic on invalid-utf-8, but otherwise I agree with others that your program shouldn't rely on being able to receive arbitrary bytes as arguments.

Additionally the wasmtime CLI has no way of actually communicating this argument to the program at this time. The wasmtime cli can only invoke functions which take numbers as input, which means that even if this were accepted it wouldn't end up doing anything.

alexcrichton avatar Sep 26 '22 14:09 alexcrichton

I believe we're concentrating on the wrong side of the coin. Binary arguments are useful for unit tests, where you fully control the environment, know the limitations, and want as little opcodes as possible between the essence of the test and the environment.

The other side of the coin is the forceful conversion of arguments to the UTF-8 encoding, which is a rather large constraint IMHO considering the zoo of character encodings out there. Perhaps the problem with UTF-16 on Windows, which as I understand is the main reason for conversion to UTF-8, can be solved some other way?

p.s. There is a relevant discussion on the SG-16 group page: https://github.com/sg16-unicode/sg16/issues/66, it seems they aim to solve the same problem that we're facing?

quasis avatar Sep 26 '22 17:09 quasis

Binary arguments are useful for unit tests, where you fully control the environment, know the limitations, and want as little opcodes as possible between the essence of the test and the environment.

On Windows, binary arguments would have to have a multiple-of-two size as the argument list consist of a single 16bit character list. In addition on Windows the program itself it responsible for parsing the argument list from a single string into however many arguments the user intended to parse. This is not the responsibility of the shell like on Unix. This means that if your binary argument contains a space character, it would either cause two arguments to be passed, or you would need to escape it, which doesn't really work well with binary arguments I would think.

The other side of the coin is the forceful conversion of arguments to the UTF-8 encoding, which is a rather large constraint IMHO considering the zoo of character encodings out there.

How would you suggest making wasi programs agnostic to the character encoding without forcefully converting from and to UTF-8 at the boundary between the wasi program and the runtime? On Windows the A variants of all system api's have a behavior depending on the system character set, which wasi programs can't query and even if it could, now wasi programs would need to ship with huge tables translating between every possible character set and whatever encoding it uses internally (like UTF-8).

bjorn3 avatar Sep 26 '22 17:09 bjorn3

How would you suggest making wasi programs agnostic to the character encoding without forcefully converting from and to UTF-8 at the boundary between the wasi program and the runtime?

Perhaps a flag to wasmtime, which defaults to UTF-8? Something like:

wasmtime --default-encoding=[system|utf-8|...]

This was the first thing I was looking for after facing the problem the first time. It might not be perfect, but at least intuitive. :)

quasis avatar Sep 26 '22 18:09 quasis

I believe we're concentrating on the wrong side of the coin. Binary arguments are useful for unit tests, where you fully control the environment, know the limitations, and want as little opcodes as possible between the essence of the test and the environment.

For a truly minimal unit test, could you use the embedding API, instead of the CLI? Command-line arguments involve quite a lot of code, on both the host side and in wasm, even ignoring the character encoding issues. With the embedding API, you can have a function that takes an array of u8 directly, which would avoid the whole question about string encodings, and avoid all the command-line argument code too.

sunfishcode avatar Sep 26 '22 18:09 sunfishcode

For a truly minimal unit test, could you use the embedding API, instead of the CLI? Command-line arguments involve quite a lot of code, on both the host side and in wasm, even ignoring the character encoding issues. With the embedding API, you can have a function that takes an array of u8 directly, which would avoid the whole question about string encodings, and avoid all the command-line argument code too.

I definitely agree with your reasoning. My only comment on that, is that Linux CLI is based on a mature code-base and provides a "battle-proven" solution for almost any upstream/downstream problem, while embedding API almost always lacks an "in-house" solution and requires a third partly library, that is generally either not mature enough or not maintained for years. So there are advantages and disadvantages in both approaches.

quasis avatar Sep 26 '22 21:09 quasis

I'm a little unclear about what you're referring to. When I say embedding API, II mean Wasmtime's embedding API: https://docs.wasmtime.dev/lang.html

I can also mention that the current direction for WASI is that it will define command-line arguments to be Unicode strings. While it may be useful in some unit-test-like use cases to let users that know how the host environment works to take advantage of it, the goal in many real-world wasm use cases is to have modules that don't know the environment, and don't interact with the "system" character encoding, because such knowledge would make them less portable.

This only applies to things which are called "strings", which includes command-line arguments. If you have APIs that take u8s, you'll always be able to pass arbitrary bytes through.

sunfishcode avatar Sep 27 '22 16:09 sunfishcode