String.UTF8.decode accesses out of bounds memory on invalid input
When passing an invalid UTF-8 byte sequence into String.UTF8.decode, I expect it to return an empty string or a string containing the bytes up to the invalid part. In practice, the following code leads to out-of-bounds memory accesses:
```ts
let b = new Uint8Array(10001);
let payloadString = String.UTF8.decode(b.buffer);
```
And the following code traps with an `unreachable`:
```ts
let b = new Uint8Array(10001);
b[10001] = 0x0a;
let payloadString = String.UTF8.decode(b.buffer);
```
I first noticed this bug when allocating an ArrayBuffer externally using `__new` that is just a few bytes long and ends in 0x0a; I always got an out-of-bounds memory access when trying to convert it to a string from within AssemblyScript. The code above reproduces the same problem, though it is slightly different from the way I actually found it.
All of this was tested using WAMR, so it's possible (though unlikely) that this is a VM bug. If required, I can probably create a small C++ program using WAMR that reproduces this the way I first found it, though I believe my examples above should reproduce it in any VM.
For reference, the code in question is here:
https://github.com/AssemblyScript/assemblyscript/blob/e5fb10b476f82e40485a661109d2ff53eb768b07/std/assembly/string.ts#L761-L809
At first glance, this seems to be missing checks (is it?).
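To illustrate what a missing check would mean here, the following is a hypothetical sketch (not the actual stdlib code) of a decoder that trusts the lead byte's announced sequence length and never re-checks the buffer end. In plain TypeScript the overread merely yields `undefined` (which coerces to 0 in bitwise math); in a linear-memory language the same access pattern would read out of bounds:

```typescript
// Hypothetical sketch of the suspected failure mode -- NOT the real
// AssemblyScript stdlib code. Handles only ASCII and 2-byte sequences.
function naiveDecode(buf: Uint8Array): string {
  let out = "";
  let i = 0;
  while (i < buf.length) {
    const b = buf[i++];
    if (b < 0x80) {
      out += String.fromCharCode(b); // ASCII byte
    } else if ((b & 0xe0) === 0xc0) {
      // Reads buf[i] even when i === buf.length. In a linear-memory
      // language this would be the out-of-bounds access; in JS/TS it
      // yields undefined, which `& 0x3f` coerces to 0.
      const b2 = buf[i++];
      out += String.fromCharCode(((b & 0x1f) << 6) | (b2 & 0x3f));
    } else {
      out += "\uFFFD"; // simplification: treat longer sequences as invalid
    }
  }
  return out;
}
```

With a well-formed input the function behaves normally; with a truncated 2-byte sequence at the end of the buffer it silently reads past the array instead of stopping.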
> And the following unreachable:
> ```ts
> let b = new Uint8Array(10001); b[10001] = 0x0a; let payloadString = String.UTF8.decode(b.buffer);
> ```
That's expected due to OOB in the `b` array. You're accessing index `10001` while the max possible index is `10000`.
> When passing an invalid UTF8 string into String.UTF8.decode, I expect it to return an empty string or a string containing the bytes up to the invalid part. In practice, the following code leads to out of bounds memory accesses:

What kind of invalid string? One whose length is larger than the Uint8Array's size, or one with an invalid UTF16 codepoint?
> And the following unreachable:
> ```ts
> let b = new Uint8Array(10001); b[10001] = 0x0a; let payloadString = String.UTF8.decode(b.buffer);
> ```
>
> That's expected due to OOB in the `b` array. You're accessing index `10001` while the max possible index is `10000`.
You're totally right. When replacing 10001 with 10000, the error disappears.
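For context on why this off-by-one is easy to miss: in plain JS/TS, ordinary typed arrays silently drop out-of-bounds writes instead of trapping the way AssemblyScript does. A small illustration (standard ECMAScript behavior, not AssemblyScript):

```typescript
// In plain JS/TS, a write past the end of a typed array is silently
// ignored, whereas AssemblyScript surfaces it as an `unreachable` trap.
const b = new Uint8Array(10001); // valid indices are 0 .. 10000
b[10001] = 0x0a;                 // out of bounds: silently dropped in JS
// b[10000] = 0x0a; would be the last valid write.
```

So the same off-by-one that traps in AssemblyScript goes completely unnoticed when prototyping the equivalent logic in JavaScript.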
> On a first glimpse, this seems to be missing checks (is it?).

That's what I'm thinking as well.
> When passing an invalid UTF8 string into String.UTF8.decode, I expect it to return an empty string or a string containing the bytes up to the invalid part. In practice, the following code leads to out of bounds memory accesses:
>
> What kind of invalid string? String with length which larger than Uint8Array's size or invalid UTF16 codepoint?
I think it's the latter. I'm currently experimenting a bit to make it more reproducible. One (other?) thing I noticed is that the following code:
```ts
let b = new Uint8Array(32 * 64 * 3);
for (let i = 0; i < b.length; ++i) {
  b[i] = u8(i);
}
let payloadString = String.UTF8.decode(b.buffer);
```
doesn't actually produce a valid UTF-16 string, which is expected in a way, but also weird; I feel like invalid UTF-8 input should cause errors or empty strings or something similar.
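For comparison, the WHATWG `TextDecoder` API offers both of the behaviors mentioned above: by default, invalid bytes become U+FFFD replacement characters, and passing `{ fatal: true }` turns them into thrown errors instead. A Node.js sketch:

```typescript
// How TextDecoder handles invalid UTF-8, for comparison with
// String.UTF8.decode. Default mode substitutes U+FFFD; fatal mode throws.
const bad = new Uint8Array([0x68, 0x69, 0xff]); // "hi" plus an invalid byte

// Lenient (default): invalid byte becomes the replacement character.
const lenient = new TextDecoder("utf-8").decode(bad); // "hi\uFFFD"

// Fatal: invalid input is rejected outright with a TypeError.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bad);
} catch (e) {
  threw = true;
}
```

Either behavior would arguably be preferable to silently producing an invalid UTF-16 string.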
Okay, so here are the things I learned:
- It's easiest to crash if the buffer is allocated from the outside using `__new`
- The following code always crashes for me on the first execution, but not on later ones:
```ts
let b = new Uint8Array(1024 * 64);
b[b.length - 1] = 0xFF;
let payloadString = String.UTF8.decode(b.buffer);
```
This does look to me like a missing out-of-bounds check in the UTF-8 parser; the crashes depend on where the buffer sits in memory. I assume there are no bounds checks when accessing elements of the ArrayBuffer during UTF-8 decoding, so sometimes the decoder runs past the buffer and reads something else, and sometimes it runs past the buffer and crashes because nothing is mapped there.
I looked at decode meanwhile and am not seeing where it would access out of bounds. Each load is guarded by either `while (bufOff < bufEnd)` or `if (bufEnd == bufOff) break;`. Maximum store range is correct. Also tried the reproduction snippet, including allocating differently sized strings, with and without GC collects in between, incl. additional defensive asserts added to decode, without encountering any issues. As such it seems the issue is something else.
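The guard pattern described above can be sketched in TypeScript: every continuation-byte load is preceded by an end-of-buffer check, so a truncated sequence terminates the loop instead of reading past the buffer. This is a simplified sketch (ASCII and 2-byte sequences only), not the actual stdlib implementation:

```typescript
// Simplified sketch of a bounds-guarded UTF-8 decode loop, modeled on
// the guard pattern described above; NOT the actual stdlib code.
function guardedDecode(buf: Uint8Array): string {
  let out = "";
  let off = 0;
  const end = buf.length;
  while (off < end) {               // outer guard for the lead byte
    const b = buf[off++];
    if (b < 0x80) {
      out += String.fromCharCode(b);
    } else if ((b & 0xe0) === 0xc0) {
      if (end === off) break;       // truncated sequence: never read past end
      const b2 = buf[off++];
      out += String.fromCharCode(((b & 0x1f) << 6) | (b2 & 0x3f));
    }
    // longer/invalid sequences: lead byte is simply skipped in this sketch
  }
  return out;
}
```

With these guards in place, a buffer ending in a lead byte (like the 0xFF repro above) simply produces a shorter string rather than an out-of-bounds read.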
I just tinkered around with it a bit more, seems like you're right, the problem is elsewhere. Running the following code repeatedly produces the same (sporadic) errors:
```ts
let b = new Uint8Array(1024 * 64);
b[b.length - 1] = 0xFF;
```
This seemed very weird to me, so I checked all my embedding code, but found no errors there. Then I updated my version of WAMR to the latest release, WAMR-1.1.0, and now everything works! It seems I was hitting a VM bug.
Sorry that I bothered you; it seems there is nothing wrong with AssemblyScript. I should have checked whether the problem persists in other VMs, but I just didn't think it could be that. Thank you for trying to help me fix this :).
For future reference, my previous WAMR version was commit db210fbc66ca779a7bc6748cd340cf5eab3de6d4.