String.UTF8.decode accesses out of bounds memory on invalid input
When passing an invalid UTF-8 byte sequence into String.UTF8.decode, I expect it to return an empty string or a string containing the bytes up to the invalid part. In practice, the following code leads to out-of-bounds memory accesses:
```ts
let b = new Uint8Array(10001);
let payloadString = String.UTF8.decode(b.buffer);
```
And the following code traps with an `unreachable`:
```ts
let b = new Uint8Array(10001);
b[10001] = 0x0a;
let payloadString = String.UTF8.decode(b.buffer);
```
I first noticed this bug when allocating an ArrayBuffer externally using `__new` that is just a few bytes long and ends in 0x0a; I always got an out-of-bounds memory access when trying to convert it to a string from within AssemblyScript. The code above reproduces the same problem, though it is slightly different from the way I actually found it.
All of this was tested using WAMR, so it's possible (though unlikely) that this is a VM bug. If required, I can probably create a small C++ program using WAMR that reproduces this the way I first found it, though I believe my examples above should reproduce it in any VM.
For reference, the code in question is here:
https://github.com/AssemblyScript/assemblyscript/blob/e5fb10b476f82e40485a661109d2ff53eb768b07/std/assembly/string.ts#L761-L809
At first glance, this seems to be missing checks (is it?).
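To illustrate what a missing check would mean here, the following is a hypothetical sketch (not the actual stdlib code) of a decoder that trusts the lead byte's announced sequence length and never re-checks the buffer end. In plain TypeScript the overread merely yields `undefined` (which coerces to 0 in bitwise math); in a linear-memory language the same access pattern would read out of bounds:

```typescript
// Hypothetical sketch of the suspected failure mode -- NOT the real
// AssemblyScript stdlib code. Handles only ASCII and 2-byte sequences.
function naiveDecode(buf: Uint8Array): string {
  let out = "";
  let i = 0;
  while (i < buf.length) {
    const b = buf[i++];
    if (b < 0x80) {
      out += String.fromCharCode(b); // ASCII byte
    } else if ((b & 0xe0) === 0xc0) {
      // Reads buf[i] even when i === buf.length. In a linear-memory
      // language this would be the out-of-bounds access; in JS/TS it
      // yields undefined, which `& 0x3f` coerces to 0.
      const b2 = buf[i++];
      out += String.fromCharCode(((b & 0x1f) << 6) | (b2 & 0x3f));
    } else {
      out += "\uFFFD"; // simplification: treat longer sequences as invalid
    }
  }
  return out;
}
```

With a well-formed input the function behaves normally; with a truncated 2-byte sequence at the end of the buffer it silently reads past the array instead of stopping.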
> And the following unreachable:
> ```ts
> let b = new Uint8Array(10001); b[10001] = 0x0a; let payloadString = String.UTF8.decode(b.buffer);
> ```
That's expected due to OOB in the `b` array. You're accessing index `10001` while the max possible index is `10000`.
> When passing an invalid UTF8 string into String.UTF8.decode, I expect it to return an empty string or a string containing the bytes up to the invalid part. In practice, the following code leads to out of bounds memory accesses:

What kind of invalid string? One whose length is larger than the Uint8Array's size, or one with an invalid UTF16 codepoint?
> And the following unreachable:
> ```ts
> let b = new Uint8Array(10001); b[10001] = 0x0a; let payloadString = String.UTF8.decode(b.buffer);
> ```
>
> That's expected due to OOB in the `b` array. You're accessing index `10001` while the max possible index is `10000`.
You're totally right. When replacing 10001 with 10000, the error disappears.
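For context on why this off-by-one is easy to miss: in plain JS/TS, ordinary typed arrays silently drop out-of-bounds writes instead of trapping the way AssemblyScript does. A small illustration (standard ECMAScript behavior, not AssemblyScript):

```typescript
// In plain JS/TS, a write past the end of a typed array is silently
// ignored, whereas AssemblyScript surfaces it as an `unreachable` trap.
const b = new Uint8Array(10001); // valid indices are 0 .. 10000
b[10001] = 0x0a;                 // out of bounds: silently dropped in JS
// b[10000] = 0x0a; would be the last valid write.
```

So the same off-by-one that traps in AssemblyScript goes completely unnoticed when prototyping the equivalent logic in JavaScript.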
> On a first glimpse, this seems to be missing checks (is it?).

That's what I'm thinking as well.
> When passing an invalid UTF8 string into String.UTF8.decode, I expect it to return an empty string or a string containing the bytes up to the invalid part. In practice, the following code leads to out of bounds memory accesses:
>
> What kind of invalid string? String with length which larger than Uint8Array's size or invalid UTF16 codepoint?
I think it's the latter. I'm currently experimenting a bit to make it more reproducible. One (other?) thing I noticed is that the following code:
```ts
let b = new Uint8Array(32 * 64 * 3);
for (let i = 0; i < b.length; ++i) {
  b[i] = u8(i);
}
let payloadString = String.UTF8.decode(b.buffer);
```
doesn't actually produce a valid UTF-16 string, which is expected in a way, but also weird; I feel like invalid UTF-8 input should cause errors or empty strings or something similar.
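For comparison, the WHATWG `TextDecoder` API offers both of the behaviors mentioned above: by default, invalid bytes become U+FFFD replacement characters, and passing `{ fatal: true }` turns them into thrown errors instead. A Node.js sketch:

```typescript
// How TextDecoder handles invalid UTF-8, for comparison with
// String.UTF8.decode. Default mode substitutes U+FFFD; fatal mode throws.
const bad = new Uint8Array([0x68, 0x69, 0xff]); // "hi" plus an invalid byte

// Lenient (default): invalid byte becomes the replacement character.
const lenient = new TextDecoder("utf-8").decode(bad); // "hi\uFFFD"

// Fatal: invalid input is rejected outright with a TypeError.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bad);
} catch (e) {
  threw = true;
}
```

Either behavior would arguably be preferable to silently producing an invalid UTF-16 string.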
Okay, so here are the things I learned:
- It's easiest to crash if the buffer is allocated from the outside using `__new`
- The following code always crashes for me on the first execution, but not on later ones:
```ts
let b = new Uint8Array(1024 * 64);
b[b.length - 1] = 0xFF;
let payloadString = String.UTF8.decode(b.buffer);
```
This does look to me like a missing out-of-bounds check in the UTF-8 parser; the crashes depend on where the buffer sits in memory. I assume there are no bounds checks when accessing elements of the ArrayBuffer during UTF-8 decoding, so sometimes the decoder runs past the buffer and reads something else, and sometimes it runs past the buffer and crashes because nothing is mapped there.
I looked at decode meanwhile and am not seeing where it would access out of bounds. Each load is guarded by either `while (bufOff < bufEnd)` or `if (bufEnd == bufOff) break;`. Maximum store range is correct. Also tried the reproduction snippet, including allocating differently sized strings, with and without GC collects in between, incl. additional defensive asserts added to decode, without encountering any issues. As such it seems the issue is something else.
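The guard pattern described above can be sketched in TypeScript: every continuation-byte load is preceded by an end-of-buffer check, so a truncated sequence terminates the loop instead of reading past the buffer. This is a simplified sketch (ASCII and 2-byte sequences only), not the actual stdlib implementation:

```typescript
// Simplified sketch of a bounds-guarded UTF-8 decode loop, modeled on
// the guard pattern described above; NOT the actual stdlib code.
function guardedDecode(buf: Uint8Array): string {
  let out = "";
  let off = 0;
  const end = buf.length;
  while (off < end) {               // outer guard for the lead byte
    const b = buf[off++];
    if (b < 0x80) {
      out += String.fromCharCode(b);
    } else if ((b & 0xe0) === 0xc0) {
      if (end === off) break;       // truncated sequence: never read past end
      const b2 = buf[off++];
      out += String.fromCharCode(((b & 0x1f) << 6) | (b2 & 0x3f));
    }
    // longer/invalid sequences: lead byte is simply skipped in this sketch
  }
  return out;
}
```

With these guards in place, a buffer ending in a lead byte (like the 0xFF repro above) simply produces a shorter string rather than an out-of-bounds read.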
I just tinkered around with it a bit more, seems like you're right, the problem is elsewhere. Running the following code repeatedly produces the same (sporadic) errors:
```ts
let b = new Uint8Array(1024 * 64);
b[b.length - 1] = 0xFF;
```
This seemed very weird to me, so I checked all my embedding code, but found no errors there. Then I updated my version of WAMR to the latest release, WAMR-1.1.0, and now everything works! It seems I was hitting a VM bug.
Sorry that I bothered you; it seems there is nothing wrong with AssemblyScript. I should have checked whether the problem persists in other VMs, but I just didn't think it could be that. Thank you for trying to help me fix this :).
For future reference, my previous WAMR version was commit db210fbc66ca779a7bc6748cd340cf5eab3de6d4.