teavm icon indicating copy to clipboard operation
teavm copied to clipboard

Memory corruption when loading resource metadata in WebAssembly

Open dicej opened this issue 2 years ago • 7 comments

When using the WebAssembly target, I've found that using resource metadata (e.g. Unicode tables) causes the subsequent GC to fail. Specifically, GC.sweep either loops infinitely or makes an out-of-bounds memory access.

For example, this code will cause an infinite loop when compiled to Wasm and run:

public class Main {
    public static void main(String[] args) {
        Character.getNumericValue((int) '1');
        System.gc();
    }
}

I've spent some time debugging this, and found that it is related to resource objects created by MetadataIntrinsic and stored in the data segment of the .wasm file. At runtime, references are created to those objects on the stack and/or heap, which confuses GC, leading to the behavior described above.

The following patch to GC.java fixes the issue, although I'm not sure it's an ideal solution:

     private static boolean isMarked(RuntimeObject object) {
-        return (object.classReference & RuntimeObject.GC_MARKED) != 0
+        return object.toAddress().isLessThan(heapAddress())
+                || (object.classReference & RuntimeObject.GC_MARKED) != 0
                 || (!isFullGC && (object.classReference & RuntimeObject.GC_OLD_GENERATION) != 0);
     }

dicej avatar Oct 31 '22 22:10 dicej

This is not a good solution:

  1. There's also C backend, which does not guarantee, that data segment address is less than heap address.
  2. GC is performance critical part, and all extra checks should be avoided.

The proper solution is to find out, why isMarked is called for such objects. There are three options:

  1. It's a GC stack root.
  2. It's referenced from heap object.
  3. It's referenced from heap array.

All three should be carefully avoided. To prevent resource object being a stack root, TeaVM should check which references have resource types and prevent them from adding to shadow stack. As for heap objects, TeaVM should check field types, and prevent fields, referencing resources, from being added to GC fields offsets in class metadata. The latest case is simplest: creation of resource arrays should be avoided in the code. To help that, it's possible to write an analysis that reports error when hits corresponding case in the code.

konsoletyper avatar Nov 01 '22 07:11 konsoletyper

Pushed commit that attempts to fix the issue. Please, check

konsoletyper avatar Nov 01 '22 08:11 konsoletyper

Pushed commit that attempts to fix the issue. Please, check

Thanks. I just tried your commit, but the example I gave above (Character.getNumericValue followed by System.gc) still causes an infinite loop in GC.sweep.

For reference, I'm using the Main.java I gave above, plus this index.html:

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
    <title>foo</title>
    <script src="classes.wasm-runtime.js" type="text/javascript"></script>
  </head>
  <body>
  </body>
</html>

Then, in my web browser (I'm using Firefox, but it should be reproducible in any browser), I load the page and run this from the developer console: await TeaVM.wasm.load("classes.wasm").then(teavm => teavm.main()). That never finishes, and eventually Firefox times it out with the resulting stack trace:

Script terminated by timeout at:
meth_otr_GC_objectSize@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[77]:0x29fc
meth_otr_GC_sweep@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[41]:0x13eb
meth_otr_GC_doCollectGarbage@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[17]:0x921
meth_otr_GC_collectGarbageImpl@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[16]:0x86e
meth_otr_GC_collectGarbageFullImpl@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[14]:0x832
teavm_gc_collectFull@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[13]:0x826
meth_jl_System_gcLowLevel@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[92]:0x336b
meth_jl_System_gc@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[91]:0x3364
meth_Main_main@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[189]:0x6089
meth_otr_Fiber_lambda_startMain_0@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[180]:0x5f00
meth_otr_Fiber_startMain_lambda__25_0_run@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[154]:0x4e95
meth_otr_Fiber_start_0@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[179]:0x5e5c
meth_otr_Fiber_start@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[177]:0x5d8a
meth_otr_Fiber_startMain@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[178]:0x5dfd
teavm_start@http://localhost:8000/classes.wasm-runtime.js line 194 > WebAssembly.instantiate:wasm-function[288]:0x6ac6
wrapExport/TeaVM.wasm@http://localhost:8000/classes.wasm-runtime.js:147:29
createMain/TeaVM.wasm/TeaVM.wasm<@http://localhost:8000/classes.wasm-runtime.js:223:73
createMain/TeaVM.wasm@http://localhost:8000/classes.wasm-runtime.js:207:20
@debugger eval code:2:62
promise callback*@debugger eval code:2:42
@debugger eval code:3:3

dicej avatar Nov 01 '22 16:11 dicej

Ok, tried another attempt

konsoletyper avatar Nov 04 '22 13:11 konsoletyper

Ok, tried another attempt

Thanks again. That seems to have fixed the test case I gave above. However, here's a more complicated example which is still broken: https://github.com/dicej/teavm-memory/blob/ae74daff091d44c66a7e9c92d0f6203655db4b36/src/main/java/Main.java

I'm running that example using await TeaVM.wasm.load("classes.wasm").then(teavm => teavm.main(["1000"])), and it works fine for about the first 180 iterations, but then crashes immediately after the first GC with an invalid table access. Note that it doesn't seem to crash during GC this time -- just afterward. So perhaps you did fix the original issue, but it seems that the fix may have introduced a regression which affects the Wasm function table(s).

dicej avatar Nov 04 '22 20:11 dicej

Then I don't know. But I can't accept your patch, since it's incorrect. C backend uses the same GC and works without problems. So the problem is not in GC, but somewhere around it. My advice: please, look at assertions C backend inserts into code to check GC (I think there was some switch or system property to enable generation of these assertions) and try to recreate similar in Wasm. Without it it's almost impossible to debug GC. Personally, I don't have time and motivation to find bugs in Wasm BE.

konsoletyper avatar Nov 04 '22 22:11 konsoletyper

No worries -- thanks for looking into it. I'll post a PR if I can come up with a better solution.

dicej avatar Nov 04 '22 23:11 dicej

Not sure if further errors are caused by resource loading. I started to port your changes to this repository. I think that heap is corrupted by interop. The pattern with Address.of(array) is slightly misused there. In my C runtime implementation I use this technique to fill a byte buffer, so pass address of data to the buffer, and then read from Java array. But WASI support uses acquired pointer to traverse the results, which is not correct. Since managed array is not used after the call, compiler can remove it from stack roots and thus make it relocatable. So I implement Address.pin method to make sure that array is not relocated until native pointer is in use, and check if it helps. Of course, the proper way to debug GC is to implement assertions - this should not be very hard.

konsoletyper avatar Nov 08 '22 17:11 konsoletyper

Thanks for pointing that out -- Address.pin sounds useful. And if that doesn't help, I can also switch back to allocating buffers using Memory.malloc and copying the array contents into those buffers before traversing.

BTW, the "invalid table" errors I mentioned above were happening on the master branch of this repository (which does not yet have any WASI code in it).

dicej avatar Nov 08 '22 18:11 dicej

Looks like I fixed the issue. Both examples (Character.getNumericValue and PI) work fine for me.

konsoletyper avatar Nov 11 '22 20:11 konsoletyper

Thanks, @konsoletyper! This issue appears to be fixed, although the pi digits test still fails after about 7660 iterations due to https://github.com/konsoletyper/teavm/issues/634.

dicej avatar Nov 15 '22 23:11 dicej