emscripten icon indicating copy to clipboard operation
emscripten copied to clipboard

test_pthread_proxying_canceled_work: RuntimeError: operation does not support unaligned accesses

Open sbc100 opened this issue 8 months ago • 2 comments

I've seen this a few times in CI recently.

======================================================================
FAIL: test_pthread_proxying_canceled_work (test_core.lsan)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/chrome-bot/.cache/vpython-root.1000/store/cpython+tkj8tpr2n0oj8m0t270p0m71s0/contents/lib/python3.8/unittest/case.py", line 60, in testPartExecutor
    yield
  File "/home/chrome-bot/.cache/vpython-root.1000/store/cpython+tkj8tpr2n0oj8m0t270p0m71s0/contents/lib/python3.8/unittest/case.py", line 676, in run
    self._callTestMethod(testMethod)
  File "/home/chrome-bot/.cache/vpython-root.1000/store/cpython+tkj8tpr2n0oj8m0t270p0m71s0/contents/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
    method()
  File "/b/s/w/ir/x/w/install/emscripten/test/common.py", line 348, in decorated
    f(self, *args, **kwargs)
  File "/b/s/w/ir/x/w/install/emscripten/test/test_core.py", line 2484, in test_pthread_proxying_canceled_work
    self.do_run_in_out_file_test(
  File "/b/s/w/ir/x/w/install/emscripten/test/common.py", line 1931, in do_run_in_out_file_test
    output = self._build_and_run(srcfile, expected, **kwargs)
  File "/b/s/w/ir/x/w/install/emscripten/test/common.py", line 1969, in _build_and_run
    js_output = self.run_js(js_file, engine, args,
  File "/b/s/w/ir/x/w/install/emscripten/test/common.py", line 1556, in run_js
    self.fail('JS subprocess failed (%s): %s (expected=%s).  Output:\n%s' % (error.cmd, error.returncode, assert_returncode, ret))
  File "/home/chrome-bot/.cache/vpython-root.1000/store/cpython+tkj8tpr2n0oj8m0t270p0m71s0/contents/lib/python3.8/unittest/case.py", line 753, in fail
    raise self.failureException(msg)
AssertionError: JS subprocess failed (/b/s/w/ir/cache/builder/emscripten-releases/node-v20.14.0-linux-x64/bin/node --stack-trace-limit=50 --trace-uncaught /b/s/w/ir/x/t/emtest_jarzrh1c/emscripten_test_lsan_7vafnkyn/test_pthread_proxying_canceled_work.js): 1 (expected=0).  Output:
testing cancel followed by proxy
wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1


RuntimeError: operation does not support unaligned accesses
    at test_pthread_proxying_canceled_work.wasm.a_cas (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:wasm-function[186]:0x5205)
    at test_pthread_proxying_canceled_work.wasm.__private_cond_signal (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:wasm-function[192]:0x52a9)
    at test_pthread_proxying_canceled_work.wasm.pthread_cond_signal (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:wasm-function[174]:0x4ade)
    at test_pthread_proxying_canceled_work.wasm.emscripten_proxy_finish (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:wasm-function[128]:0x3c87)
    at test_pthread_proxying_canceled_work.wasm.call_then_finish_task (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:wasm-function[141]:0x422d)
    at test_pthread_proxying_canceled_work.wasm.call_with_ctx (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:wasm-function[136]:0x40de)
    at test_pthread_proxying_canceled_work.wasm.em_task_queue_execute (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:wasm-function[110]:0x3506)
    at test_pthread_proxying_canceled_work.wasm.receive_notification (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:wasm-function[119]:0x3873)
    at test_pthread_proxying_canceled_work.wasm.em_task_queue_execute (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:wasm-function[110]:0x3506)
    at test_pthread_proxying_canceled_work.wasm._emscripten_check_mailbox (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:wasm-function[291]:0x702d)
    at /b/s/w/ir/x/t/emtest_jarzrh1c/emscripten_test_lsan_7vafnkyn/test_pthread_proxying_canceled_work.js:1117:12
    at callUserCallback (/b/s/w/ir/x/t/emtest_jarzrh1c/emscripten_test_lsan_7vafnkyn/test_pthread_proxying_canceled_work.js:4730:5)
    at checkMailbox (/b/s/w/ir/x/t/emtest_jarzrh1c/emscripten_test_lsan_7vafnkyn/test_pthread_proxying_canceled_work.js:4761:5)
Thrown at:
    at $a_cas (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1:20998)
    at $__private_cond_signal (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1:21162)
    at $pthread_cond_signal (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1:19167)
    at $emscripten_proxy_finish (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1:15496)
    at $call_then_finish_task (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1:16942)
    at $call_with_ctx (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1:16607)
    at $em_task_queue_execute (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1:13575)
    at $receive_notification (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1:14452)
    at $em_task_queue_execute (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1:13575)
    at $_emscripten_check_mailbox (wasm://wasm/test_pthread_proxying_canceled_work.wasm-0010c28e:1:28718)

sbc100 avatar Mar 13 '25 23:03 sbc100

@tlively any idea what this might be?

sbc100 avatar Mar 13 '25 23:03 sbc100

This is pretty weird. It's failing while trying to finish synchronously proxied work on a proxy worker that has already been cancelled. That work should never have been sent to the cancelled proxy worker to begin with.

To send the work, the proxying thread has to increment the refcount on the proxy worker thread's mailbox with emscripten_thread_mailbox_ref:

https://github.com/emscripten-core/emscripten/blob/106a27c1018cf3c6104541eb176b9d0182805841/system/lib/pthread/thread_mailbox.c#L18-L33

But this should fail if the proxy worker thread's mailbox has been closed, which it should have been. _emscripten_thread_mailbox_shutdown synchronously waits for the mailbox to be closed:

https://github.com/emscripten-core/emscripten/blob/106a27c1018cf3c6104541eb176b9d0182805841/system/lib/pthread/thread_mailbox.c#L50-L57

_emscripten_thread_mailbox_shutdown is called as one of the first steps in _emscripten_thread_exit:

https://github.com/emscripten-core/emscripten/blob/106a27c1018cf3c6104541eb176b9d0182805841/system/lib/pthread/pthread_create.c#L307

So either this mailbox closing mechanism is failing somehow and the mailbox is being resurrected, or the thread is not fully cancelled (specifically never got to the mailbox shutdown step) before the proxying happens in the test.

The test ensures that the proxy worker thread has been fully cancelled by registering a TLS key with a destructor called set_flag:

https://github.com/emscripten-core/emscripten/blob/106a27c1018cf3c6104541eb176b9d0182805841/test/pthread/test_pthread_proxying_canceled_work.c#L348

The proxy worker thread is passed the address of a canceled flag:

https://github.com/emscripten-core/emscripten/blob/106a27c1018cf3c6104541eb176b9d0182805841/test/pthread/test_pthread_proxying_canceled_work.c#L68-L69

Which it then associates with the TLS key before cancelling itself.

https://github.com/emscripten-core/emscripten/blob/106a27c1018cf3c6104541eb176b9d0182805841/test/pthread/test_pthread_proxying_canceled_work.c#L49-L51

set_flag then writes a 1 into the flag address on the next turn of the event loop, which should be after the thread has been cancelled and the mailbox closed:

https://github.com/emscripten-core/emscripten/blob/106a27c1018cf3c6104541eb176b9d0182805841/test/pthread/test_pthread_proxying_canceled_work.c#L36-L44

The test does not continue on to perform the proxying until this flag has been set:

https://github.com/emscripten-core/emscripten/blob/106a27c1018cf3c6104541eb176b9d0182805841/test/pthread/test_pthread_proxying_canceled_work.c#L71-L73

To investigate this further, I might punch through some abstractions to assert in the test that the proxy worker's thread mailbox is closed after that while (!canceled) loop. I would expect that assertion to fail precisely whenever this error you're seeing would occur. From there I would drill down and try to rule out the possibility that the flag is simply set before the cancelled thread closes its mailbox. If the bug is not there, then it must be in the mailbox shutdown mechanism itself.

tlively avatar Mar 14 '25 22:03 tlively