detach_spawn failing flakily on travis
The detach_spawn test from #2600 and #2601 just failed once on travis:
https://travis-ci.org/DynamoRIO/dynamorio/jobs/300006348
281/281 Test #265: code_api|api.detach_spawn ........................................***Failed Required regular expression not found.Regex=[^\.*all done
$
] 90.50 sec
..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................<ERROR: master_signal_handler with no siginfo (i#26?): tid=20526, sig=12>
......................................Timeout after 90 seconds
Xref #2690
I could reproduce this error at a rate of roughly 1 in 1000 runs (or less):
283: Test command: /usr/local/google/home/hgreving/dynamorio/build/bin64/runstats "-s" "90" "-killpg" "-silent" "-env" "LD_LIBRARY_PATH" "/usr/local/google/home/hgreving/dynamorio/build/lib64/debug:/usr/local/google/home/hgreving/dynamorio/build/ext/lib64/debug:" "-env" "DYNAMORIO_OPTIONS" "-stderr_mask 0xC -dumpcore_mask 0 -code_api" "/usr/local/google/home/hgreving/dynamorio/build/suite/tests/bin/api.detach_spawn"
283: Test timeout computed to be: 1500
283: ......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................<Application /usr/local/google/home/hgreving/dynamorio/build/suite/tests/bin/api.detach_spawn (111972). Failed to take over all threads after multiple attempts>
Test #283: code_api|api.detach_spawn ........***Failed Required regular expression not found.Regex=[^.*all done
$
] 0.18 sec
I can reproduce the timeout at about 1 in 10000. It looks like a hang on attach, similar to xref #2601. The thread in the wait loop never leaves it because another thread (which has not exited and still exists in /proc/pid/task) has presumably not received or processed the suspend signal.
The issue is that a thread in its late exit stage acquires the dynamo_vm_areas_lock when freeing its dstack (stack_free -> .. -> release_memory_and_update_areas). There is a small window in write_lock and write_unlock where the writer tid has not yet been written (or has already been cleared) after/before the atomic lock operation. If an asynchronous signal arrives right in that window, the lock is in an undefined state, and a potential reader cannot identify itself as the writer (by writer tid). In this case, a takeover thread is sending takeover signals, and the signal handler ends up querying is_dynamo_address(), which tries to acquire the lock above for reading and hangs.
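To make the window concrete, here is a minimal sketch of a write lock that records the writer tid separately from the atomic lock word (illustrative only, not the actual dynamo_vm_areas_lock code); the commented points are where an asynchronous signal can observe the inconsistent state:

/* Minimal sketch, not DR's actual lock implementation. */
#include <stdatomic.h>

typedef int thread_id_t;

typedef struct {
    atomic_int lock_word;  /* 0 = free, 1 = write-locked */
    thread_id_t writer;    /* tid of the current writer, 0 if none */
} rw_lock_t;

extern thread_id_t get_thread_id(void);

static void
write_lock(rw_lock_t *l)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->lock_word, &expected, 1))
        expected = 0;
    /* <-- window: the lock is held but l->writer is still 0.  A signal
     * arriving here whose handler tries to read-acquire the lock cannot
     * tell that the interrupted thread is the writer, so it blocks. */
    l->writer = get_thread_id();
}

static void
write_unlock(rw_lock_t *l)
{
    l->writer = 0;
    /* <-- window: the writer tid is already cleared but the lock word is
     * still held; the same problem occurs if a signal lands here. */
    atomic_store(&l->lock_word, 0);
}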
The is_dynamo_address() check solves #2688. Perhaps we can solve that a different way and eliminate its need for this lock. If #2688 is limited to a child thread, maybe we can rely on our setting of a valid TLS with an invalid magic field being inherited from the parent, and check whether the thread in question has such a TLS instead of checking its PC.
However, with the new attach request in this issue we would then self-interpret. So we need either the is_dynamo_address() check or some new synchronization that blocks a new attach until all threads have completely exited.
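A rough sketch of the inherited-TLS-magic check floated above; read_thread_register(), d_r_safe_read(), TLS_MAGIC_OFFSET, and TLS_MAGIC_INVALID are stand-ins for whatever DR actually provides (along with DR's byte/uint/bool typedefs), not verified symbols:

/* Hypothetical sketch: decide whether a signaled thread belongs to us by
 * looking for a DR TLS inherited from its parent (with an intentionally
 * invalid magic), instead of calling is_dynamo_address() on its PC and
 * taking the vm-areas lock for reading. */
static bool
thread_has_inherited_dr_tls(void)
{
    byte *tls_base = (byte *)read_thread_register(SEG_TLS); /* placeholder */
    uint magic;
    if (tls_base == NULL)
        return false;
    /* Use a fault-tolerant read: a truly native thread may point its segment
     * at memory we must not assume is mapped. */
    if (!d_r_safe_read(tls_base + TLS_MAGIC_OFFSET, sizeof(magic), &magic))
        return false;
    /* The parent writes an invalid magic before clone so the child can be
     * recognized before it has set up its own TLS. */
    return magic == TLS_MAGIC_INVALID;
}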
Our observation is that in this test, a thread's exit had already started before detach started. During a following attach, either A) the synchronization code has a bug, or B) more likely, the synchronization code intentionally and correctly ignores the exiting thread, but we are missing a barrier that waits for the exiting thread to finish before starting a (re-)takeover.
The synchronization/barrier at detach was inspected, and it turns out that it is currently implemented to (i.e., is supposed to) actually wait for exiting threads. Adding assertions that exiting_thread_count == 0 at the end of detach synchronization, along with printf debugging, showed that the problem is not the barrier. New lead: the suspicion now is a thread that was signaled and taken over (the test has lots of threads). The thread then happens to exit, its thread record is gone, and in the next round of takeover signals it looks like a native thread because it is still in the /proc list.
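For reference, a minimal sketch of the barrier/assertion being described (the actual detach code differs; exiting_thread_count is the DR global mentioned above, and os_thread_yield()/ASSERT() are assumed DR helpers):

/* Sketch only; not the actual detach code. */
extern volatile int exiting_thread_count;
void os_thread_yield(void);

static void
wait_for_exiting_threads(void)
{
    /* Barrier: do not finish detach cleanup or start a fresh takeover while
     * any thread is still in its exit path. */
    while (exiting_thread_count > 0)
        os_thread_yield();
    /* The debugging described above added exactly this invariant check. */
    ASSERT(exiting_thread_count == 0);
}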
I'm confused: if the thread exited how is it still in /proc?
It hasn't exited; it's on its way to exiting. I think it's a thread that's at the exit syscall in DR, past "remove_thread".
Please clean up the code comments and name change:
# FIXME i#2694: failing sometimes on 32-bit Linux.
set(detach_spawn_name api.detach_spawn_FLAKY)
Also, it seems to still fail on CDash:
http://dynamorio.org/CDash/testDetails.php?test=629798&build=41238
Looks like the cdash failure is 32-bit linux.
So was the original Travis failure that was the reason for creating this issue: https://travis-ci.org/DynamoRIO/dynamorio/jobs/300006348
xref #3366
I have not yet been able to find out exactly what is going wrong, but I believe it is 32-bit specific, so I will lower its priority a bit.
- SIGSEGV happens post-syscall on the rax spill instruction mov %eax,%fs:0x0.
- From adding nops post-syscall to do_syscall gencode, I know that the syscall was not a clone syscall. But I do not know what syscall it actually is. It may not matter anyway.
- By adding a nop, I know the SIGSEGV does happen on the rax spill instruction (the TLS looks invalid; on 32-bit, fs is SEG_TLS).
- By adding various printfs, I have never been able to get anything printed from the faulting thread before the SIGSEGV, including in dynamo_start. This suggests that the thread was just cloned, although I don't understand, if the syscall gencode is not clone, what else it would be. I would expect a printf in dynamo_start to show up if the clone had already happened for this thread.
- I can't really save rax pre-syscall in order to get the syscall number, because I can't rely on having TLS since this is the very nature of the bug.
Since 32-bit uses a different TLS method than 64-bit, I currently suspect that the problem is 32-bit specific.
Investigated further:
- I am printing xax at the beginning of dispatch under
  if (get_at_syscall(dcontext) && was_thread_create_syscall(dcontext)),
  which should print the tid of a child when the parent is post-syscall (see the sketch after this list). The tid of the thread that runs into SIGSEGV never shows up.
- I am printing the tid in handle_system_call in dispatch. The tid of the SIGSEGV thread also never shows up.
- I am printing xax at os.c:post_system_call of SYS_clone. Again, the tid does not show up.
- I double-checked and verified that the XIP of the fault is the do_syscall gencode. Basically it is puzzling how the new thread gets there in the first place.
- Thinking about how the new thread may end up in the code cache, I've tried running with -disable_traces -nolink -no_link_ibl. I can still reproduce the bug with the switches.
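The diagnostic print described in the first bullet above looks roughly like this (a sketch assuming DR internals such as print_file(), get_sys_thread_id(), and get_mcontext(); exact fields and placement may differ):

/* Debugging sketch at the top of dispatch(): if the parent is post-syscall for
 * a thread-creation syscall, xax should hold the new child's tid, so print it
 * and compare against the tid later seen in the SIGSEGV. */
if (get_at_syscall(dcontext) && was_thread_create_syscall(dcontext)) {
    print_file(STDERR, "post-clone in parent %d: child tid (xax) = %d\n",
               get_sys_thread_id(), (int)get_mcontext(dcontext)->xax);
}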
Investigated further:
- When inspecting sc->xsp of the thread that runs into SIGSEGV, the value is never at a mapping's boundary; I am getting a delta of about 0x32C0, suggesting that there is something on the stack.
- Inspecting the stack with sc->xsp is inconclusive. Sometimes there are symbols on the stack suggesting the thread was in libpthread, which would make sense.
Questions still open:
- The thread that runs into SIGSEGV seems to receive the signal in the code cache, as pointed out above. However, adding print_file(STDERR) at the beginning of dynamo_start and at new_thread_setup never prints the get_sys_thread_id() seen in the signal handler, suggesting it was never taken over? How does it end up in the code cache then?
- Adding the same print at handle_system_call also does not print the same tid. How does the thread end up at do_syscall gencode?
- It is therefore completely unclear how the thread ends up in the code cache or whether the tid is actually even correct.
Investigated further:
- Moving the faulting instruction pre-syscall has no effect (it does not fault).
- Enabling the checks in get_thread_private_dcontext has no effect either (they don't fire).
- As suggested by Derek, I added a raw clean call (pushing/popping callee-saved registers) in front of the syscall in the gencode, with print_file(STDERR, "was here\n") inside the call. The print does not show up.
Are we actually executing code pre-syscall or are we ending up post-syscall through a different path?
The vsyscall hook targets after_do_shared_syscall_addr and is executed for native threads too, so it may be at the root of what is going on here.
Conclusion (no fix yet): a native thread that has not yet been taken over follows the vsyscall hook in the vdso page, and our post-syscall code, which assumes valid TLS, causes the crash. We can't handle threads that have never been taken over. Such native threads without TLS will follow the hook and crash when spilling to TLS post-syscall, before the jump to the linkstub. More synchronization, or handling of the no-TLS case, is needed.
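As a standalone illustration of why the spill faults for a thread with no DR TLS (this is plain test code, not DR's gencode; on 32-bit Linux %fs is normally a null selector unless something installs a TLS segment there, which is the SEG_TLS setup described above):

/* Minimal standalone demo (compile with gcc -m32): storing through %fs faults
 * with SIGSEGV when no TLS segment has been installed for %fs, which mirrors
 * the post-syscall spill "mov %eax,%fs:0x0" hit by a never-taken-over native
 * thread that followed the vsyscall hook. */
#include <stdio.h>

int
main(void)
{
    int value = 42;
    /* Same shape as the faulting gencode instruction: spill eax to fs:0x0. */
    __asm__ volatile("mov %0, %%fs:0x0" : : "a"(value) : "memory");
    /* Only reached if something has already set up an %fs-based TLS for this
     * thread. */
    printf("no fault: %%fs TLS was already set up\n");
    return 0;
}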
This issue also covers / will cover 32-bit flakiness of the new test api.detach_spawn_stress.
This test fails quite often on CDash, regularly failing quickly with the "Cannot correctly handle a received signal" fatal error, which occurs after the "master_signal_handler with no siginfo" message from the first comment in this issue.
Whoops, sorry: the much more frequent failures are from the "detach_spawn_stress" version that was very recently added. It makes sense that the "stress" version fails more often!
I thought the stress version was marked as _FLAKY, so Travis/Appveyor will ignore failures. CDash currently won't: I thought there was an issue filed on that but I can't find it. I'm not sure it would be easy, because CDash is directly integrated into CTest with no indirection where we can insert scripting. We'd have to find or add flaky support in CDash itself to not send emails, not turn red on flaky failures, or to re-run.
Failed on master merge https://github.com/DynamoRIO/dynamorio/actions/runs/16601475818/job/46962050688 with:
<Application /opt/actions-runner/_work/dynamorio/dynamorio/build/build_debug-internal-64/suite/tests/bin/api.detach_spawn (29069). Cannot correctly handle received signal 7 in thread 29635: default action in native thread.>
Failed again on master merge https://github.com/DynamoRIO/dynamorio/actions/runs/16669139799/attempts/1 with:
<Application /opt/actions-runner/_work/dynamorio/dynamorio/build/build_debug-internal-64/suite/tests/bin/api.detach_spawn (29390). Cannot correctly handle received signal 7 in thread 30809: default action in native thread.>