sel4test
sel4test copied to clipboard
test MULTICORE_0004 sometimes hangs
The test is already marked as flaky
. If at all possible we should make this non-flaky.
There is also a comment inside the test that says it can never properly fail and would crash if not succesful. The hang behaviour we're seeing might be such a crash, and the issue might therefore be real (or just a race in the test).
I've seen this mostly come up on sabre
, but not sure it was only there. (Edit: some specific configs are SABRE_verification_SMP_clang_32
*1 in 15 runs, IMX8MM_EVK_release_SMP_clang_64
* 2 in the same 15 runs)
Edit: also frequently on zynqm
and zynqmp106
for HYP
+ clang
+ SMP
.
This test is trying to check that the fifo kernel lock works. If the test is timing out, then that implies that the main thread isn't able to make progress, indicating that the fifo lock isn't working.
Not sure if this will help or hinder future investigation, but IMX8MQ_EVK_release_SMP_hyp_clang_64
is reliably hanging on MULTICORE_0004. Most other boards show this only very sporadically. Might be a different reason here (works with gcc, for instance, or in debug mode), but I thought it worth pointing out.
I'll be disabling this test for clang on this board, because enabling it was only an experiment (the hyp+smp combination on that board used to be not enabled/supported before we tried it).
In https://github.com/seL4/seL4/pull/1109#issuecomment-1749063200 a new failure mode for MULTICORE0004 popped up, triggering an assertion instead:
<testcase classname="sel4test" name="MULTICORE0004">
Running test MULTICORE0004 (Test core stalling is behaving properly (flaky))
seL4 failed assertion 'isRunnable(NODE_STATE(ksCurThread))' at /github/workspace/kernel/include/kernel/thread.h:291 in function checkBudgetRestart
halting...
Kernel entry via Syscall, number: 11, Yield
ODROID_C4_debug_SMP_MCS_clang_64 FAILED
Unclear to me if this is related to the hanging issue or not. In any case it might be useful for debugging to know that this assertion can be triggered in the test.
Edit: this is rare, but there was one more occurrence for https://github.com/seL4/seL4/commit/3c2c5ba18a87371bb289d1cc0ff561ae61d008db in this run for TQMA8XQP1GB_debug_SMP_MCS_clang_64.
I'm not sure if it's a new failure mode, the others happened for release builds.
In the test the tasks on other cores are calling seL4_Yield()
in a loop, while the main thread suspends them and then cleans them up. Because the kernel entry is via a yield call, the assert must be triggered by the MCS_DO_IF_BUDGET()
at the start of handleSyscall()
. This assert seems to proof that remoteTCBStall()
is broken.
Entirely possibly that it is the same problem (that would be great), I just hadn't seen this failure mode yet.
Adding as an observation here that I have seen this most frequently in the combination SMP+MCS+clang in the last few months.
Ok, so some more bizarre observations.
- The seL4 commit c28e52a1ec made the test MULTICORE0004 hang almost reliably on the
tqma
board in the configSMP_MCS_release_clang
(and only that config, all other combinations worked fine). - The CI test is built with the setting
BAMBOO
on, which setsCONFIG_PRINT_XML
, which prints XML output and buffers that output. The exact same build withBAMBOO
off works fine, I could not get the test to hang. WithBAMBOO
on, it hangs (almost?) every time. - The unrelated commit 2648df4 fixes it. I can't get the test to hang in any config any more (on 5 runs -- the other case above never needed more than 2).
Both commits are entirely unrelated, 2648df4 is not even in any code that is called, and it is an equivalence transformation in the contexts it can be called. The BAMBOO
setting doesn't have any influence on the kernel at all (it is not visible), only on user-space sel4test
, and this particular test does not print anything. The only printing calls are before and after.
All that you say seems to confirm what I suspect: The locking code is broken (sometimes miscompiled?), and whether that causes problems or not is timing dependent, so any small change can either trigger it or not. That also makes it very hard to figure out whether any change actually fixes the problem or not.
It would be good to add a time check to see whether MULTICORE_0004
takes too long to complete or not. If it doesn't finish quickly, it's also a sign that the locking is broken. Perhaps this would give us a reliable way to detect whether there is a problem, instead of the current rarer case of total hang.
Because the locking code is inlined it is very hard to review the generated assembly: We don't know which path is buggy for sure. The one I checked seemed fine, as far as I can tell.
I want to dust off https://github.com/seL4/seL4/pull/871, make it out-of-line, convert it to stdatomic instead of compiler intrinsics and see if the bug still happens. It would be good to check the locking algorithm with cppmem too. I suspect rewriting to stdatomic will make the problem go away.
Not sure if it's the same issue (might be relevant), but there was similar behaviour that we had to do this. The code has changed since then
Not sure if it's the same issue (might be relevant), but there was similar behaviour that we had to do this. The code has changed since then
That commit seems to paper over a bug where __atomic_exchange_n
can take an indeterminate of time, so to me sounds like a work-around for a broken locking implementation instead of a proper fix. I fixed that with this commit.