sel4test test MULTICORE_0004 sometimes hangs

The test is already marked as flaky. If at all possible we should make this non-flaky.

There is also a comment inside the test that says it can never properly fail and would crash if not succesful. The hang behaviour we're seeing might be such a crash, and the issue might therefore be real (or just a race in the test).

I've seen this mostly come up on sabre, but not sure it was only there. (Edit: some specific configs are SABRE_verification_SMP_clang_32 *1 in 15 runs, IMX8MM_EVK_release_SMP_clang_64 * 2 in the same 15 runs)

Edit: also frequently on zynqm and zynqmp106 for HYP + clang + SMP.

Aug 13 '21 01:08 lsf37

This test is trying to check that the fifo kernel lock works. If the test is timing out, then that implies that the main thread isn't able to make progress, indicating that the fifo lock isn't working.

Aug 13 '21 05:08 kent-mcleod

Not sure if this will help or hinder future investigation, but IMX8MQ_EVK_release_SMP_hyp_clang_64 is reliably hanging on MULTICORE_0004. Most other boards show this only very sporadically. Might be a different reason here (works with gcc, for instance, or in debug mode), but I thought it worth pointing out.

I'll be disabling this test for clang on this board, because enabling it was only an experiment (the hyp+smp combination on that board used to be not enabled/supported before we tried it).

Jun 01 '23 07:06 lsf37

In https://github.com/seL4/seL4/pull/1109#issuecomment-1749063200 a new failure mode for MULTICORE0004 popped up, triggering an assertion instead:

<testcase classname="sel4test" name="MULTICORE0004">
  Running test MULTICORE0004 (Test core stalling is behaving properly (flaky))
  seL4 failed assertion 'isRunnable(NODE_STATE(ksCurThread))' at /github/workspace/kernel/include/kernel/thread.h:291 in function checkBudgetRestart
  halting...
  Kernel entry via Syscall, number: 11, Yield

ODROID_C4_debug_SMP_MCS_clang_64 FAILED

Unclear to me if this is related to the hanging issue or not. In any case it might be useful for debugging to know that this assertion can be triggered in the test.

Edit: this is rare, but there was one more occurrence for https://github.com/seL4/seL4/commit/3c2c5ba18a87371bb289d1cc0ff561ae61d008db in this run for TQMA8XQP1GB_debug_SMP_MCS_clang_64.

Oct 05 '23 21:10 lsf37

I'm not sure if it's a new failure mode, the others happened for release builds.

In the test the tasks on other cores are calling seL4_Yield() in a loop, while the main thread suspends them and then cleans them up. Because the kernel entry is via a yield call, the assert must be triggered by the MCS_DO_IF_BUDGET() at the start of handleSyscall(). This assert seems to proof that remoteTCBStall() is broken.

Oct 06 '23 09:10 Indanz

Entirely possibly that it is the same problem (that would be great), I just hadn't seen this failure mode yet.

Oct 08 '23 22:10 lsf37

Adding as an observation here that I have seen this most frequently in the combination SMP+MCS+clang in the last few months.

Jun 18 '24 03:06 lsf37

Ok, so some more bizarre observations.

The seL4 commit c28e52a1ec made the test MULTICORE0004 hang almost reliably on the tqma board in the config SMP_MCS_release_clang (and only that config, all other combinations worked fine).
The CI test is built with the setting BAMBOO on, which sets CONFIG_PRINT_XML, which prints XML output and buffers that output. The exact same build with BAMBOO off works fine, I could not get the test to hang. With BAMBOO on, it hangs (almost?) every time.
The unrelated commit 2648df4 fixes it. I can't get the test to hang in any config any more (on 5 runs -- the other case above never needed more than 2).

Both commits are entirely unrelated, 2648df4 is not even in any code that is called, and it is an equivalence transformation in the contexts it can be called. The BAMBOO setting doesn't have any influence on the kernel at all (it is not visible), only on user-space sel4test, and this particular test does not print anything. The only printing calls are before and after.

Jun 21 '24 10:06 lsf37

All that you say seems to confirm what I suspect: The locking code is broken (sometimes miscompiled?), and whether that causes problems or not is timing dependent, so any small change can either trigger it or not. That also makes it very hard to figure out whether any change actually fixes the problem or not.

It would be good to add a time check to see whether MULTICORE_0004 takes too long to complete or not. If it doesn't finish quickly, it's also a sign that the locking is broken. Perhaps this would give us a reliable way to detect whether there is a problem, instead of the current rarer case of total hang.

Because the locking code is inlined it is very hard to review the generated assembly: We don't know which path is buggy for sure. The one I checked seemed fine, as far as I can tell.

I want to dust off https://github.com/seL4/seL4/pull/871, make it out-of-line, convert it to stdatomic instead of compiler intrinsics and see if the bug still happens. It would be good to check the locking algorithm with cppmem too. I suspect rewriting to stdatomic will make the problem go away.

Jun 21 '24 11:06 Indanz

Not sure if it's the same issue (might be relevant), but there was similar behaviour that we had to do this. The code has changed since then

Jul 24 '24 08:07 heshamelmatary

Not sure if it's the same issue (might be relevant), but there was similar behaviour that we had to do this. The code has changed since then

That commit seems to paper over a bug where __atomic_exchange_n can take an indeterminate of time, so to me sounds like a work-around for a broken locking implementation instead of a proper fix. I fixed that with this commit.

Aug 05 '24 11:08 Indanz

sel4test sel4test copied to clipboard

test MULTICORE_0004 sometimes hangs

sel4test
sel4test copied to clipboard