Intermittent Hangs in CI
After some kind of runtime environment update to the GitHub Actions CI runners (as of Wednesday 9/25), we're now seeing very occasional hangs during the CI tests. This affects all Linux build configurations and happens in either the basics/arbitrary_blocking_operation test or the features/qlfqueue test.
This will probably be a pain to debug, so I don't think it's feasible to get it fixed before the upcoming release. However, I have confirmed that these new failures are not caused by the new code in the release; it's an existing issue that only surfaces in the newer build environment.
This seems to have gone away on its own. My best guess is that #297 fixed it, but I'm still not entirely sure.
Just saw this again the day after closing it, but it's not showing up today. Perhaps the timeouts just need to be increased, since running on a busier CI instance could plausibly push a slow run past them. Leaving it closed for now, but I'm still keeping an eye out for it.
Alright, we're still seeing these even after increasing the timeout thresholds. See https://github.com/sandialabs/qthreads/actions/runs/11448461793/job/31852093932?pr=305 for an example.
I wonder if this is related to https://github.com/sandialabs/qthreads/issues/267#issuecomment-2423163776 as well. Plausible, but currently unclear.
Figured out how to dig a backtrace out of this! Thanks to https://stackoverflow.com/a/8657833.
The winning incantation ends up being:
for ((i=0; i<2000; i++)); do QT_NUM_SHEPHERDS=2 QT_NUM_WORKERS_PER_SHEPHERD=1 gdb -ex='set confirm on' -ex='set disable-randomization off' -ex=r -ex=quit --args ./arbitrary_blocking_operation && echo "$i"; done
From there you wait for one of the runs to hang, interrupt it with Ctrl-C, decline gdb's prompt to quit, and then grab backtraces for all of the threads from the stopped state.
It still has to go through many runs to hang, and then it's a pain to interrupt the loop after successfully getting a backtrace, but at least this works!
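In case it's useful later, the gdb commands for that last step are just the stock ones, nothing qthreads-specific:
info threads
thread apply all bt
The first lists the stuck threads and the second dumps a backtrace for every one of them in one shot.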
Here's a cleaned-up example result from arbitrary_blocking_operation in the sherwood/binders config:
Thread 2:
#0 sched_yield () from libc.so
#1 qt_scheduler_get_thread () from libqthread.so
#2 qthread_master () from libqthread.so
#3 start_thread () from libpthread.so
#4 clone () from libc.so
Thread 1:
#0 sched_yield () from libc.so
#1 qt_scheduler_get_thread () from libqthread.so
#2 qthread_master () from libqthread.so
#3 ?? ()
Another example from arbitrary_blocking_operation in the sherwood/binders config (with more details and line numbers this time):
Thread 2:
#0 sched_yield () from libc.so.6
#1 in qthread_steal at qthreads/src/threadqueues/sherwood_threadqueues.c:900
#2 qt_scheduler_get_thread (q=q@entry=, qc=qc@entry=0x0, active=1 '\001') at qthreads/src/threadqueues/sherwood_threadqueues.c:672
#3 in qthread_master at qthreads/src/qthread.c:363
#4 in start_thread () from libpthread.so.0
#5 in clone () from libc.so.6
Thread 1:
#0 in sched_yield () from libc.so.6
#1 in qthread_steal at qthreads/src/threadqueues/sherwood_threadqueues.c:900
#2 qt_scheduler_get_thread (q=q@entry=, qc=qc@entry=0x0, active=1 '\001') at qthreads/src/threadqueues/sherwood_threadqueues.c:672
#3 qthread_master at qthreads/src/qthread.c:363
#4 0x0000000000000000 in ?? ()
Just saw a similar hang in the qthread_readstate test, specifically with the nvc/sherwood/hwloc config. It's very likely related; I'll see if I can reproduce it locally.
Managed to reproduce this with work stealing disabled (though it's much rarer in that case), so the bug is very likely in the sherwood implementation of qt_scheduler_get_thread itself rather than in the stealing path. Both threads end up stuck spinning at https://github.com/sandialabs/qthreads/blob/4396ce86b0128d13584fe992cc234b934a6cbdc6/src/threadqueues/sherwood_threadqueues.c#L720.
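For anyone reading along without the source handy, the shape of the loop both threads are parked in is roughly the following. This is a hand-written sketch with made-up names (fake_queue_t, qlength, wait_for_work), not the actual code at that link:

#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the real thread queue type. */
typedef struct { atomic_long qlength; } fake_queue_t;

/* Spin and yield until some work shows up. If the task that should wake
 * these workers is lost, or is itself waiting on one of them, neither loop
 * ever exits and both threads sit in sched_yield forever, which matches the
 * backtraces above. */
static void wait_for_work(fake_queue_t *q) {
  while (atomic_load(&q->qlength) == 0) { sched_yield(); }
}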
For future reference, here's the version that breaks out of the loop more easily (the -return-child-result flag makes gdb exit with the test's exit status, so the || break fires on the first run that fails or gets killed while hung):
for ((i=0; i<20000; i++)); do QT_NUM_SHEPHERDS=2 QT_NUM_WORKERS_PER_SHEPHERD=1 gdb -return-child-result -ex='set confirm on' -ex='set disable-randomization off' -ex=r -ex=quit --args ./build/test/basics/arbitrary_blocking_operation && echo "$i" || break; done || exit 1
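Side note: another way to grab the same backtraces, if interrupting the foreground gdb gets annoying again, would be to attach to the hung test from a second terminal and dump every thread non-interactively (standard gdb and pgrep usage, nothing qthreads-specific):
gdb -batch -p "$(pgrep -nf arbitrary_blocking_operation)" -ex 'thread apply all bt'
The -nf picks the newest process whose full command line matches, so this assumes only one copy of the test is running at a time.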
Tentatively closing this as fixed. #338 seems to have done the trick, at least for the arbitrary_blocking_operation test. I was never able to reproduce any of the other seemingly related failures.
In theory this can still happen any time there's a severe enough performance distortion on the system we're running on, but that patch moves it from "rare" to "this should basically never happen" territory.
The way we handle work items and system threads in the io subsystem seems a bit odd to me, but I don't have a clear idea of a better way to handle it yet. If I manage to cobble together a plan for making that code more robust going forward, I'll open a separate issue for it.
Looks like the qthread_readstate failure is still there. It just showed up in one of the arm/gcc/distrib/no-topology builds. My guess is that it's not actually related to the arbitrary_blocking_operation failure, so I'll create a separate issue for it.
Just saw this somehow show up again in https://app.circleci.com/pipelines/github/insertinterestingnamehere/qthreads/759/workflows/83b3eed1-f4bf-46e8-86e3-dab1b892cd65/jobs/58881. Seems like an anomaly, but the io subsystem should still be more robust to that. For now I'm just documenting it here until I get an idea of how to approach it.