flux-core
t4000-issues "free-range-test" failures in CI
This test has been failing sporadically in CI.
The test kills rank 3 of a size=4 instance with SIGKILL and expects the instance to still be able to run a 3-node job. That part seems to work, but the instance sometimes exits with wait status=9 (Killed) instead of the expected status=0.
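For anyone reading the log below, it may help to see how the two numbers relate: `finish status=9` and `rc=137`. This is a minimal sketch, assuming the usual Linux wait(2) status encoding and the common shell convention of reporting a signal death as 128 + signal number. The helper name `describe_wait_status` is made up for illustration and is not part of flux-core.

```python
import os
import signal


def describe_wait_status(status: int) -> str:
    """Decode a raw wait(2) status (hypothetical helper, not a Flux API)."""
    if os.WIFSIGNALED(status):
        # The process was terminated by a signal; shells report 128 + signo.
        sig = os.WTERMSIG(status)
        return f"killed by {signal.Signals(sig).name}, shell rc={128 + sig}"
    if os.WIFEXITED(status):
        # The process called exit(); the exit code sits in the high byte.
        return f"exited normally, rc={os.WEXITSTATUS(status)}"
    return "stopped or continued"


print(describe_wait_status(9))         # the status seen in the failing run
print(describe_wait_status(0))         # the status the test expects
print(describe_wait_status(137 << 8))  # a process that itself exited with code 137
```

Under those assumptions, a wait status of 9 means the process was terminated by SIGKILL, and a shell would report that as 137 (128 + 9), which matches the `rc=137` lines in the log.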
free-range-test: Re-launching test script under flux-start
free-range-test: Starting a child instance with flat topology
free-range-test: Started job fuKmers
free-range-test: Current overlay status of fuKmers:
0 fv-az1196-271: full
├─ 1 fv-az1196-271: full
├─ 2 fv-az1196-271: full
└─ 3 fv-az1196-271: full
free-range-test: Launch a sleep job within fuKmers:
f2A3iXBV
JOBID USER ST NTASKS NNODES RUNTIME
. runner R 4 4 6.352s .
fuKmers runner R 4 4 3.782s └── flux
f2A3iXBV runner R 4 4 0.404s └── sleep
free-range-test: Killing rank 3 (pid 146203) and all children
free-range-test: Wait for exception event in fuKmers
flux-start: 3 (pid 146203) Killed
Sep 26 03:27:30.610493 UTC broker.err[0]: fv-az1196-271 (rank 3) failed
1727321250.599241 exception type="node-failure" severity=2 userid=1001 note="shell rank 3 (on fv-az1196-271): Killed"
free-range-test: But running a 3 node job in fuKmers still works:
Sep 26 03:27:30.713759 UTC broker.err[0]: dead to Flux: fv-az1196-271 (rank 3)
fv-az1196-271
fv-az1196-271
fv-az1196-271
free-range-test: Overlay status of fuKmers should show rank lost:
0 fv-az1196-271: degraded
├─ 1 fv-az1196-271: full
├─ 2 fv-az1196-271: full
└─ 3 fv-az1196-271: lost lost connection
free-range-test: Call flux shutdown on fuKmers
free-range-test: job fuKmers should exit cleanly (no hang) and a zero exit code:
1727321253.658158 finish status=9
free-range-test: dump output from job:
4.081s: job.exception type=node-failure severity=2 shell rank 3 (on fv-az1196-271): Killed
flux-job: job shell Killed
Sep 26 03:27:31.108645 UTC broker.err[0]: fv-az1196-271 (rank 3) failed
Sep 26 03:27:31.209570 UTC broker.err[0]: dead to Flux: fv-az1196-271 (rank 3)
Sep 26 03:27:33.725778 UTC broker.err[0]: rc2.0: /usr/src/t/issues/t4583-free-range-test.sh Exited (rc=137) 8.1s
flux-start: 0 (pid 146196) exited with rc=137
Sep 26 03:27:35.870980 UTC broker.err[0]: rc2.0: /usr/src/t/issues/t4583-free-range-test.sh Exited (rc=137) 12.0s
not ok 8 - t4583-free-range-test