flux-core

t4000-issues "free-range-test" failures in CI

Open grondo opened this issue 4 months ago • 0 comments

This test has been failing sporadically in CI.

The test kills the rank 3 broker of a size=4 instance with SIGKILL and expects the instance to still be able to run a 3-node job. That part seems to work, but the instance sometimes exits with wait status=9 (Killed) instead of the expected status=0.

  free-range-test: Re-launching test script under flux-start
  free-range-test: Starting a child instance with flat topology
  free-range-test: Started job fuKmers
  free-range-test: Current overlay status of fuKmers:
  0 fv-az1196-271: full
  ├─ 1 fv-az1196-271: full
  ├─ 2 fv-az1196-271: full
  └─ 3 fv-az1196-271: full
  free-range-test: Launch a sleep job within fuKmers:
  f2A3iXBV
         JOBID USER     ST NTASKS NNODES  RUNTIME
             . runner    R      4      4   6.352s .
       fuKmers runner    R      4      4   3.782s └── flux
      f2A3iXBV runner    R      4      4   0.404s     └── sleep
  free-range-test: Killing rank 3 (pid 146203) and all children
  free-range-test: Wait for exception event in fuKmers
  flux-start: 3 (pid 146203) Killed
  Sep 26 03:27:30.610493 UTC broker.err[0]: fv-az1196-271 (rank 3) failed
  1727321250.599241 exception type="node-failure" severity=2 userid=1001 note="shell rank 3 (on fv-az1196-271): Killed"
  free-range-test: But running a 3 node job in fuKmers still works:
  Sep 26 03:27:30.713759 UTC broker.err[0]: dead to Flux: fv-az1196-271 (rank 3)
  fv-az1196-271
  fv-az1196-271
  fv-az1196-271
  free-range-test: Overlay status of fuKmers should show rank lost:
  0 fv-az1196-271: degraded
  ├─ 1 fv-az1196-271: full
  ├─ 2 fv-az1196-271: full
  └─ 3 fv-az1196-271: lost lost connection
  free-range-test: Call flux shutdown on fuKmers
  free-range-test: job fuKmers should exit cleanly (no hang) and a zero exit code:
  1727321253.658158 finish status=9
  free-range-test: dump output from job:
  
  4.081s: job.exception type=node-failure severity=2 shell rank 3 (on fv-az1196-271): Killed
  flux-job: job shell Killed
  Sep 26 03:27:31.108645 UTC broker.err[0]: fv-az1196-271 (rank 3) failed
  
  Sep 26 03:27:31.209570 UTC broker.err[0]: dead to Flux: fv-az1196-271 (rank 3)
  
  Sep 26 03:27:33.725778 UTC broker.err[0]: rc2.0: /usr/src/t/issues/t4583-free-range-test.sh Exited (rc=137) 8.1s
  flux-start: 0 (pid 146196) exited with rc=137
  Sep 26 03:27:35.870980 UTC broker.err[0]: rc2.0: /usr/src/t/issues/t4583-free-range-test.sh Exited (rc=137) 12.0s
  not ok 8 - t4583-free-range-test
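
For anyone reading the log above, here is a minimal sketch of how the two numbers relate. It assumes `finish status=9` is a raw POSIX wait status (as the "wait status=9 (Killed)" wording suggests) and that `rc=137` follows the usual shell convention of 128 + signal number. The helper names `describe_wait_status` and `shell_exit_code` are made up for illustration and are not part of flux-core.

```python
import os
import signal


def describe_wait_status(status: int) -> str:
    """Decode a raw POSIX wait status (assumed format of 'finish status=N')."""
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        return f"killed by signal {sig} ({signal.Signals(sig).name})"
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    return f"unrecognized wait status {status}"


def shell_exit_code(sig: int) -> int:
    """Exit code a POSIX shell conventionally reports for a signaled child."""
    return 128 + int(sig)


# The status seen in the failing run, and the status the test expects.
print(describe_wait_status(9))
print(describe_wait_status(0))
# The rc reported by rc2 and flux-start in the log.
print(shell_exit_code(signal.SIGKILL))
```

Under those assumptions, status=9 and rc=137 both point to the same SIGKILL rather than to two separate failures.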

grondo, Sep 26 '24 13:09