not ok - tbon.endpoint cannot be set
I've been seeing this failure regularly in CI, mainly in the inception builder for some reason:
2024-10-01T20:41:01.3261595Z flux-broker: setattr tbon.endpoint: File exists
2024-10-01T20:41:01.3262060Z
2024-10-01T20:41:01.3262573Z flux-start: 0: PMI_Abort(): fatal bootstrap error
2024-10-01T20:41:01.3263033Z
2024-10-01T20:41:01.3268384Z test_must_fail: died by non-SIGTERM signal: flux start -o,-Sbroker.rc1_path=,-Sbroker.rc3_path= -s2 -o,--setattr=tbon.endpoint=ipc:///tmp/customflux /bin/true
2024-10-01T20:41:01.3269615Z
2024-10-01T20:41:01.3270381Z not ok 18 - tbon.endpoint cannot be set
2024-10-01T20:41:01.3271468Z not ok 18 - tbon.endpoint cannot be set
2024-10-01T20:41:01.3272240Z #
2024-10-01T20:41:01.3272847Z # test_must_fail_or_be_terminated flux start ${ARGS} -s2 \
2024-10-01T20:41:01.3273804Z # -o,--setattr=tbon.endpoint=ipc:///tmp/customflux /bin/true
2024-10-01T20:41:01.3275015Z #
2024-10-01T20:41:01.3275224Z
Well, this test is racy in the sense that
- each broker sends the abort message, then exits with a code of 1
- flux-start receives the abort message(s) and sends SIGKILL to both brokers
So flux-start should exit with a code of either 1 or 137, and both are allowed by test_must_fail_or_be_terminated.
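For reference, 137 is the shell's way of reporting death by SIGKILL: a status above 128 conventionally means "killed by signal (status - 128)". A minimal sketch of that decoding, where `describe_status` is a hypothetical helper made up for illustration and not something from sharness or flux-core:

```shell
#!/bin/sh
# describe_status is a hypothetical helper for illustration only; it is not
# part of sharness or flux-core.
# Assumption: a status above 128 means "killed by signal (status - 128)".
# This is the usual shell convention, but it cannot be told apart from a
# program that deliberately exits with such a code.
describe_status() {
    rc=$1
    if [ "$rc" -gt 128 ]; then
        echo "killed by signal $((rc - 128))"
    else
        echo "exited with code $rc"
    fi
}

describe_status 1      # prints "exited with code 1"
describe_status 137    # prints "killed by signal 9"
```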
Apparently that's not what happens here: it died by some other signal, and the output doesn't tell us which one.
I will start a PR that fixes that shell function to show the signal number, and see if I can reproduce this in CI on my private fork.
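Roughly what I have in mind, as a sketch only. `must_fail_or_be_terminated_sketch` is a hypothetical stand-in written for this comment, not the actual function from the test suite, and it assumes the usual 128+N convention for signal deaths:

```shell
#!/bin/sh
# must_fail_or_be_terminated_sketch is a hypothetical stand-in, not the real
# test_must_fail_or_be_terminated from the flux-core test suite.
must_fail_or_be_terminated_sketch() {
    "$@"
    rc=$?
    if [ "$rc" -eq 143 ] || [ "$rc" -eq 137 ]; then
        return 0    # SIGTERM (15) or SIGKILL (9): allowed
    elif [ "$rc" -eq 0 ]; then
        echo "command succeeded: $*" >&2
        return 1
    elif [ "$rc" -gt 128 ]; then
        # Report the signal number instead of a generic message.
        echo "died by signal $((rc - 128)): $*" >&2
        return 1
    fi
    return 0        # any other non-zero exit code counts as a failure
}

# Stand-in command that kills itself with SIGPIPE.
must_fail_or_be_terminated_sketch sh -c 'kill -s PIPE $$' \
    || echo "rejected, signal reported on stderr"
```

The only behavioral change intended here is the message: the set of accepted statuses stays the same, but an unexpected signal is now named by number.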
This has been reproducing lately, and we now get the extra information about what signal terminated the broker:
expecting success:
test_must_fail_or_be_terminated flux start ${ARGS} -s2 \
--setattr=tbon.endpoint=ipc:///tmp/customflux true
flux-broker: setattr tbon.endpoint: File exists
flux-start: 1 (pid 422592) exited with rc=1
flux-start: 1: PMI_Abort(): fatal bootstrap error
test_must_fail_or_be_terminated: died by signal 13: flux start -Sbroker.rc1_path= -Sbroker.rc3_path= -s2 --setattr=tbon.endpoint=ipc:///tmp/customflux true
not ok 18 - tbon.endpoint cannot be set
where signal 13 is SIGPIPE.
Perhaps we just change this test from test_must_fail_or_be_terminated to ! flux start ... with a note as to why.
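To illustrate why a plain `!` would tolerate this case, here is a small sketch. `die_by_sigpipe` is a hypothetical stand-in for the `flux start` invocation, since the real command needs a flux install to run:

```shell
#!/bin/sh
# die_by_sigpipe is a hypothetical stand-in for a `flux start` invocation
# that gets terminated by SIGPIPE (signal 13).
die_by_sigpipe() {
    sh -c 'kill -s PIPE $$'
}

# A plain `!` only asks whether the status was non-zero, so death by
# SIGPIPE (status 141) passes just like an ordinary exit code of 1.
if ! die_by_sigpipe; then
    echo "ok: command failed or was terminated"
fi
```

The trade-off is that `!` would equally accept any other signal death, such as a crash, so the note in the test should probably say that this is deliberate.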