
not ok - tbon.endpoint cannot be set

Open grondo opened this issue 1 year ago • 1 comments

I've been seeing this failure regularly in CI, mainly in the inception builder for some reason:

2024-10-01T20:41:01.3261595Z flux-broker: setattr tbon.endpoint: File exists
2024-10-01T20:41:01.3262060Z 
2024-10-01T20:41:01.3262573Z flux-start: 0: PMI_Abort(): fatal bootstrap error
2024-10-01T20:41:01.3263033Z 
2024-10-01T20:41:01.3268384Z test_must_fail: died by non-SIGTERM signal: flux start -o,-Sbroker.rc1_path=,-Sbroker.rc3_path= -s2 -o,--setattr=tbon.endpoint=ipc:///tmp/customflux /bin/true
2024-10-01T20:41:01.3269615Z 
2024-10-01T20:41:01.3270381Z not ok 18 - tbon.endpoint cannot be set
2024-10-01T20:41:01.3271468Z not ok 18 - tbon.endpoint cannot be set
2024-10-01T20:41:01.3272240Z #	
2024-10-01T20:41:01.3272847Z #		test_must_fail_or_be_terminated flux start ${ARGS} -s2 \
2024-10-01T20:41:01.3273804Z #			-o,--setattr=tbon.endpoint=ipc:///tmp/customflux /bin/true
2024-10-01T20:41:01.3275015Z #	
2024-10-01T20:41:01.3275224Z

grondo avatar Oct 01 '24 21:10 grondo

Well this test is racy in the sense that

  • each broker sends the abort message then exits with a code of 1
  • flux-start receives the abort message(s) and sends SIGKILL to both brokers

So flux-start should either exit with a code of 1 or 137, and both are allowed by test_must_fail_or_be_terminated.
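For reference, a POSIX shell reports a process killed by signal N as exit status 128 + N, which is where 137 (128 + 9, SIGKILL) comes from. A minimal sketch of decoding a status that way; `describe_status` is a hypothetical helper for illustration, not a function from the flux-core test suite:

```shell
# Hypothetical helper (not part of the flux-core test suite): decode a
# shell exit status into either a plain exit code or a signal number.
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # Statuses above 128 conventionally mean "killed by signal (status - 128)".
        echo "signal $((status - 128))"
    else
        echo "exit $status"
    fi
}

describe_status 1     # broker exited on its own after sending the abort
describe_status 137   # broker was killed with SIGKILL (128 + 9)
```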

Apparently it is exiting some other way, and the test output doesn't tell us which signal it was.

I will start a PR that fixes that shell function to show the signal number and see if I can repro in CI in my private fork.

garlick avatar Oct 01 '24 22:10 garlick

This has been reproducing lately, and we now get the extra information about what signal terminated the broker:

expecting success: 
  	test_must_fail_or_be_terminated flux start ${ARGS} -s2 \
  		--setattr=tbon.endpoint=ipc:///tmp/customflux true
  
  flux-broker: setattr tbon.endpoint: File exists
  flux-start: 1 (pid 422592) exited with rc=1
  flux-start: 1: PMI_Abort(): fatal bootstrap error
  test_must_fail_or_be_terminated: died by signal 13: flux start -Sbroker.rc1_path= -Sbroker.rc3_path= -s2 --setattr=tbon.endpoint=ipc:///tmp/customflux true
  not ok 18 - tbon.endpoint cannot be set

where signal 13 is SIGPIPE.
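The number-to-name mapping can be checked from any POSIX shell with `kill -l`; this is only a quick sanity check, not something the test suite does:

```shell
# Look up the name of signal 13; kill -l prints it without the SIG prefix.
signame=$(kill -l 13)
echo "signal 13 is SIG${signame}"
```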

Perhaps we just change this test from test_must_fail_or_be_terminated to ! flux start ... with a note as to why.
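As a rough illustration of why a plain `!` would be more permissive: the shell's `!` inverts any nonzero status, so it accepts both an ordinary failure and a signal-style status such as 141 (128 + 13). The two functions below are hypothetical stand-ins for the real `flux start` invocation, which is not run here:

```shell
# Hypothetical stand-ins for the real command under test.
fails_normally() { return 1; }     # ordinary failure, like a broker exiting with rc=1
dies_by_signal() { return 141; }   # status a shell reports for death by SIGPIPE

# "!" turns any nonzero status into success, regardless of its cause.
if ! fails_normally; then echo "exit 1: accepted"; fi
if ! dies_by_signal; then echo "status 141: accepted"; fi
```

The trade-off is that `!` would also accept a genuine crash such as a segfault, which is presumably why a note explaining the choice would be worth adding.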

grondo avatar Jan 18 '25 23:01 grondo