WIP: relieve scheduler of the need to report resource status
As discussed in #5776, this adds
- resource tracking in the job manager and a new RPC to request the current set of allocated resources
- expansion of the payload of `resource.status` to include the keys found in `sched.resource-status` (see the sketch at the end of this description)
- update users (top, python) to call `resource.status` instead of `sched.resource-status`

Marking as a WIP pending
- [x] see what the throughput impact is
- [x] ~decide what to do with the `alloc-check` plugin since this effectively does the same thing~ Edit: second iteration of the design no longer does this
- [ ] decide if it was a good idea to combine the two RPCs or whether we should provide two, or perhaps take a step back and design a new one
- [ ] should we leave the scheduler RPC handlers in place and provide some way to use them in test? Edit: keep for back compat, but add tests so the code is not left uncovered
- [ ] update flux-resource(1), which mentions "scheduler view of resources" a lot
- [ ] `flux resource status` currently calls the `resource.status` RPC twice
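For reference, a minimal sketch of what a caller of the combined RPC could look like from Python. The payload key names used here ("all", "allocated") are assumptions based on the discussion below, not a finalized interface:

```python
# Sketch only: fetch the expanded resource.status payload, which in this
# design also carries the scheduler-derived resource sets that used to come
# from sched.resource-status. Key names are assumptions, not a contract.
import flux

h = flux.Flux()
status = h.rpc("resource.status").get()

print("all R:      ", status.get("all"))
print("allocated R:", status.get("allocated"))
```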
Hmm, getting an ASAN failure here:
2024-03-15T15:21:32.2114399Z Making check in common
2024-03-15T15:21:32.2245352Z AddressSanitizer:DEADLYSIGNAL
2024-03-15T15:21:32.2246109Z =================================================================
2024-03-15T15:21:32.2249768Z ==37186==ERROR: AddressSanitizer: SEGV on unknown address 0x63abbfc31750 (pc 0x7ab1a6bb7cb7 bp 0x000000000000 sp 0x7ffe4ec09ea0 T0)
2024-03-15T15:21:32.2251177Z ==37186==The signal is caused by a READ memory access.
2024-03-15T15:21:32.2254299Z AddressSanitizer:DEADLYSIGNAL
2024-03-15T15:21:32.2254926Z AddressSanitizer: nested bug in the same thread, aborting.
2024-03-15T15:21:32.2258036Z make[1]: *** [Makefile:1027: check-recursive] Error 1
2024-03-15T15:21:32.2308431Z make: *** [Makefile:514: check-recursive] Error 1
and failing sched tests
2024-03-15T15:30:30.0668719Z 18/92 Test #20: t1012-find-status.t ...................***Failed 5.82 sec
2024-03-15T15:30:30.0744267Z expecting success:
2024-03-15T15:30:30.0744721Z flux resource list > resource.list.out &&
2024-03-15T15:30:30.0746652Z validate_list_row resource.list.out 2 0 0 0 &&
2024-03-15T15:30:30.0747487Z validate_list_row resource.list.out 3 4 176 16 &&
2024-03-15T15:30:30.0748166Z validate_list_row resource.list.out 4 1 44 4
2024-03-15T15:30:30.0748609Z
2024-03-15T15:30:30.0749421Z not ok 14 - find/status: flux resource list works
2024-03-15T15:30:30.0785797Z 27/92 Test #29: t1021-qmanager-nodex.t ................***Failed 6.44 sec
2024-03-15T15:30:30.0801907Z expecting success:
2024-03-15T15:30:30.0802300Z cat >status.expected1 <<-EOF &&
2024-03-15T15:30:30.0802597Z 2 88 8
2024-03-15T15:30:30.0802784Z 2 88 8
2024-03-15T15:30:30.0803047Z EOF
2024-03-15T15:30:30.0803284Z flux resource list > resources.out1 &&
2024-03-15T15:30:30.0803756Z cat resources.out1 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0804304Z | awk "{ print \$2,\$3,\$4 }" > status.out1 &&
2024-03-15T15:30:30.0804878Z test_cmp status.expected1 status.out1
2024-03-15T15:30:30.0805121Z
2024-03-15T15:30:30.0805347Z --- status.expected1 2024-03-15 15:29:16.098812556 +0000
2024-03-15T15:30:30.0805809Z +++ status.out1 2024-03-15 15:29:16.286811847 +0000
2024-03-15T15:30:30.0806168Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0806398Z -2 88 8
2024-03-15T15:30:30.0806601Z -2 88 8
2024-03-15T15:30:30.0806781Z +4 88 16
2024-03-15T15:30:30.0806973Z +2 88 0
2024-03-15T15:30:30.0807303Z not ok 6 - qmanager-nodex: free/alloc node count (hinodex)
2024-03-15T15:30:30.0832122Z expecting success:
2024-03-15T15:30:30.0832615Z cat >status.expected2 <<-EOF &&
2024-03-15T15:30:30.0833130Z 0 0 0
2024-03-15T15:30:30.0833446Z 4 176 16
2024-03-15T15:30:30.0833770Z EOF
2024-03-15T15:30:30.0834161Z flux resource list > resources.out2 &&
2024-03-15T15:30:30.0834930Z cat resources.out2 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0835613Z | awk "{ print \$2,\$3,\$4 }" > status.out2 &&
2024-03-15T15:30:30.0836241Z test_cmp status.expected2 status.out2
2024-03-15T15:30:30.0836615Z
2024-03-15T15:30:30.0836947Z --- status.expected2 2024-03-15 15:29:16.670810399 +0000
2024-03-15T15:30:30.0837648Z +++ status.out2 2024-03-15 15:29:16.830809797 +0000
2024-03-15T15:30:30.0838235Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0838607Z -0 0 0
2024-03-15T15:30:30.0838961Z -4 176 16
2024-03-15T15:30:30.0839303Z +4 0 16
2024-03-15T15:30:30.0839643Z +4 176 0
2024-03-15T15:30:30.0840206Z not ok 9 - qmanager-nodex: free/alloc node count 2 (hinodex)
2024-03-15T15:30:30.0865149Z expecting success:
2024-03-15T15:30:30.0865545Z cat >status.expected3 <<-EOF &&
2024-03-15T15:30:30.0865849Z 2 88 8
2024-03-15T15:30:30.0866036Z 2 88 8
2024-03-15T15:30:30.0866301Z EOF
2024-03-15T15:30:30.0866548Z flux resource list > resources.out3 &&
2024-03-15T15:30:30.0867207Z cat resources.out3 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0867639Z | awk "{ print \$2,\$3,\$4 }" > status.out3 &&
2024-03-15T15:30:30.0868319Z test_cmp status.expected3 status.out3
2024-03-15T15:30:30.0868701Z
2024-03-15T15:30:30.0869053Z --- status.expected3 2024-03-15 15:29:17.794806161 +0000
2024-03-15T15:30:30.0869520Z +++ status.out3 2024-03-15 15:29:17.894805784 +0000
2024-03-15T15:30:30.0869875Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0870177Z -2 88 8
2024-03-15T15:30:30.0870528Z -2 88 8
2024-03-15T15:30:30.0870721Z +4 88 16
2024-03-15T15:30:30.0870906Z +2 88 0
2024-03-15T15:30:30.0871255Z not ok 16 - qmanager-nodex: free/alloc node count (lonodex)
2024-03-15T15:30:30.0880480Z expecting success:
2024-03-15T15:30:30.0880794Z cat >status.expected4 <<-EOF &&
2024-03-15T15:30:30.0881078Z 0 0 0
2024-03-15T15:30:30.0881271Z 4 176 16
2024-03-15T15:30:30.0881467Z EOF
2024-03-15T15:30:30.0881692Z flux resource list > resources.out4 &&
2024-03-15T15:30:30.0882111Z cat resources.out4 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0882511Z | awk "{ print \$2,\$3,\$4 }" > status.out4 &&
2024-03-15T15:30:30.0882887Z test_cmp status.expected4 status.out4
2024-03-15T15:30:30.0883127Z
2024-03-15T15:30:30.0883381Z --- status.expected4 2024-03-15 15:29:18.218804562 +0000
2024-03-15T15:30:30.0883894Z +++ status.out4 2024-03-15 15:29:18.442803717 +0000
2024-03-15T15:30:30.0884254Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0884514Z -0 0 0
2024-03-15T15:30:30.0884909Z -4 176 16
2024-03-15T15:30:30.0885248Z +4 0 16
2024-03-15T15:30:30.0885587Z +4 176 0
2024-03-15T15:30:30.0886227Z not ok 19 - qmanager-nodex: free/alloc node count 2 (lonodex)
Another thing to consider here is backwards compatibility (perhaps a good reason to keep the sched.resource-status RPC for a bit). E.g. if a newer instance of Flux is running under an older version, tools like flux jobs, flux top, and flux pstree would get ENOSYS when attempting to get the subinstance resource information.
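For the reverse direction (a newer tool talking to an older instance whose resource.status response lacks the scheduler-derived keys), a client-side fallback would also be cheap. A rough sketch, not part of this PR, with key names assumed from this discussion:

```python
# Hypothetical fallback for tools: if the resource.status payload from an
# older instance lacks the new keys, fetch the scheduler view the old way.
import flux

def get_resource_status(h):
    status = h.rpc("resource.status").get()
    if "allocated" not in status:          # older broker: pre-#5776 payload
        status.update(h.rpc("sched.resource-status").get())
    return status

print(get_resource_status(flux.Flux()))
```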
I'm looking into the ASan errors, which are reproducible but only in the CI environment. In this case, libasan is getting a SEGV while running make. Here's more info:
runner@fv-az523-355:/usr/src/src$ ASAN_OPTIONS=$ASAN_OPTIONS,verbosity=1 LD_PRELOAD=/usr/lib64/libasan.so.6 make
==134==AddressSanitizer: failed to intercept '__isoc99_printf'
==134==AddressSanitizer: failed to intercept '__isoc99_sprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_snprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_fprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vsprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vsnprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vfprintf'
==134==AddressSanitizer: failed to intercept 'xdr_quad_t'
==134==AddressSanitizer: failed to intercept 'xdr_u_quad_t'
==134==AddressSanitizer: failed to intercept 'xdr_destroy'
==134==AddressSanitizer: libc interceptors initialized
|| `[0x10007fff8000, 0x7fffffffffff]` || HighMem ||
|| `[0x02008fff7000, 0x10007fff7fff]` || HighShadow ||
|| `[0x00008fff7000, 0x02008fff6fff]` || ShadowGap ||
|| `[0x00007fff8000, 0x00008fff6fff]` || LowShadow ||
|| `[0x000000000000, 0x00007fff7fff]` || LowMem ||
MemToShadow(shadow): 0x00008fff7000 0x000091ff6dff 0x004091ff6e00 0x02008fff6fff
redzone=16
max_redzone=2048
quarantine_size_mb=256M
thread_local_quarantine_size_kb=1024K
malloc_context_size=30
SHADOW_SCALE: 3
SHADOW_GRANULARITY: 8
SHADOW_OFFSET: 0x7fff8000
==134==Installed the sigaction for signal 11
==134==Installed the sigaction for signal 7
==134==Installed the sigaction for signal 8
==134==Deactivating ASan
AddressSanitizer:DEADLYSIGNAL
=================================================================
==134==ERROR: AddressSanitizer: SEGV on unknown address 0x60cf8ad2c9f8 (pc 0x71fac17b5cb7 bp 0x000000000000 sp 0x7ffd602751c0 T0)
==134==The signal is caused by a READ memory access.
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.
I have no idea why this just started happening, nor what to try next. Maybe I will try moving the ASan build to Fedora 36 and see if that addresses this issue.
Looking at the first sched test failure, the test fakes resources consisting of 4 nodes, each with 44 cores and 4 gpus. In the first failure, it allocates two nodes exclusively and expects flux resource list to show
- free: 2 nodes, 88 cores, 8 gpus
- alloc: 2 nodes, 88 cores, 8 gpus
But what it actually shows is
- free: 4 nodes, 88 cores, 16 gpus
- alloc: 2 nodes, 88 cores, 0 gpus
I think what may be going on is that librlist doesn't handle gpus for the things I'm asking it to do.
For completeness, the new objects returned are:

all

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0-3",
        "children": {
          "core": "0-43",
          "gpu": "0-3"
        }
      }
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": [
      "sierra[3682,3179,3683,3178]"
    ]
  }
}

alloc

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0,2",
        "children": {
          "core": "0-43"
        }
      }
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": [
      "sierra[3682-3683]"
    ]
  }
}
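Reading the two objects side by side explains the numbers in the failing test: the alloc fragment lists core children for ranks 0 and 2 but no gpu children at all, so nothing ever marks gpus allocated. A quick standalone check (plain Python, no flux libraries) reproduces the summary the test saw:

```python
# Subtract the "alloc" fragment above from "all" per rank, then summarize
# free/alloc the way flux resource list does. Uses only the data shown above.

def expand(ids):
    """Expand an idset string like "0-3,5" into a set of ints."""
    out = set()
    for part in ids.split(","):
        lo, _, hi = part.partition("-")
        out.update(range(int(lo), int(hi or lo) + 1))
    return out

all_ranks = expand("0-3")
alloc_ranks = expand("0,2")
cores = expand("0-43")   # per-rank cores, present in both objects
gpus = expand("0-3")     # per-rank gpus, present only in "all"

free = {"nnodes": 0, "ncores": 0, "ngpus": 0}
alloc = {"nnodes": 0, "ncores": 0, "ngpus": 0}
for rank in sorted(all_ranks):
    a_cores = cores if rank in alloc_ranks else set()
    a_gpus = set()       # the alloc object never lists gpu children
    f_cores = cores - a_cores
    f_gpus = gpus - a_gpus
    alloc["ncores"] += len(a_cores)
    alloc["ngpus"] += len(a_gpus)
    free["ncores"] += len(f_cores)
    free["ngpus"] += len(f_gpus)
    alloc["nnodes"] += bool(a_cores or a_gpus)
    free["nnodes"] += bool(f_cores or f_gpus)

print("free: ", free)    # {'nnodes': 4, 'ncores': 88, 'ngpus': 16}
print("alloc:", alloc)   # {'nnodes': 2, 'ncores': 88, 'ngpus': 0}
```

That reproduces the unexpected "4 88 16" / "2 88 0" rows rather than the expected 2 / 88 / 8 split, consistent with the gpus never being marked allocated.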
Uhoh, that sounds like an unfortunate bug :-(
FWIW I was able to fix the sched test failures by returning the allocated resource set from the job-manager directly in resource.status, instead of marking those resources allocated in the full rlist and then extracting them again with rlist_copy_allocated().
Some other tests are failing now in flux-core that weren't before so I need to sort that out then post an update here.
I wonder how much work it would be to properly support gpus in librlist? Perhaps that would then allow sched-simple to allocate them (currently it cannot).
My other worry here is that librlist starts to get pretty slow when there are a lot of resources. I worry this will have a significant impact on throughput (I know you were already planning on testing that, but just a heads up). For example, in the testing in flux-framework/flux-sched#1137, we saw sched.resource-status for sched-simple taking >2s for 16K nodes.
Ah: the allocated set from job-manager.resource-status does not include properties, so tests involving queues are failing in t2350-resource-list.t and t2801-top-cmd.t. I'm not sure whether this is because sched-simple isn't returning them in the alloc response, or because they are disappearing when passing through rlist_append(). In any case, they were included before because the allocated set was being carved out of the original R with rlist_copy_allocated().
Ah, probably dropped in rlist_append and that seems like a bug :-(
I wonder how much work it would be to properly support gpus in librlist? Perhaps that would then allow sched-simple to allocate them (currently it cannot).
Yeah that would allow the original technique I was using to work. Supporting gpus in sched-simple seems like a pretty big bonus as well!
My other worry here is that librlist starts to get pretty slow when there are a lot of resources.
I was hoping that the rlist_append() / rlist_diff() functions I am using in job-manager would not be too bad, but after I get this working, I'll probe that with a large resource set.
Hmm, actually rlist_append() seems to pass properties through OK, but it doesn't look like sched-simple is returning queue properties in allocated Rs. Is it possible that was by design?
but it doesn't look like sched-simple is returning queue properties in allocated Rs. Is it possible that was by design?
I'm not sure, but I did think properties were passed through an alloc, e.g. on fluke the queue properties are available in subinstances (whether this is right or not is questionable)
grondo@fluke108:~$ flux alloc -N1 flux resource list
STATE PROPERTIES NNODES NCORES NGPUS NODELIST
free batch 1 4 0 fluke95
allocated 0 0 0
down 0 0 0
[detached: session exiting]
Hmm, maybe something was wrong with my test then. Thanks for that!
Edit: oh, that would be fluxion though. Maybe sched-simple behaves differently.
I added a workaround for now, in which properties are re-added to the allocated set by the resource module if missing.
FWIW throughput.py is stable at about 21 jobs/sec with 16K nodes/32 cores each, on master and on this branch with sched-simple. :shrug:
Oof. I guess sched-simple is so slow at that scale it doesn't matter. I'd like to see results for something smaller as well just in case though.
Here are some more numbers. It does look like there is a negative effect.
throughput.py -n 1000 (each run in a new instance), nodes with 32 cores / 4 gpus each. Throughput in jobs/sec, three runs per node count:

| nnodes | fluxion master | simple master | fluxion #5776 | simple #5776 |
|-------:|---------------:|--------------:|--------------:|-------------:|
| 64     | 595.4 | 497.0 | 551.8 | 505.9 |
|        | 625.9 | 523.2 | 570.5 | 524.1 |
|        | 600.4 | 508.8 | 527.7 | 497.4 |
| 128    | 441.4 | 522.4 | 429.6 | 497.6 |
|        | 468.7 | 495.4 | 440.9 | 485.9 |
|        | 484.6 | 505.2 | 426.3 | 475.3 |
| 256    | 251.7 | 501.7 | 242.0 | 469.6 |
|        | 254.7 | 499.7 | 238.6 | 481.4 |
|        | 249.9 | 491.5 | 233.3 | 454.2 |
| 512    | 142.5 | 463.2 | 138.4 | 456.5 |
|        | 143.6 | 482.7 | 138.2 | 456.8 |
|        | 146.2 | 465.7 | 137.3 | 460.5 |
| 1024   | 69.0  | 443.1 | 67.8  | 431.7 |
|        | 68.6  | 472.7 | 68.6  | 428.7 |
|        | 68.6  | 466.5 | 68.4  | 449.3 |
| 2048   | 33.1  | 434.5 | 32.6  | 428.5 |
|        | 32.8  | 458.5 | 32.8  | 428.3 |
|        | 32.8  | 447.5 | 33.5  | 435.2 |
| 4096   | 16.9  | 301.5 | 16.5  | 285.4 |
|        | 16.9  | 293.1 | 16.5  | 275.2 |
|        | 16.8  | 288.0 | 16.6  | 283.2 |
| 8192   | 8.3   | 104.1 | 8.2   | 103.3 |
|        | 8.3   | 108.2 | 8.4   | 97.7  |
|        | 8.3   | 108.0 | 8.3   | 103.3 |
| 16384  | 4.1   | 24.2  | 4.0   | 24.0  |
|        | 4.1   | 23.9  | 4.0   | 23.5  |
|        | 4.0   | 24.3  | 4.1   | 24.0  |
Hmm, maybe instead of keeping a "running R" in the job manager, it would be better (less impact on throughput) to simply gather the R's upon request and either combine them in the job manager or even ship them to the resource module and do it there.
The impact is not as bad as I'd feared! The rlist implementation is kind of dumb and not optimized at all, so we could look into improving it. This would help a bit with #5819 too.
However, since every flux resource invocation will now interrupt the job manager, moving the resource-list service to the resource module (or another module), like we did for the job-list service, is not a bad idea IMO.
I'll also note that the slow sched.resource-status RPC is an immediate concern on existing clusters, so a solution that works within the next couple of releases would be better than spending more time implementing the "perfect" solution. (However, it would probably be bad to break compatibility multiple times.)
Sorry, all that may not have been too helpful. Just throwing some thoughts out there.
In case it wasn't clear, this PR already moves the contact point for the tools to the resource module, but the resource module doesn't do a whole lot except prepare the response. The RPC in the job manager that returns the allocated set is used exclusively by the resource module. So my comment above was about shifting some of the work from the job manager to the resource module to reduce the penalty on every allocation.
So my comment above was about shifting some of the work from the job manager to the resource module to reduce the penalty on every allocation.
That makes sense, sorry I misunderstood. So on receipt of a resource.status request, the resource module then requests the allocated set from the job manager, but in the current implementation the job-manager has already prepared the allocated resource set, so all it has to do is return it? This is a pretty clever implementation IMO.
Shifting some of the work to the resource module seems like a good idea to me. Your comment about just shipping to the resource module makes a lot of sense now that I better understand.
Just pushed the change discussed above. I started testing throughput and then realized this doesn't touch the critical path at all so there is little point.
The job manager RPC now just walks all active jobs, creating a union of all allocated Rs, which it returns. It's kind of dumb and simple, but if it turns out to be slow in some scenario (for example, if there are a huge number of active jobs), there are obvious ways to improve responsiveness, such as caching the result until the next alloc/free or keeping all jobs with resources in a list.
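In case a picture helps, here is a rough sketch of that walk-and-union shape, plus the caching idea mentioned above as a possible optimization. The real handler is C in the job manager; union_R() and active_jobs are hypothetical stand-ins for an rlist_append()-style merge and the job manager's active-job table:

```python
# Illustrative sketch only, not the job manager implementation.
# union_R() and active_jobs() are hypothetical stand-ins.

class AllocatedSetCache:
    """Union of R over all active jobs, recomputed lazily."""

    def __init__(self, active_jobs):
        self.active_jobs = active_jobs   # callable returning active jobs
        self._cached = None

    def invalidate(self):
        """Would be called on every alloc or free event."""
        self._cached = None

    def get(self):
        if self._cached is None:
            result = None
            for job in self.active_jobs():
                if job.R is None:        # job holds no resources yet
                    continue
                result = job.R if result is None else union_R(result, job.R)
            self._cached = result
        return self._cached
```

The current RPC just does the walk on every request; the caching wrapper is the "obvious improvement", not what the PR does today.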
I've updated the description with a few todos, assuming this approach is acceptable.
I'm leaning towards splitting the resource.status RPC into two RPCs again. Since the job manager query is now somewhat costly (depending on the number of active jobs), it probably makes sense to avoid it when only the previous resource.status fields are really wanted.
FWIW, I really like the approach.
Hmm, I'm seeing this test fail occasionally in CI. Just going to restart for now.
2024-03-24T20:01:08.8286402Z expecting success: flux job attach fMT36St
2024-03-24T20:01:08.8286847Z flux-job: task(s) exited with exit code 1
2024-03-24T20:01:08.8287722Z t2492: Re-launching test script under flux-start using /usr/src/src/cmd/flux
2024-03-24T20:01:08.8288202Z t2492:
2024-03-24T20:01:08.8288460Z t2492: Testing handling of lost shell rank 3:
2024-03-24T20:01:08.8289067Z t2492: Sumbmitted job ffj4EwR. Waiting for shell rank 3 to be lost
2024-03-24T20:01:08.8291062Z [Mar24 20:00] header version=1 encoding={"stdout":"UTF-8","stderr":"UTF-8"} count={"stdout":4,"stderr":4} options={}
2024-03-24T20:01:08.8292925Z [ +0.166865] log rank=0 level=4 message="shell rank 3 (on fv-az701-331): Killed" component="exception" file="exception.c" line=49
2024-03-24T20:01:08.8294014Z t2492: Sending SIGINT to ffj4EwR. Job should now exit
2024-03-24T20:01:08.8294517Z 0.000s: job.submit {"userid":1001,"urgency":16,"flags":0,"version":1}
2024-03-24T20:01:08.8294951Z 0.011s: job.validate
2024-03-24T20:01:08.8295199Z 0.022s: job.depend
2024-03-24T20:01:08.8295456Z 0.023s: job.priority {"priority":16}
2024-03-24T20:01:08.8296020Z 0.024s: job.alloc {"annotations":{"sched":{"resource_summary":"rank[0-3]/core0"}}}
2024-03-24T20:01:08.8296504Z 0.043s: job.start
2024-03-24T20:01:08.8296785Z flux-job: task(s) Killed
2024-03-24T20:01:08.8297065Z 0.429s: job.finish {"status":9}
2024-03-24T20:01:08.8297351Z 0.025s: exec.init
2024-03-24T20:01:08.8297580Z 0.026s: exec.starting
2024-03-24T20:01:08.8298076Z 0.100s: exec.shell.init {"service":"1001-shell-ffj4EwR","leader-rank":0,"size":4}
2024-03-24T20:01:08.8298847Z 0.116s: exec.shell.start {"taskmap":{"version":1,"map":[[0,4,1,1]]}}
2024-03-24T20:01:08.8299916Z 0.410s: exec.shell.task-exit {"localid":0,"rank":0,"state":"Exited","pid":242661,"wait_status":0,"signaled":0,"exitcode":0}
2024-03-24T20:01:08.8300555Z 0.428s: exec.complete {"status":9}
2024-03-24T20:01:08.8300859Z 0.428s: exec.done
2024-03-24T20:01:08.8301395Z 0.258s: flux-shell[0]: WARN: exception: exception.c:49: shell rank 3 (on fv-az701-331): Killed
2024-03-24T20:01:08.8301946Z waiting on barrier for 4 tasks
2024-03-24T20:01:08.8302234Z exited barrier
2024-03-24T20:01:08.8302527Z t2492: Job exited with rc=137 (expecting 137 (128+9))
2024-03-24T20:01:08.8302895Z t2492:
2024-03-24T20:01:08.8303158Z t2492: Testing handling of lost shell rank 1:
2024-03-24T20:01:08.8303627Z t2492: Sumbmitted job fvYw4PR. Waiting for shell rank 1 to be lost
2024-03-24T20:01:08.8304752Z [Mar24 20:00] header version=1 encoding={"stdout":"UTF-8","stderr":"UTF-8"} count={"stdout":4,"stderr":4} options={}
2024-03-24T20:01:08.8306470Z [ +0.303951] log rank=0 level=4 message="shell rank 1 (on fv-az701-331): Killed" component="exception" file="exception.c" line=49
2024-03-24T20:01:08.8307555Z t2492: Sending SIGINT to fvYw4PR. Job should now exit
2024-03-24T20:01:08.8308049Z 0.000s: job.submit {"userid":1001,"urgency":16,"flags":0,"version":1}
2024-03-24T20:01:08.8308476Z 0.012s: job.validate
2024-03-24T20:01:08.8308723Z 0.023s: job.depend
2024-03-24T20:01:08.8308982Z 0.023s: job.priority {"priority":16}
2024-03-24T20:01:08.8309540Z 0.030s: job.alloc {"annotations":{"sched":{"resource_summary":"rank[0-3]/core0"}}}
2024-03-24T20:01:08.8310014Z 0.052s: job.start
2024-03-24T20:01:08.8310328Z flux-job: task(s) exited with exit code 130
2024-03-24T20:01:08.8310686Z 0.651s: job.finish {"status":33280}
2024-03-24T20:01:08.8310985Z 0.031s: exec.init
2024-03-24T20:01:08.8311213Z 0.032s: exec.starting
2024-03-24T20:01:08.8311706Z 0.140s: exec.shell.init {"service":"1001-shell-fvYw4PR","leader-rank":0,"size":4}
2024-03-24T20:01:08.8312313Z 0.153s: exec.shell.start {"taskmap":{"version":1,"map":[[0,4,1,1]]}}
2024-03-24T20:01:08.8313203Z 0.519s: exec.shell.task-exit {"localid":0,"rank":2,"state":"Exited","pid":243801,"wait_status":2,"signaled":2,"exitcode":130}
2024-03-24T20:01:08.8313848Z 0.651s: exec.complete {"status":33280}
2024-03-24T20:01:08.8314161Z 0.651s: exec.done
2024-03-24T20:01:08.8314799Z 0.398s: flux-shell[0]: WARN: exception: exception.c:49: shell rank 1 (on fv-az701-331): Killed
2024-03-24T20:01:08.8315458Z Traceback (most recent call last):
2024-03-24T20:01:08.8316030Z File "/tmp/flux-ancNrP/jobtmp-2-fvYw4PR/critical.py", line 4, in <module>
2024-03-24T20:01:08.8316504Z import flux
2024-03-24T20:01:08.8316912Z File "/usr/src/src/bindings/python/flux/__init__.py", line 14, in <module>
2024-03-24T20:01:08.8317409Z import flux.core.handle
2024-03-24T20:01:08.8317886Z File "/usr/src/src/bindings/python/flux/core/handle.py", line 19, in <module>
2024-03-24T20:01:08.8318398Z from flux.future import Future
2024-03-24T20:01:08.8318876Z File "/usr/src/src/bindings/python/flux/future.py", line 12, in <module>
2024-03-24T20:01:08.8319353Z import json
2024-03-24T20:01:08.8319709Z File "/usr/lib/python3.11/json/__init__.py", line 106, in <module>
2024-03-24T20:01:08.8320211Z from .decoder import JSONDecoder, JSONDecodeError
2024-03-24T20:01:08.8320727Z File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
2024-03-24T20:01:08.8321309Z File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
2024-03-24T20:01:08.8322138Z File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
2024-03-24T20:01:08.8322733Z File "<frozen importlib._bootstrap_external>", line 936, in exec_module
2024-03-24T20:01:08.8323325Z File "<frozen importlib._bootstrap_external>", line 1032, in get_code
2024-03-24T20:01:08.8323887Z File "<frozen importlib._bootstrap_external>", line 1130, in get_data
2024-03-24T20:01:08.8324328Z KeyboardInterrupt
2024-03-24T20:01:08.8324596Z Traceback (most recent call last):
2024-03-24T20:01:08.8325145Z File "/tmp/flux-ancNrP/jobtmp-0-fvYw4PR/critical.py", line 4, in <module>
2024-03-24T20:01:08.8325619Z import flux
2024-03-24T20:01:08.8326034Z File "/usr/src/src/bindings/python/flux/__init__.py", line 14, in <module>
2024-03-24T20:01:08.8326534Z import flux.core.handle
2024-03-24T20:01:08.8327024Z File "/usr/src/src/bindings/python/flux/core/handle.py", line 19, in <module>
2024-03-24T20:01:08.8327558Z from flux.future import Future
2024-03-24T20:01:08.8328040Z File "/usr/src/src/bindings/python/flux/future.py", line 16, in <module>
2024-03-24T20:01:08.8328618Z from flux.util import check_future_error, interruptible
2024-03-24T20:01:08.8329170Z File "/usr/src/src/bindings/python/flux/util.py", line 15, in <module>
2024-03-24T20:01:08.8329629Z import glob
2024-03-24T20:01:08.8329987Z File "<frozen importlib._bootstrap>", line 1175, in _find_and_load
2024-03-24T20:01:08.8330514Z File "<frozen importlib._bootstrap>", line 171, in __enter__
2024-03-24T20:01:08.8331000Z File "<frozen importlib._bootstrap>", line 100, in acquire
2024-03-24T20:01:08.8331395Z KeyboardInterrupt
2024-03-24T20:01:08.8331710Z t2492: Job exited with rc=130 (expecting 137 (128+9))
2024-03-24T20:01:08.8332092Z t2492: Unexpected job exit code 130
2024-03-24T20:01:08.8332744Z Mar 24 20:00:30.720993 broker.err[0]: rc2.0: /usr/src/t/issues/t2492-shell-lost.sh Exited (rc=1) 1.8s
2024-03-24T20:01:08.8333392Z flux-start: 0 (pid 236602) exited with rc=1
2024-03-24T20:01:08.8334058Z Mar 24 20:00:32.653770 broker.err[0]: rc2.0: /usr/src/t/issues/t2492-shell-lost.sh Exited (rc=1) 5.6s
2024-03-24T20:01:08.8334651Z not ok 29 - t2492-shell-lost
I've been seeing that one regularly in CI, but haven't been able to reproduce locally. It appears the Python test script in the shell-lost test is getting sent SIGINT while it is still importing Python modules, which doesn't make sense because there's a barrier in there :confused:
One thing we can do is to move the try: to the beginning of the script.
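The shape of that change would be roughly the following (a hypothetical sketch, not the actual critical.py; the point is just that the interrupt handling has to be installed before the slow imports the traceback shows being interrupted):

```python
# Hypothetical sketch of "move the try: to the beginning of the script".
import sys

try:
    import flux              # the CI traceback shows SIGINT landing here,
    import flux.job          # so the imports need to be inside the try
    # ... the rest of the existing script body ...
except KeyboardInterrupt:
    sys.exit(128 + 2)        # placeholder: keep whatever the script already does
```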
which doesn't make sense because there's a barrier in there 😕
D'oh, the libflux flux_barrier() returns a future, but the Python handle barrier method returns nothing. So this is not really a barrier at all...
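If the Python method were changed to hand back the underlying future, callers could actually wait on it. A sketch of the intent, mirroring the C flux_barrier(h, name, nprocs) signature; the Python signature and the barrier name here are assumptions, not the final API:

```python
# Sketch: what a real barrier would look like from Python if the handle's
# barrier method returned the future instead of discarding it.
import flux

h = flux.Flux()
f = h.barrier("example-barrier", 4)  # name and nprocs are placeholders
f.get()                              # block until all 4 participants arrive
```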
I think this is more or less ready for a review.
(LMK if you want me to squash those)
Feel free to squash those!