
WIP: relieve scheduler of the need to report resource status

Open garlick opened this issue 1 year ago • 25 comments

As discussed in #5776, this adds

  • resource tracking in the job manager and a new RPC to request the current set of allocated resources
  • expansion of the payload of resource.status to include the keys found in sched.resource-status
  • update users (top, python) to call resource.status instead of sched.resource-status (a hedged client-side sketch follows below)
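
For a concrete, hedged picture of what the third item means for callers, a Python-bindings client might do something like the sketch below; the all/allocated/down key names are assumed to carry over from sched.resource-status, since the final shape of the expanded payload is still being worked out.

# Hedged sketch: fetch the expanded resource.status payload via the Python
# bindings.  The "all"/"allocated"/"down" keys are assumed to mirror what
# sched.resource-status returned; the real expanded payload may differ.
import json
import flux

h = flux.Flux()
status = h.rpc("resource.status").get()
for key in ("all", "allocated", "down"):
    if key in status:
        print(key, json.dumps(status[key])[:72])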

Marking as a WIP pending

  • [x] see what the throughput impact is
  • [x] ~decide what to do with the alloc-check plugin since this effectively does the same thing~ Edit: second iteration of the design no longer does this
  • [ ] decide if it was a good idea to combine the two RPCs or whether we should provide two, or perhaps take a step back and design a new one.
  • [ ] should we leave the scheduler RPC handlers in place and provide some way to use them in test? Edit: keep for back compat, but add tests so the code is not left uncovered
  • [ ] update flux-resource(1) which mentions "scheduler view of resources" a lot
  • flux resource status currently calls the resource.status RPC twice

garlick avatar Mar 15 '24 14:03 garlick

Hmm, getting an ASAN failure here:

2024-03-15T15:21:32.2114399Z Making check in common
2024-03-15T15:21:32.2245352Z AddressSanitizer:DEADLYSIGNAL
2024-03-15T15:21:32.2246109Z =================================================================
2024-03-15T15:21:32.2249768Z ==37186==ERROR: AddressSanitizer: SEGV on unknown address 0x63abbfc31750 (pc 0x7ab1a6bb7cb7 bp 0x000000000000 sp 0x7ffe4ec09ea0 T0)
2024-03-15T15:21:32.2251177Z ==37186==The signal is caused by a READ memory access.
2024-03-15T15:21:32.2254299Z AddressSanitizer:DEADLYSIGNAL
2024-03-15T15:21:32.2254926Z AddressSanitizer: nested bug in the same thread, aborting.
2024-03-15T15:21:32.2258036Z make[1]: *** [Makefile:1027: check-recursive] Error 1
2024-03-15T15:21:32.2308431Z make: *** [Makefile:514: check-recursive] Error 1

and failing sched tests

2024-03-15T15:30:30.0668719Z 18/92 Test #20: t1012-find-status.t ...................***Failed    5.82 sec

2024-03-15T15:30:30.0744267Z expecting success: 
2024-03-15T15:30:30.0744721Z     flux resource list > resource.list.out &&
2024-03-15T15:30:30.0746652Z     validate_list_row resource.list.out 2 0 0 0 &&
2024-03-15T15:30:30.0747487Z     validate_list_row resource.list.out 3 4 176 16 &&
2024-03-15T15:30:30.0748166Z     validate_list_row resource.list.out 4 1 44 4
2024-03-15T15:30:30.0748609Z 
2024-03-15T15:30:30.0749421Z not ok 14 - find/status: flux resource list works

2024-03-15T15:30:30.0785797Z 27/92 Test #29: t1021-qmanager-nodex.t ................***Failed    6.44 sec

2024-03-15T15:30:30.0801907Z expecting success: 
2024-03-15T15:30:30.0802300Z     cat >status.expected1 <<-EOF &&
2024-03-15T15:30:30.0802597Z 	2 88 8
2024-03-15T15:30:30.0802784Z 	2 88 8
2024-03-15T15:30:30.0803047Z EOF
2024-03-15T15:30:30.0803284Z     flux resource list > resources.out1 &&
2024-03-15T15:30:30.0803756Z     cat resources.out1 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0804304Z 	| awk "{ print \$2,\$3,\$4 }" > status.out1 &&
2024-03-15T15:30:30.0804878Z     test_cmp status.expected1 status.out1
2024-03-15T15:30:30.0805121Z 
2024-03-15T15:30:30.0805347Z --- status.expected1	2024-03-15 15:29:16.098812556 +0000
2024-03-15T15:30:30.0805809Z +++ status.out1	2024-03-15 15:29:16.286811847 +0000
2024-03-15T15:30:30.0806168Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0806398Z -2 88 8
2024-03-15T15:30:30.0806601Z -2 88 8
2024-03-15T15:30:30.0806781Z +4 88 16
2024-03-15T15:30:30.0806973Z +2 88 0
2024-03-15T15:30:30.0807303Z not ok 6 - qmanager-nodex: free/alloc node count (hinodex)

2024-03-15T15:30:30.0832122Z expecting success: 
2024-03-15T15:30:30.0832615Z     cat >status.expected2 <<-EOF &&
2024-03-15T15:30:30.0833130Z 	0 0 0
2024-03-15T15:30:30.0833446Z 	4 176 16
2024-03-15T15:30:30.0833770Z EOF
2024-03-15T15:30:30.0834161Z     flux resource list > resources.out2 &&
2024-03-15T15:30:30.0834930Z     cat resources.out2 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0835613Z         | awk "{ print \$2,\$3,\$4 }" > status.out2 &&
2024-03-15T15:30:30.0836241Z     test_cmp status.expected2 status.out2
2024-03-15T15:30:30.0836615Z 
2024-03-15T15:30:30.0836947Z --- status.expected2	2024-03-15 15:29:16.670810399 +0000
2024-03-15T15:30:30.0837648Z +++ status.out2	2024-03-15 15:29:16.830809797 +0000
2024-03-15T15:30:30.0838235Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0838607Z -0 0 0
2024-03-15T15:30:30.0838961Z -4 176 16
2024-03-15T15:30:30.0839303Z +4 0 16
2024-03-15T15:30:30.0839643Z +4 176 0
2024-03-15T15:30:30.0840206Z not ok 9 - qmanager-nodex: free/alloc node count 2 (hinodex)

2024-03-15T15:30:30.0865149Z expecting success: 
2024-03-15T15:30:30.0865545Z     cat >status.expected3 <<-EOF &&
2024-03-15T15:30:30.0865849Z 	2 88 8
2024-03-15T15:30:30.0866036Z 	2 88 8
2024-03-15T15:30:30.0866301Z EOF
2024-03-15T15:30:30.0866548Z     flux resource list > resources.out3 &&
2024-03-15T15:30:30.0867207Z     cat resources.out3 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0867639Z         | awk "{ print \$2,\$3,\$4 }" > status.out3 &&
2024-03-15T15:30:30.0868319Z     test_cmp status.expected3 status.out3
2024-03-15T15:30:30.0868701Z 
2024-03-15T15:30:30.0869053Z --- status.expected3	2024-03-15 15:29:17.794806161 +0000
2024-03-15T15:30:30.0869520Z +++ status.out3	2024-03-15 15:29:17.894805784 +0000
2024-03-15T15:30:30.0869875Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0870177Z -2 88 8
2024-03-15T15:30:30.0870528Z -2 88 8
2024-03-15T15:30:30.0870721Z +4 88 16
2024-03-15T15:30:30.0870906Z +2 88 0
2024-03-15T15:30:30.0871255Z not ok 16 - qmanager-nodex: free/alloc node count (lonodex)

2024-03-15T15:30:30.0880480Z expecting success: 
2024-03-15T15:30:30.0880794Z     cat >status.expected4 <<-EOF &&
2024-03-15T15:30:30.0881078Z 	0 0 0
2024-03-15T15:30:30.0881271Z 	4 176 16
2024-03-15T15:30:30.0881467Z EOF
2024-03-15T15:30:30.0881692Z     flux resource list > resources.out4 &&
2024-03-15T15:30:30.0882111Z     cat resources.out4 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0882511Z         | awk "{ print \$2,\$3,\$4 }" > status.out4 &&
2024-03-15T15:30:30.0882887Z     test_cmp status.expected4 status.out4
2024-03-15T15:30:30.0883127Z 
2024-03-15T15:30:30.0883381Z --- status.expected4	2024-03-15 15:29:18.218804562 +0000
2024-03-15T15:30:30.0883894Z +++ status.out4	2024-03-15 15:29:18.442803717 +0000
2024-03-15T15:30:30.0884254Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0884514Z -0 0 0
2024-03-15T15:30:30.0884909Z -4 176 16
2024-03-15T15:30:30.0885248Z +4 0 16
2024-03-15T15:30:30.0885587Z +4 176 0
2024-03-15T15:30:30.0886227Z not ok 19 - qmanager-nodex: free/alloc node count 2 (lonodex)


garlick avatar Mar 15 '24 15:03 garlick

Another thing to consider here is backwards compatibility (perhaps a good reason to keep the sched.resource-status RPC for a bit). E.g. if a newer instance of Flux is running under an older version, tools like flux jobs, flux top, and flux pstree would get ENOSYS when attempting to get the subinstance resource information.
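
One way tools could cope (a hedged sketch only, using the Python bindings) is to try the new topic first and fall back to the old one on ENOSYS:

# Hedged sketch of a backward-compatible client: try resource.status first and
# fall back to sched.resource-status when talking to an older instance.
import errno
import flux

def fetch_resource_status(h):
    try:
        return h.rpc("resource.status").get()
    except OSError as exc:
        if exc.errno != errno.ENOSYS:
            raise
        # Older instance: the scheduler still owns this information.
        return h.rpc("sched.resource-status").get()

print(fetch_resource_status(flux.Flux()).keys())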

grondo avatar Mar 15 '24 16:03 grondo

I'm looking into the ASan errors, which are reproducible but only in the CI environment. In this case, libasan is getting a SEGV while running make. Here's more info:

runner@fv-az523-355:/usr/src/src$ ASAN_OPTIONS=$ASAN_OPTIONS,verbosity=1 LD_PRELOAD=/usr/lib64/libasan.so.6 make
==134==AddressSanitizer: failed to intercept '__isoc99_printf'
==134==AddressSanitizer: failed to intercept '__isoc99_sprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_snprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_fprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vsprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vsnprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vfprintf'
==134==AddressSanitizer: failed to intercept 'xdr_quad_t'
==134==AddressSanitizer: failed to intercept 'xdr_u_quad_t'
==134==AddressSanitizer: failed to intercept 'xdr_destroy'
==134==AddressSanitizer: libc interceptors initialized
|| `[0x10007fff8000, 0x7fffffffffff]` || HighMem    ||
|| `[0x02008fff7000, 0x10007fff7fff]` || HighShadow ||
|| `[0x00008fff7000, 0x02008fff6fff]` || ShadowGap  ||
|| `[0x00007fff8000, 0x00008fff6fff]` || LowShadow  ||
|| `[0x000000000000, 0x00007fff7fff]` || LowMem     ||
MemToShadow(shadow): 0x00008fff7000 0x000091ff6dff 0x004091ff6e00 0x02008fff6fff
redzone=16
max_redzone=2048
quarantine_size_mb=256M
thread_local_quarantine_size_kb=1024K
malloc_context_size=30
SHADOW_SCALE: 3
SHADOW_GRANULARITY: 8
SHADOW_OFFSET: 0x7fff8000
==134==Installed the sigaction for signal 11
==134==Installed the sigaction for signal 7
==134==Installed the sigaction for signal 8
==134==Deactivating ASan
AddressSanitizer:DEADLYSIGNAL
=================================================================
==134==ERROR: AddressSanitizer: SEGV on unknown address 0x60cf8ad2c9f8 (pc 0x71fac17b5cb7 bp 0x000000000000 sp 0x7ffd602751c0 T0)
==134==The signal is caused by a READ memory access.
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.

I have no idea why this just started happening, nor what to try next. Maybe I will try moving the ASan build to Fedora 36 and see if that addresses this issue.

grondo avatar Mar 15 '24 20:03 grondo

Looking at the first sched test failure, the test fakes resources consisting of 4 nodes, each with 44 cores and 4 gpus. In the first failure, it allocates two nodes exclusively and expects flux resource list to show

  • free: 2 nodes, 88 cores, 8 gpus
  • alloc: 2 nodes, 88 cores, 8 gpus

But what it actually shows is

  • free: 4 nodes, 88 cores, 16 gpus
  • alloc: 2 nodes, 88 cores, 0 gpus

I think what may be going on is that librlist doesn't handle gpus for the things I'm asking it to do.

For completeness, the new objects returned are:

all

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0-3",
        "children": {
          "core": "0-43",
          "gpu": "0-3"
        }
      }
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": [
      "sierra[3682,3179,3683,3178]"
    ]
  }
}

alloc

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0,2",
        "children": {
          "core": "0-43"
        }
      }
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": [
      "sierra[3682-3683]"
    ]
  }
}
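
Tallying those two fragments by hand reproduces the wrong numbers from the failing test: because the alloc set has no gpu children, no gpus are counted as allocated, and ranks 0 and 2 still look partially free. A plain-Python illustration (not librlist, just arithmetic on the R_lite entries above):

# Illustration only: count what the two R fragments above describe.
# "all":   ranks 0-3, each with 44 cores (0-43) and 4 gpus (0-3)
# "alloc": ranks 0 and 2, each with 44 cores and no gpu children
all_r   = {rank: {"core": 44, "gpu": 4} for rank in range(4)}
alloc_r = {0: {"core": 44, "gpu": 0}, 2: {"core": 44, "gpu": 0}}

free_r = {}
for rank, res in all_r.items():
    used = alloc_r.get(rank, {"core": 0, "gpu": 0})
    left = {k: res[k] - used[k] for k in res}
    if any(left.values()):           # a rank counts as "free" if anything remains
        free_r[rank] = left

def summary(rs):
    return (len(rs),
            sum(r["core"] for r in rs.values()),
            sum(r["gpu"] for r in rs.values()))

print("free ", summary(free_r))    # (4, 88, 16) -- matches the failing output
print("alloc", summary(alloc_r))   # (2, 88, 0)  -- expected (2, 88, 8)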

garlick avatar Mar 16 '24 23:03 garlick

Uhoh, that sounds like an unfortunate bug :-(

grondo avatar Mar 17 '24 00:03 grondo

FWIW I was able to fix the sched test failures by returning the allocated resource set from the job-manager directly in resource.status, instead of marking those resources allocated in the full rlist and then extracting them again with rlist_copy_allocated().

Some other tests are failing now in flux-core that weren't before, so I need to sort that out and then post an update here.

garlick avatar Mar 20 '24 13:03 garlick

I wonder how much work it would be to properly support gpus in librlist? Perhaps that would then allow sched-simple to allocate them (currently it cannot).

My other worry here is that librlist starts to get pretty slow when there are a lot of resources. I worry this will have a significant impact on throughput (I know you were already planning on testing that, but just a heads up). For example, in the testing in flux-framework/flux-sched#1137, we saw sched.resource-status for sched-simple taking >2s for 16K nodes.
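
A rough, hedged way to measure that latency from the Python bindings:

# Hedged sketch: time the resource-status RPC on a running instance.
import time
import flux

h = flux.Flux()
t0 = time.monotonic()
h.rpc("sched.resource-status").get()    # or "resource.status" on this branch
print(f"RPC took {time.monotonic() - t0:.3f}s")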

grondo avatar Mar 20 '24 14:03 grondo

Ah: the allocated set from job-manager.resource-status does not include properties, so tests involving queues are failing in t2350-resource-list.t and t2801-top-cmd.t. I'm not sure if this is because sched-simple isn't returning them in the alloc response or because they are disappearing when passing through rlist_append(). Anyway, before this change they were included because the allocated set was carved out of the original R with rlist_copy_allocated().
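
A quick, hedged way to see which side drops them: per RFC 20, queue properties (if present) live under execution.properties of an R document, so one can dump the scheduler's R and the allocated R and compare:

# Hedged debugging sketch: report whether an R document (JSON text or a dict)
# carries properties; RFC 20 places them, optionally, at execution.properties
# as a mapping of property name to rank idset.
import json

def r_properties(r):
    if isinstance(r, str):
        r = json.loads(r)
    return r.get("execution", {}).get("properties", {})

# e.g. feed it the "all" and "alloc" objects shown earlier in this thread:
# print(r_properties(all_r_json), r_properties(alloc_r_json))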

garlick avatar Mar 20 '24 14:03 garlick

Ah, probably dropped in rlist_append and that seems like a bug :-(

grondo avatar Mar 20 '24 14:03 grondo

I wonder how much work it would be to properly support gpus in librlist? Perhaps that would then allow sched-simple to allocate them (currently it cannot).

Yeah that would allow the original technique I was using to work. Supporting gpus in sched-simple seems like a pretty big bonus as well!

My other worry here is that librlist starts to get pretty slow when there are a lot of resources.

I was hoping that the rlist_append() / rlist_diff() functions I am using in job-manager would not be too bad, but after I get this working, I'll probe that with a large resource set.

garlick avatar Mar 20 '24 14:03 garlick

Hmm, actually rlist_append() seems to pass properties through OK, but it doesn't look like sched-simple is returning queue properties in allocated Rs. Is it possible that was by design?

garlick avatar Mar 20 '24 16:03 garlick

but it doesn't look like sched-simple is returning queue properties in allocated Rs. Is it possible that was by design?

I'm not sure, but I did think properties were passed through an alloc, e.g. on fluke the queue properties are available in subinstances (whether this is right or not is questionable)

 grondo@fluke108:~$ flux alloc -N1 flux resource list
     STATE PROPERTIES NNODES   NCORES    NGPUS NODELIST
      free batch           1        4        0 fluke95
 allocated                 0        0        0 
      down                 0        0        0 
[detached: session exiting]

grondo avatar Mar 20 '24 16:03 grondo

Hmm, maybe something was wrong with my test then. Thanks for that!

Edit: oh, that would be fluxion though. Maybe sched-simple behaves differently.

garlick avatar Mar 20 '24 17:03 garlick

I added a workaround for now, in which properties are re-added to the allocated set by the resource module if missing.

garlick avatar Mar 20 '24 21:03 garlick

FWIW throughput.py is stable at about 21 jobs/sec with 16K nodes/32 cores each, on master and on this branch with sched-simple. :shrug:

garlick avatar Mar 21 '24 22:03 garlick

Oof. I guess sched-simple is so slow at that scale it doesn't matter. I'd like to see results for something smaller as well just in case though.

grondo avatar Mar 21 '24 22:03 grondo

Here are some more numbers. It does look like there is a negative effect.

throughput.py -n 1000 (each run in a new instance)

Node: 32 cores / 4 gpus

        fluxion simple  fluxion simple
        master  master  #5776   #5776
nnodes  thruput thruput thruput thruput
64      595.4   497.0   551.8   505.9
        625.9   523.2   570.5   524.1
        600.4   508.8   527.7   497.4
128     441.4   522.4   429.6   497.6
        468.7   495.4   440.9   485.9
        484.6   505.2   426.3   475.3
256     251.7   501.7   242.0   469.6
        254.7   499.7   238.6   481.4
        249.9   491.5   233.3   454.2
512     142.5   463.2   138.4   456.5
        143.6   482.7   138.2   456.8
        146.2   465.7   137.3   460.5
1024    69.0    443.1   67.8    431.7
        68.6    472.7   68.6    428.7
        68.6    466.5   68.4    449.3
2048    33.1    434.5   32.6    428.5
        32.8    458.5   32.8    428.3
        32.8    447.5   33.5    435.2
4096    16.9    301.5   16.5    285.4
        16.9    293.1   16.5    275.2
        16.8    288.0   16.6    283.2
8192    8.3     104.1   8.2     103.3
        8.3     108.2   8.4     97.7
        8.3     108.0   8.3     103.3
16384   4.1     24.2    4.0     24.0
        4.1     23.9    4.0     23.5
        4.0     24.3    4.1     24.0

garlick avatar Mar 22 '24 00:03 garlick

Hmm, maybe instead of keeping a "running R" in the job manager, it would be better (less impact on throughput) to simply gather the R's upon request and either combine them in the job manager or even ship them to the resource module and do it there.

garlick avatar Mar 22 '24 14:03 garlick

The impact is not as bad as I'd feared! The rlist implementation is kind of dumb and not optimized at all, so we could look into improving it. This would help a bit with #5819 too.

However, since each time a user runs flux resource it will interrupt the job manager, moving the resource-list service to the resource module (or another module), like we did for the job-list service, is not a bad idea IMO.

I'll also note that the slow sched.resource-status RPC is an immediate concern on existing clusters, so a solution that works within the next couple releases would be better than spending more time implementing the "perfect" solution (However, it would probably be bad to break compatibility multiple times)

Sorry, all that may not have been too helpful. Just throwing some thoughts out there.

grondo avatar Mar 22 '24 14:03 grondo

In case it wasn't clear, this PR already moves the contact point for the tools to the resource module, but the resource module doesn't do a whole lot except prepare the response. The RPC in the job manager that returns the allocated set is used exclusively by the resource module. So my comment above was about shifting some of the work from the job manager to the resource module to reduce the penalty on every allocation.

garlick avatar Mar 22 '24 15:03 garlick

So my comment above was about shifting some of the work from the job manager to the resource module to reduce the penalty on every allocation.

That makes sense, sorry I misunderstood. So on receipt of a resource.status request, the resource module then requests the allocated set from the job manager, but in the current implementation the job-manager has already prepared the allocated resource set, so all it has to do is return it? This is a pretty clever implementation IMO.

Shifting some of the work to the resource module seems like a good idea to me. Your comment about just shipping to the resource module makes a lot of sense now that I better understand.

grondo avatar Mar 22 '24 15:03 grondo

Just pushed the change discussed above. I started testing throughput and then realized this doesn't touch the critical path at all so there is little point.

The job manager RPC now just walks all active jobs, creating a union of all allocated Rs, which it returns. It's kind of dumb and simple, but if it turns out to be slow in some scenario (for example, if there are a huge number of active jobs), there are obvious ways to improve responsiveness, such as caching the result until the next alloc/free, keeping all jobs with resources in a list, etc.
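
The caching idea amounts to memoizing the union and invalidating it on every alloc/free; a minimal sketch in Python (not the actual C job-manager code):

# Minimal sketch of the caching idea (Python, not the real C implementation):
# recompute the union of allocated Rs lazily and throw it away whenever an
# alloc or free event changes what active jobs are holding.
class AllocatedRCache:
    def __init__(self, compute_union):
        self.compute_union = compute_union   # walks active jobs, unions their Rs
        self.cached = None

    def invalidate(self):                    # call on every alloc and free
        self.cached = None

    def get(self):
        if self.cached is None:
            self.cached = self.compute_union()
        return self.cached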

garlick avatar Mar 23 '24 02:03 garlick

I've updated the description with a few todos, assuming this approach is acceptable.

I'm leaning towards splitting the resource.status RPC into two RPCs again. Since the job manager query is now somewhat costly (depending on the number of active jobs), it probably makes sense to avoid it when only the previous resource.status fields are really wanted.

garlick avatar Mar 23 '24 19:03 garlick

FWIW, I really like the approach.

grondo avatar Mar 23 '24 19:03 grondo

Hmm, I'm seeing this test fail occasionally in CI. Just going to restart for now.

2024-03-24T20:01:08.8286402Z expecting success: flux job attach fMT36St
2024-03-24T20:01:08.8286847Z flux-job: task(s) exited with exit code 1
2024-03-24T20:01:08.8287722Z t2492: Re-launching test script under flux-start using /usr/src/src/cmd/flux
2024-03-24T20:01:08.8288202Z t2492: 
2024-03-24T20:01:08.8288460Z t2492: Testing handling of lost shell rank 3:
2024-03-24T20:01:08.8289067Z t2492: Sumbmitted job ffj4EwR. Waiting for shell rank 3 to be lost
2024-03-24T20:01:08.8291062Z [Mar24 20:00] header version=1 encoding={"stdout":"UTF-8","stderr":"UTF-8"} count={"stdout":4,"stderr":4} options={}
2024-03-24T20:01:08.8292925Z [  +0.166865] log rank=0 level=4 message="shell rank 3 (on fv-az701-331): Killed" component="exception" file="exception.c" line=49
2024-03-24T20:01:08.8294014Z t2492: Sending SIGINT to ffj4EwR. Job should now exit
2024-03-24T20:01:08.8294517Z 0.000s: job.submit {"userid":1001,"urgency":16,"flags":0,"version":1}
2024-03-24T20:01:08.8294951Z 0.011s: job.validate
2024-03-24T20:01:08.8295199Z 0.022s: job.depend
2024-03-24T20:01:08.8295456Z 0.023s: job.priority {"priority":16}
2024-03-24T20:01:08.8296020Z 0.024s: job.alloc {"annotations":{"sched":{"resource_summary":"rank[0-3]/core0"}}}
2024-03-24T20:01:08.8296504Z 0.043s: job.start
2024-03-24T20:01:08.8296785Z flux-job: task(s) Killed
2024-03-24T20:01:08.8297065Z 0.429s: job.finish {"status":9}
2024-03-24T20:01:08.8297351Z 0.025s: exec.init
2024-03-24T20:01:08.8297580Z 0.026s: exec.starting
2024-03-24T20:01:08.8298076Z 0.100s: exec.shell.init {"service":"1001-shell-ffj4EwR","leader-rank":0,"size":4}
2024-03-24T20:01:08.8298847Z 0.116s: exec.shell.start {"taskmap":{"version":1,"map":[[0,4,1,1]]}}
2024-03-24T20:01:08.8299916Z 0.410s: exec.shell.task-exit {"localid":0,"rank":0,"state":"Exited","pid":242661,"wait_status":0,"signaled":0,"exitcode":0}
2024-03-24T20:01:08.8300555Z 0.428s: exec.complete {"status":9}
2024-03-24T20:01:08.8300859Z 0.428s: exec.done
2024-03-24T20:01:08.8301395Z 0.258s: flux-shell[0]:  WARN: exception: exception.c:49: shell rank 3 (on fv-az701-331): Killed
2024-03-24T20:01:08.8301946Z waiting on barrier for 4 tasks
2024-03-24T20:01:08.8302234Z exited barrier
2024-03-24T20:01:08.8302527Z t2492: Job exited with rc=137 (expecting 137 (128+9))
2024-03-24T20:01:08.8302895Z t2492: 
2024-03-24T20:01:08.8303158Z t2492: Testing handling of lost shell rank 1:
2024-03-24T20:01:08.8303627Z t2492: Sumbmitted job fvYw4PR. Waiting for shell rank 1 to be lost
2024-03-24T20:01:08.8304752Z [Mar24 20:00] header version=1 encoding={"stdout":"UTF-8","stderr":"UTF-8"} count={"stdout":4,"stderr":4} options={}
2024-03-24T20:01:08.8306470Z [  +0.303951] log rank=0 level=4 message="shell rank 1 (on fv-az701-331): Killed" component="exception" file="exception.c" line=49
2024-03-24T20:01:08.8307555Z t2492: Sending SIGINT to fvYw4PR. Job should now exit
2024-03-24T20:01:08.8308049Z 0.000s: job.submit {"userid":1001,"urgency":16,"flags":0,"version":1}
2024-03-24T20:01:08.8308476Z 0.012s: job.validate
2024-03-24T20:01:08.8308723Z 0.023s: job.depend
2024-03-24T20:01:08.8308982Z 0.023s: job.priority {"priority":16}
2024-03-24T20:01:08.8309540Z 0.030s: job.alloc {"annotations":{"sched":{"resource_summary":"rank[0-3]/core0"}}}
2024-03-24T20:01:08.8310014Z 0.052s: job.start
2024-03-24T20:01:08.8310328Z flux-job: task(s) exited with exit code 130
2024-03-24T20:01:08.8310686Z 0.651s: job.finish {"status":33280}
2024-03-24T20:01:08.8310985Z 0.031s: exec.init
2024-03-24T20:01:08.8311213Z 0.032s: exec.starting
2024-03-24T20:01:08.8311706Z 0.140s: exec.shell.init {"service":"1001-shell-fvYw4PR","leader-rank":0,"size":4}
2024-03-24T20:01:08.8312313Z 0.153s: exec.shell.start {"taskmap":{"version":1,"map":[[0,4,1,1]]}}
2024-03-24T20:01:08.8313203Z 0.519s: exec.shell.task-exit {"localid":0,"rank":2,"state":"Exited","pid":243801,"wait_status":2,"signaled":2,"exitcode":130}
2024-03-24T20:01:08.8313848Z 0.651s: exec.complete {"status":33280}
2024-03-24T20:01:08.8314161Z 0.651s: exec.done
2024-03-24T20:01:08.8314799Z 0.398s: flux-shell[0]:  WARN: exception: exception.c:49: shell rank 1 (on fv-az701-331): Killed
2024-03-24T20:01:08.8315458Z Traceback (most recent call last):
2024-03-24T20:01:08.8316030Z   File "/tmp/flux-ancNrP/jobtmp-2-fvYw4PR/critical.py", line 4, in <module>
2024-03-24T20:01:08.8316504Z     import flux
2024-03-24T20:01:08.8316912Z   File "/usr/src/src/bindings/python/flux/__init__.py", line 14, in <module>
2024-03-24T20:01:08.8317409Z     import flux.core.handle
2024-03-24T20:01:08.8317886Z   File "/usr/src/src/bindings/python/flux/core/handle.py", line 19, in <module>
2024-03-24T20:01:08.8318398Z     from flux.future import Future
2024-03-24T20:01:08.8318876Z   File "/usr/src/src/bindings/python/flux/future.py", line 12, in <module>
2024-03-24T20:01:08.8319353Z     import json
2024-03-24T20:01:08.8319709Z   File "/usr/lib/python3.11/json/__init__.py", line 106, in <module>
2024-03-24T20:01:08.8320211Z     from .decoder import JSONDecoder, JSONDecodeError
2024-03-24T20:01:08.8320727Z   File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
2024-03-24T20:01:08.8321309Z   File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
2024-03-24T20:01:08.8322138Z   File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
2024-03-24T20:01:08.8322733Z   File "<frozen importlib._bootstrap_external>", line 936, in exec_module
2024-03-24T20:01:08.8323325Z   File "<frozen importlib._bootstrap_external>", line 1032, in get_code
2024-03-24T20:01:08.8323887Z   File "<frozen importlib._bootstrap_external>", line 1130, in get_data
2024-03-24T20:01:08.8324328Z KeyboardInterrupt
2024-03-24T20:01:08.8324596Z Traceback (most recent call last):
2024-03-24T20:01:08.8325145Z   File "/tmp/flux-ancNrP/jobtmp-0-fvYw4PR/critical.py", line 4, in <module>
2024-03-24T20:01:08.8325619Z     import flux
2024-03-24T20:01:08.8326034Z   File "/usr/src/src/bindings/python/flux/__init__.py", line 14, in <module>
2024-03-24T20:01:08.8326534Z     import flux.core.handle
2024-03-24T20:01:08.8327024Z   File "/usr/src/src/bindings/python/flux/core/handle.py", line 19, in <module>
2024-03-24T20:01:08.8327558Z     from flux.future import Future
2024-03-24T20:01:08.8328040Z   File "/usr/src/src/bindings/python/flux/future.py", line 16, in <module>
2024-03-24T20:01:08.8328618Z     from flux.util import check_future_error, interruptible
2024-03-24T20:01:08.8329170Z   File "/usr/src/src/bindings/python/flux/util.py", line 15, in <module>
2024-03-24T20:01:08.8329629Z     import glob
2024-03-24T20:01:08.8329987Z   File "<frozen importlib._bootstrap>", line 1175, in _find_and_load
2024-03-24T20:01:08.8330514Z   File "<frozen importlib._bootstrap>", line 171, in __enter__
2024-03-24T20:01:08.8331000Z   File "<frozen importlib._bootstrap>", line 100, in acquire
2024-03-24T20:01:08.8331395Z KeyboardInterrupt
2024-03-24T20:01:08.8331710Z t2492: Job exited with rc=130 (expecting 137 (128+9))
2024-03-24T20:01:08.8332092Z t2492: Unexpected job exit code 130
2024-03-24T20:01:08.8332744Z Mar 24 20:00:30.720993 broker.err[0]: rc2.0: /usr/src/t/issues/t2492-shell-lost.sh Exited (rc=1) 1.8s
2024-03-24T20:01:08.8333392Z flux-start: 0 (pid 236602) exited with rc=1
2024-03-24T20:01:08.8334058Z Mar 24 20:00:32.653770 broker.err[0]: rc2.0: /usr/src/t/issues/t2492-shell-lost.sh Exited (rc=1) 5.6s
2024-03-24T20:01:08.8334651Z not ok 29 - t2492-shell-lost

garlick avatar Mar 24 '24 20:03 garlick

I've been seeing that one regularly in CI, but haven't been able to reproduce locally. It appears the Python test script in the shell-lost test is getting sent SIGINT while it is still importing Python modules, which doesn't make sense because there's a barrier in there :confused:

One thing we can do is move the try: to the beginning of the script; see the sketch below.
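
That is, something of this shape (hedged sketch only; the real critical.py differs), so a SIGINT that lands while modules are still importing is caught instead of unwinding the interpreter with rc=130:

# Hedged sketch only -- the real critical.py differs.  The point is that the
# try: precedes the imports, so a KeyboardInterrupt delivered during module
# loading is still handled rather than terminating with exit code 130.
try:
    import flux            # previously SIGINT could arrive during this import
    # ... rest of the test script body ...
except KeyboardInterrupt:
    pass                   # tolerate an early interrupt; let the test's
                           # expected signal/exit path decide the outcome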

grondo avatar Mar 24 '24 20:03 grondo

which doesn't make sense because there's a barrier in there 😕

D'oh, the libflux flux_barrier() returns a future, but the Python handle barrier method returns nothing. So this is not really a barrier at all...

grondo avatar Mar 24 '24 21:03 grondo

I think this is more or less ready for a review.

garlick avatar Mar 24 '24 21:03 garlick

(LMK if you want me to squash those)

garlick avatar Mar 24 '24 21:03 garlick

Feel free to squash those!

grondo avatar Mar 24 '24 23:03 grondo