
job-manager: problem with alloc queue on elcap

grondo opened this issue May 19 '24

On elcap no jobs are being scheduled and flux queue status -vvv shows:

0 alloc requests queued
2540 alloc requests pending to scheduler
34 running jobs

and after stopping all queues:

0 alloc requests queued
2530 alloc requests pending to scheduler
34 running jobs

The admins have been reconfiguring queues by reloading the scheduler and core resource modules. I wonder if this could be the cause.

I can't prove that this is the reason no jobs are being scheduled, but it seems likely.

Possibly related: #5964?

grondo avatar May 19 '24 22:05 grondo

Forgot to mention that the actual number of pending jobs appears to be 25:

# flux jobs -Af pending | wc -l
25
# flux module stats job-manager 
{
 "journal": {
  "listeners": 1
 },
 "active_jobs": 57,
 "inactive_jobs": 16598,
 "max_jobid": 402543535884075008
}

grondo avatar May 19 '24 22:05 grondo

Actually, I'm unsure that this is causing the scheduling issue, since the job manager should be sending unlimited alloc requests to the scheduler, meaning the pending requests are already in the scheduler's hands. I'm a bit stumped at this point.

grondo avatar May 19 '24 23:05 grondo

FYI - reloading fluxion modules resolved this situation:

0 alloc requests queued
1 alloc requests pending to scheduler
41 running jobs
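
For the record, a reload along these lines clears the stuck allocs (this is a sketch of the sequence, not a transcript of what was run on elcap; on a system instance the commands would be run with the appropriate privileges):

$ flux module remove sched-fluxion-qmanager
$ flux module reload sched-fluxion-resource
$ flux module load sched-fluxion-qmanager

When qmanager is reloaded it re-registers with the job-manager, which should resend alloc requests for any pending jobs.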

grondo avatar May 19 '24 23:05 grondo

I was able to reproduce this on my test system with a node that was down:

  • fluxion with easy queue policy
  • multiple queues
  • node was down before flux started

To reproduce:

  • submit a job with --requires=host:X where X is the down node
  • cancel it
  • observe leaked alloc in flux queue status

Still probing to determine which of the above characteristics are actually required to reproduce. So far I've confirmed it does not reproduce in a sub-instance with a drained node, no queues, and the fcfs queue policy.
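
For completeness, the steps above as commands (a sketch; picl1 stands in for the down node X, and queue/size options are omitted since it's not yet clear which of them matter):

$ flux submit --requires=host:picl1 hostname
$ flux cancel $(flux job last)
$ flux queue status -v

The last command shows the leaked alloc request (a nonzero "alloc requests pending to scheduler" count with no pending jobs).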

garlick avatar May 20 '24 14:05 garlick

Well, it seems neither --requires=host:X nor starting flux with a node down is required. It's enough to simply drain a node, submit a job that can't run without the drained node, then cancel it.

 garlick@picl0:~$ sudo flux resource drain picl1
 garlick@picl0:~$ flux submit -N2 -q debug hostname
ƒ2deLurvY3R
 garlick@picl0:~$ flux jobs
       JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
 ƒ2deLurvY3R debug    garlick  hostname    S      2      2      30s 
 garlick@picl0:~$ flux queue status -v
debug: Job submission is enabled
debug: Scheduling is started
all: Job submission is enabled
all: Scheduling is started
admin: Job submission is enabled
admin: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs
 garlick@picl0:~$ flux cancel $(flux job last)
 garlick@picl0:~$ flux jobs
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
 garlick@picl0:~$ flux queue status -v
debug: Job submission is enabled
debug: Scheduling is started
all: Job submission is enabled
all: Scheduling is started
admin: Job submission is enabled
admin: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs

garlick avatar May 20 '24 16:05 garlick

OK, it's trivially reproducible in a standalone flux instance with no queues if the default queue policy of fcfs is changed to easy:

$ cat fluxion.toml
[sched-fluxion-qmanager]
queue-policy = "easy"
$ flux start -s2 -o,--conf=fluxion.toml
$ flux resource drain 0
$ flux submit -N2 hostname
ƒ2VGsWxto
$ flux cancel $(flux job last)
May 20 09:48:27.112761 sched-fluxion-qmanager.err[0]: jobmanager_cancel_cb: remove job (3284324581376): No such file or directory
$ flux jobs
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
$ flux queue status -v
Job submission is enabled
Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs
$ exit
<hang>
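
A minimal sketch of the same reproducer run non-interactively (assuming the same fluxion.toml as above; since the interactive instance hangs at exit, this wrapped version may hang on completion as well):

$ flux start -s2 -o,--conf=fluxion.toml bash -c '
    flux resource drain 0
    flux submit -N2 hostname
    flux cancel $(flux job last)
    flux queue status -v'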

garlick avatar May 20 '24 16:05 garlick

cc @trws since it was surprising that jobs were not scheduled by Fluxion when we hit this issue.

grondo avatar May 23 '24 15:05 grondo

Thanks @grondo, it's on my list for today to look into this. I think it should be solved by the change I pushed the other day to deal with flux-framework/flux-sched#1208, but it's so easy for this to go wrong in unexpected ways that I want to be completely sure.

trws avatar May 23 '24 15:05 trws

Let's close this issue. I opened flux-framework/flux-sched#1210 for the sched follow-up.

garlick avatar May 24 '24 17:05 garlick