flux-core
job-manager: problem with alloc queue on elcap
On elcap no jobs are being scheduled and flux queue status -vvv shows:
0 alloc requests queued
2540 alloc requests pending to scheduler
34 running jobs
and after stopping all queues:
0 alloc requests queued
2530 alloc requests pending to scheduler
34 running jobs
The admins have been reconfiguring queues by reloading the scheduler and core resource modules. I wonder if this could be the cause.
I also cannot prove that this is the reason no jobs are being scheduled, but it seems likely.
Possibly related: #5964?
Forgot to mention that the actual number of pending jobs is 25:
# flux jobs -Af pending | wc -l
25
# flux module stats job-manager
{
"journal": {
"listeners": 1
},
"active_jobs": 57,
"inactive_jobs": 16598,
"max_jobid": 402543535884075008
}
Actually I'm unsure that this is causing the scheduling issue because the job manager should be sending unlimited alloc requests to the scheduler. A bit stumped at this point.
FYI - reloading fluxion modules resolved this situation:
0 alloc requests queued
1 alloc requests pending to scheduler
41 running jobs
I was able to reproduce this on my test system with a node that was down:
- fluxion with easy queue policy
- multiple queues
- node was down before flux started
To reproduce:
- submit a job with `--requires=host:X` where X is the down node
- cancel it
- observe the leaked alloc in `flux queue status`
Still probing to determine which of the above characteristics are actually required to reproduce. So far I've confirmed it does not reproduce in a sub-instance with a drained node, no queues, and the fcfs queue policy.
Well, it seems neither `--requires=host:X` nor starting flux with a node down is required. It suffices to simply drain a node, submit a job that can't run without one more node, then cancel it.
garlick@picl0:~$ sudo flux resource drain picl1
garlick@picl0:~$ flux submit -N2 -q debug hostname
ƒ2deLurvY3R
garlick@picl0:~$ flux jobs
JOBID QUEUE USER NAME ST NTASKS NNODES TIME INFO
ƒ2deLurvY3R debug garlick hostname S 2 2 30s
garlick@picl0:~$ flux queue status -v
debug: Job submission is enabled
debug: Scheduling is started
all: Job submission is enabled
all: Scheduling is started
admin: Job submission is enabled
admin: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs
garlick@picl0:~$ flux cancel $(flux job last)
garlick@picl0:~$ flux jobs
JOBID USER NAME ST NTASKS NNODES TIME INFO
garlick@picl0:~$ flux queue status -v
debug: Job submission is enabled
debug: Scheduling is started
all: Job submission is enabled
all: Scheduling is started
admin: Job submission is enabled
admin: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs
OK, it's trivially reproducible in a standalone flux instance with no queues if the default queue policy of fcfs is changed to easy:
$ cat fluxion.toml
[sched-fluxion-qmanager]
queue-policy = "easy"
$ flux start -s2 -o,--conf=fluxion.toml
$ flux resource drain 0
$ flux submit -N2 hostname
ƒ2VGsWxto
$ flux cancel $(flux job last)
May 20 09:48:27.112761 sched-fluxion-qmanager.err[0]: jobmanager_cancel_cb: remove job (3284324581376): No such file or directory
$ flux jobs
JOBID USER NAME ST NTASKS NNODES TIME INFO
$ flux queue status -v
Job submission is enabled
Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs
$ exit
<hang>
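A toy model may help show why the count leaks. In the transcript above, qmanager's cancel callback fails with `No such file or directory`, and the "pending to scheduler" count is only ever decremented on a successful cancel. The sketch below illustrates that bookkeeping pattern; all names here are hypothetical and this is not flux-core or flux-sched code:

```python
# Toy model of a pending-alloc counter that leaks when the scheduler
# has already lost track of the job being canceled (hypothetical names,
# not actual flux internals).

class Scheduler:
    def __init__(self):
        self.jobs = set()

    def enqueue(self, jobid):
        self.jobs.add(jobid)

    def remove(self, jobid):
        # Models "jobmanager_cancel_cb: remove job (...): No such file
        # or directory" -- the job is no longer in the internal queue.
        if jobid not in self.jobs:
            return False
        self.jobs.discard(jobid)
        return True

class JobManager:
    def __init__(self, sched):
        self.sched = sched
        self.pending = 0  # "N alloc requests pending to scheduler"

    def alloc(self, jobid):
        self.pending += 1
        self.sched.enqueue(jobid)

    def cancel(self, jobid):
        # Decrement only on a successful cancel; a failed remove
        # leaks one count, matching the stuck "1 pending" symptom.
        if self.sched.remove(jobid):
            self.pending -= 1

sched = Scheduler()
jm = JobManager(sched)
jm.alloc(42)
sched.jobs.discard(42)  # scheduler forgets the job (the bug condition)
jm.cancel(42)
print(jm.pending)       # counter is stuck at 1
```

If something like this is what's happening, the fix presumably belongs on the sched side (decrement or reconcile even when the remove fails), which matches the follow-up below.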
cc @trws since it was surprising that jobs were not scheduled by Fluxion when we hit this issue.
Thanks @grondo, on my list for today to look into this. I think it should be solved by the change I pushed in the other day to deal with flux-framework/flux-sched#1208, but it's so easy for this to go wrong in unexpected ways I want to be completely sure.
Let's close this issue. I opened flux-framework/flux-sched#1210 for the sched follow-up.