
job-manager: problem with alloc queue on elcap

grondo opened this issue May 19 '24

On elcap no jobs are being scheduled and flux queue status -vvv shows:

0 alloc requests queued
2540 alloc requests pending to scheduler
34 running jobs

and after stopping all queues:

0 alloc requests queued
2530 alloc requests pending to scheduler
34 running jobs

The admins have been reconfiguring queues by reloading the scheduler and core resource modules. I wonder if this could be the cause.

I can't prove that this is the reason no jobs are being scheduled, but it seems likely.

Possibly related: #5964?

grondo avatar May 19 '24 22:05 grondo

Forgot to mention that the actual number of pending jobs appears to be 25:

# flux jobs -Af pending | wc -l
25
# flux module stats job-manager 
{
 "journal": {
  "listeners": 1
 },
 "active_jobs": 57,
 "inactive_jobs": 16598,
 "max_jobid": 402543535884075008
}

grondo avatar May 19 '24 22:05 grondo

Actually, I'm unsure that this is causing the scheduling issue, since the job manager should be sending unlimited alloc requests to the scheduler, meaning the pending requests are already in the scheduler's hands. I'm a bit stumped at this point.

grondo avatar May 19 '24 23:05 grondo

FYI - reloading fluxion modules resolved this situation:

0 alloc requests queued
1 alloc requests pending to scheduler
41 running jobs
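
For the record, a reload along these lines clears the stuck allocs (this is a sketch of the sequence, not a transcript of what was run on elcap; on a system instance the commands would be run with the appropriate privileges):

$ flux module remove sched-fluxion-qmanager
$ flux module reload sched-fluxion-resource
$ flux module load sched-fluxion-qmanager

When qmanager is reloaded it re-registers with the job-manager, which should resend alloc requests for any pending jobs.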

grondo avatar May 19 '24 23:05 grondo

I was able to reproduce this on my test system with a node that was down:

  • fluxion with easy queue policy
  • multiple queues
  • node was down before flux started

To reproduce:

  • submit a job with --requires=host:X where X is the down node
  • cancel it
  • observe leaked alloc in flux queue status

Still probing to determine which of the above characteristics are actually required to reproduce. So far I've confirmed it does not reproduce in a sub-instance with a drained node, no queues, and the fcfs queue policy.
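
For completeness, the steps above as commands (a sketch; picl1 stands in for the down node X, and queue/size options are omitted since it's not yet clear which of them matter):

$ flux submit --requires=host:picl1 hostname
$ flux cancel $(flux job last)
$ flux queue status -v

The last command shows the leaked alloc request (a nonzero "alloc requests pending to scheduler" count with no pending jobs).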

garlick avatar May 20 '24 14:05 garlick

Well, it seems neither --requires=host:X nor starting flux with a node down is required. It's enough to simply drain a node, submit a job that can't run without the drained node, then cancel it.

 garlick@picl0:~$ sudo flux resource drain picl1
 garlick@picl0:~$ flux submit -N2 -q debug hostname
ƒ2deLurvY3R
 garlick@picl0:~$ flux jobs
       JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
 ƒ2deLurvY3R debug    garlick  hostname    S      2      2      30s 
 garlick@picl0:~$ flux queue status -v
debug: Job submission is enabled
debug: Scheduling is started
all: Job submission is enabled
all: Scheduling is started
admin: Job submission is enabled
admin: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs
 garlick@picl0:~$ flux cancel $(flux job last)
 garlick@picl0:~$ flux jobs
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
 garlick@picl0:~$ flux queue status -v
debug: Job submission is enabled
debug: Scheduling is started
all: Job submission is enabled
all: Scheduling is started
admin: Job submission is enabled
admin: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs

garlick avatar May 20 '24 16:05 garlick

OK, it's trivially reproducible in a standalone flux instance with no queues if the default queue policy of fcfs is changed to easy:

$ cat fluxion.toml
[sched-fluxion-qmanager]
queue-policy = "easy"
$ flux start -s2 -o,--conf=fluxion.toml
$ flux resource drain 0
$ flux submit -N2 hostname
ƒ2VGsWxto
$ flux cancel $(flux job last)
May 20 09:48:27.112761 sched-fluxion-qmanager.err[0]: jobmanager_cancel_cb: remove job (3284324581376): No such file or directory
$ flux jobs
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
$ flux queue status -v
Job submission is enabled
Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs
$ exit
<hang>
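
A minimal sketch of the same reproducer run non-interactively (assuming the same fluxion.toml as above; since the interactive instance hangs at exit, this wrapped version may hang on completion as well):

$ flux start -s2 -o,--conf=fluxion.toml bash -c '
    flux resource drain 0
    flux submit -N2 hostname
    flux cancel $(flux job last)
    flux queue status -v'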

garlick avatar May 20 '24 16:05 garlick

cc @trws since it was surprising that jobs were not scheduled by Fluxion when we hit this issue.

grondo avatar May 23 '24 15:05 grondo

Thanks @grondo, it's on my list for today to look into this. I think it should be solved by the change I pushed the other day to deal with flux-framework/flux-sched#1208, but it's so easy for this to go wrong in unexpected ways that I want to be completely sure.

trws avatar May 23 '24 15:05 trws

Let's close this issue. I opened flux-framework/flux-sched#1210 for the sched follow-up.

garlick avatar May 24 '24 17:05 garlick