HyperQueue submission gets stuck if workers fail
I have been observing this situation for some time: HyperQueue works fine for a while, and then something happens. Jobs arrive in HyperQueue and are buffered there, but no worker allocations are submitted, even though the allocation queue has its backlog set to 10. The situation from this morning:
- there are no jobs in the batch system:
[[email protected] ~]$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
- the list of workers confirms that:
[[email protected] ~]$ hq-v0.19.0-linux-x64/hq worker list
+----+-------+----------+-----------+---------+----------------+
| ID | State | Hostname | Resources | Manager | Manager Job ID |
+----+-------+----------+-----------+---------+----------------+
- there are enough jobs buffered:
[[email protected] ~]$ hq-v0.19.0-linux-x64/hq job list
+------+------+---------+-------+
| ID | Name | State | Tasks |
+------+------+---------+-------+
| 4565 | bash | WAITING | 1 |
| 4567 | bash | WAITING | 1 |
| 4573 | bash | WAITING | 1 |
| 4584 | bash | WAITING | 1 |
| 4596 | bash | WAITING | 1 |
| 4597 | bash | WAITING | 1 |
| 4600 | bash | WAITING | 1 |
| 4609 | bash | WAITING | 1 |
| 4618 | bash | WAITING | 1 |
| 4626 | bash | WAITING | 1 |
| 4634 | bash | WAITING | 1 |
| 4635 | bash | WAITING | 1 |
| 4636 | bash | WAITING | 1 |
| 4637 | bash | WAITING | 1 |
| 4638 | bash | WAITING | 1 |
| 4639 | bash | WAITING | 1 |
| 4640 | bash | WAITING | 1 |
| 4641 | bash | WAITING | 1 |
| 4642 | bash | WAITING | 1 |
| 4644 | bash | WAITING | 1 |
| 4645 | bash | WAITING | 1 |
| 4648 | bash | WAITING | 1 |
| 4651 | bash | WAITING | 1 |
| 4652 | bash | WAITING | 1 |
| 4653 | bash | WAITING | 1 |
| 4654 | bash | WAITING | 1 |
| 4655 | bash | WAITING | 1 |
| 4656 | bash | WAITING | 1 |
| 4657 | bash | WAITING | 1 |
| 4659 | bash | WAITING | 1 |
| 4660 | bash | WAITING | 1 |
| 4663 | bash | WAITING | 1 |
| 4664 | bash | WAITING | 1 |
| 4666 | bash | WAITING | 1 |
| 4668 | bash | WAITING | 1 |
| 4670 | bash | WAITING | 1 |
| 4674 | bash | WAITING | 1 |
| 4677 | bash | WAITING | 1 |
| 4679 | bash | WAITING | 1 |
| 4680 | bash | WAITING | 1 |
| 4682 | bash | WAITING | 1 |
| 4683 | bash | WAITING | 1 |
| 4684 | bash | WAITING | 1 |
| 4685 | bash | WAITING | 1 |
| 4688 | bash | WAITING | 1 |
| 4689 | bash | WAITING | 1 |
| 4690 | bash | WAITING | 1 |
| 4691 | bash | WAITING | 1 |
| 4692 | bash | WAITING | 1 |
| 4693 | bash | WAITING | 1 |
| 4694 | bash | WAITING | 1 |
| 4695 | bash | WAITING | 1 |
| 4696 | bash | WAITING | 1 |
| 4697 | bash | WAITING | 1 |
| 4698 | bash | WAITING | 1 |
| 4699 | bash | WAITING | 1 |
| 4700 | bash | WAITING | 1 |
| 4701 | bash | WAITING | 1 |
| 4702 | bash | WAITING | 1 |
| 4703 | bash | WAITING | 1 |
| 4704 | bash | WAITING | 1 |
| 4705 | bash | WAITING | 1 |
| 4706 | bash | WAITING | 1 |
| 4707 | bash | WAITING | 1 |
| 4708 | bash | WAITING | 1 |
| 4709 | bash | WAITING | 1 |
| 4710 | bash | WAITING | 1 |
+------+------+---------+-------+
There are 4710 jobs in total. Use `--all` to display all jobs.
- the allocation queue is configured with a backlog of 10 and is running:
[[email protected] ~]$ hq-v0.19.0-linux-x64/hq alloc list
+----+---------+--------------+-------------------+------------------+-----------+---------+------+----------------------------------+
| ID | State | Backlog size | Workers per alloc | Max worker count | Timelimit | Manager | Name | Args |
+----+---------+--------------+-------------------+------------------+-----------+---------+------+----------------------------------+
| 1 | RUNNING | 10 | 1 | 0 | 12h | SLURM | | -AOPEN-29-6,-pqcpu_preempt,-c128 |
+----+---------+--------------+-------------------+------------------+-----------+---------+------+----------------------------------+
- the debug log shows that HyperQueue cannot find the allocations it submitted:
[2024-10-01T06:07:58.724Z DEBUG hyperqueue::server::autoalloc::process] Attempt to refresh allocations of queue 1
[2024-10-01T06:07:58.724Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739716`
[2024-10-01T06:07:58.734Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739715`
[2024-10-01T06:07:58.743Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739718`
[2024-10-01T06:07:58.753Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739717`
[2024-10-01T06:07:58.762Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739711`
[2024-10-01T06:07:58.771Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739702`
[2024-10-01T06:07:58.779Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739704`
[2024-10-01T06:07:58.787Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739710`
[2024-10-01T06:07:58.796Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739700`
[2024-10-01T06:07:58.804Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739712`
[2024-10-01T06:07:58.812Z DEBUG hyperqueue::server::autoalloc::process] Allocations of 1 have been refreshed: Ok(Map({"1739710": Err(scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout: ), "1739717": Err(scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout: ), "1739711": Err(scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout: ), "1739700": Err(scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout: ), "1739715": Err(scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout: ), "1739704": Err(scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout: ), "1739718": Err(scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout: ), "1739702": Err(scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout: ), "1739712": Err(scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout: ), "1739716": Err(scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout: )}))
[2024-10-01T06:07:58.812Z WARN hyperqueue::server::autoalloc::process] Could not get status of allocation 1739716: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout:
[2024-10-01T06:07:58.812Z WARN hyperqueue::server::autoalloc::process] Could not get status of allocation 1739715: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout:
[2024-10-01T06:07:58.812Z WARN hyperqueue::server::autoalloc::process] Could not get status of allocation 1739718: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout:
[2024-10-01T06:07:58.812Z WARN hyperqueue::server::autoalloc::process] Could not get status of allocation 1739717: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout:
[2024-10-01T06:07:58.812Z WARN hyperqueue::server::autoalloc::process] Could not get status of allocation 1739711: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout:
[2024-10-01T06:07:58.812Z WARN hyperqueue::server::autoalloc::process] Could not get status of allocation 1739702: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout:
[2024-10-01T06:07:58.812Z WARN hyperqueue::server::autoalloc::process] Could not get status of allocation 1739704: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout:
[2024-10-01T06:07:58.812Z WARN hyperqueue::server::autoalloc::process] Could not get status of allocation 1739710: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout:
[2024-10-01T06:07:58.812Z WARN hyperqueue::server::autoalloc::process] Could not get status of allocation 1739700: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout:
[2024-10-01T06:07:58.812Z WARN hyperqueue::server::autoalloc::process] Could not get status of allocation 1739712: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
Stdout:
[2024-10-01T06:07:58.812Z DEBUG hyperqueue::server::autoalloc::process] Task state: ServerTaskState { jobs: Map({JobId(4663): 1, JobId(4596): 1, JobId(4657): 1, JobId(4651): 1, JobId(4584): 1, JobId(4645): 1, JobId(4706): 1, JobId(4639): 1, JobId(4700): 1, JobId(4694): 1, JobId(4688): 1, JobId(4682): 1, JobId(4609): 1, JobId(4670): 1, JobId(4664): 1, JobId(4597): 1, JobId(4652): 1, JobId(4707): 1, JobId(4640): 1, JobId(4573): 1, JobId(4634): 1, JobId(4567): 1, JobId(4701): 1, JobId(4695): 1, JobId(4689): 1, JobId(4683): 1, JobId(4677): 1, JobId(4659): 1, JobId(4653): 1, JobId(4708): 1, JobId(4641): 1, JobId(4702): 1, JobId(4635): 1, JobId(4696): 1, JobId(4690): 1, JobId(4684): 1, JobId(4666): 1, JobId(4660): 1, JobId(4654): 1, JobId(4648): 1, JobId(4709): 1, JobId(4642): 1, JobId(4703): 1, JobId(4636): 1, JobId(4697): 1, JobId(4691): 1, JobId(4685): 1, JobId(4618): 1, JobId(4679): 1, JobId(4600): 1, JobId(4655): 1, JobId(4710): 1, JobId(4704): 1, JobId(4637): 1, JobId(4698): 1, JobId(4692): 1, JobId(4680): 1, JobId(4674): 1, JobId(4668): 1, JobId(4656): 1, JobId(4644): 1, JobId(4705): 1, JobId(4638): 1, JobId(4699): 1, JobId(4565): 1, JobId(4626): 1, JobId(4693): 1}) }, max. workers to spawn: 18446744073709551615
- checking the details of one job (1739716), I see that it failed quickly:
[[email protected] ~]$ sacct -j 1739716 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,Reqmem,nnodes,ncpus,nodelist
User JobID JobName Partition State Timelimit Start End Elapsed MaxRSS MaxVMSize ReqMem NNodes NCPUS NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- ---------- -------- ---------- ---------------
svatosm 1739716 hq-1-2221 qcpu_pree+ FAILED 12:00:00 2024-09-30T07:10:19 2024-09-30T07:10:20 00:00:01 230G 1 128 cn123
1739716.bat+ batch FAILED 2024-09-30T07:10:19 2024-09-30T07:10:20 00:00:01 0 2136K 1 128 cn123
1739716.ext+ extern COMPLETED 2024-09-30T07:10:19 2024-09-30T07:10:20 00:00:01 0 52K 1 128 cn123
- According to the stderr, the reason the job failed so fast was Permission Denied errors on /tmp (i.e. something wrong with the node). But this should not block submissions forever. Creating a new allocation queue makes submissions start instantly.
Hi, sorry for the late reply, I'm quite busy at the moment. The automatic allocator has an internal rate limiter that can cause allocations to be stopped if too many failures occur. I'm not sure if that's what is causing this issue though. Are you still using HQ_AUTOALLOC_MAX_ALLOCATION_FAILS?
Yes, the HQ server is started by
RUST_LOG=hyperqueue=debug HQ_AUTOALLOC_MAX_ALLOCATION_FAILS=100000 /home/svatosm/hq-v0.19.0-linux-x64/hq server start --journal /home/svatosm/hqJournal 2> /home/svatosm/hq-debug-output.log &
Could you please send me the full log?
Sure, log from that time period is at https://www.fzu.cz/~svatosm/hq-debug-output.log
Thank you. It indeed looks like at least one of the issues is rate limiting. Apart from the crash count, there is an additional rate limiter that kicks in after too many allocation or submission failures occur in a row. This rate limiter then disallows further submissions until one hour has elapsed. This is clearly not ideal for the case of pre-emptible allocations.
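Such a cooldown-based limiter can be sketched roughly as follows (illustrative names, threshold, and structure only; this is not HyperQueue's actual implementation):

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch: after `max_failures` consecutive submission
/// failures, block further submissions until `cooldown` has elapsed.
struct SubmissionLimiter {
    max_failures: u32,
    cooldown: Duration,
    consecutive_failures: u32,
    blocked_since: Option<Instant>,
}

impl SubmissionLimiter {
    fn new(max_failures: u32, cooldown: Duration) -> Self {
        Self {
            max_failures,
            cooldown,
            consecutive_failures: 0,
            blocked_since: None,
        }
    }

    /// A successful submission resets the failure streak and unblocks.
    fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.blocked_since = None;
    }

    /// A failed submission extends the streak; too many trigger the cooldown.
    fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.max_failures {
            self.blocked_since = Some(now);
        }
    }

    /// Submissions are allowed unless we are inside the cooldown window.
    fn can_submit(&self, now: Instant) -> bool {
        match self.blocked_since {
            Some(start) => now.duration_since(start) >= self.cooldown,
            None => true,
        }
    }
}

fn main() {
    // Three failures in a row trigger a one-hour cooldown.
    let mut limiter = SubmissionLimiter::new(3, Duration::from_secs(3600));
    let t0 = Instant::now();
    for _ in 0..3 {
        limiter.record_failure(t0);
    }
    assert!(!limiter.can_submit(t0));
    // Once the cooldown has elapsed, submissions are allowed again.
    assert!(limiter.can_submit(t0 + Duration::from_secs(3600)));
    println!("cooldown limiter behaves as expected");
}
```

With a scheme like this, a burst of failed sbatch calls (e.g. due to a broken node) pauses the queue for a fixed window instead of stopping it permanently.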
One additional issue that I noticed in the log is that some of your tasks are hitting the crash limit. This is used to prevent re-running tasks that might crash their workers too many times. In your case, if workers fail often, the crash counter of tasks will be increased quite a lot, which can then fail these tasks.
Could you try setting the crash limit of your tasks to 0, to see if it helps with the issue?
Hi,
the second limiter is fine (at least for me) - one hour is no problem. The situation I have observed is that nothing is submitted at all for days.
For the crash limit: all my jobs are submitted with --crash-limit=1, as re-running jobs causes us problems (data corruption, etc.). So I would like to avoid disabling this limit, as for us it is one of the main advantages of running jobs in HyperQueue.
Btw: v0.20.0 does not increase the crash counter when a worker is explicitly stopped via "hq worker stop", so a task may be restarted even with --crash-limit=1, and we were also thinking about introducing other situations in which the crash counter is not increased. Would it be useful for you to have a way to ensure that no restart occurs for any reason?
Well, I need to ensure that jobs run only once. The crash limit is the only way I know how to do that, but if there is another way, I could switch to that.
The crash counter is currently the only way, and if you are not using hq worker stop, then it is also the correct one.
However, our idea was that the crash counter is designed for detecting tasks that crash workers, and it may get slightly different semantics in the future. So I am collecting use cases, and in the future we will probably introduce something like --no-restart to support your use case. Thank you, and sorry for hijacking the discussion.
Could you please specify the time period in which the allocations weren't being created? I tried going through the log, and it seems that allocations were being submitted at a steady rate, e.g.:
[2024-10-03T11:42:27.541Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1363/hq-submit.sh`
[2024-10-03T11:42:27.553Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1364/hq-submit.sh`
[2024-10-03T11:42:27.817Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1365/hq-submit.sh`
[2024-10-03T11:42:27.830Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1366/hq-submit.sh`
It also seems that new workers were being added and tasks were being started periodically.
Well, when I submitted this ticket, I was referring to the time period between roughly 3:00 and 9:00 (when I made a new allocation queue) on 1 October 2024.
If it helps, I can provide another example. In this log:
https://www.fzu.cz/~svatosm/hq-debug-output.log.1
HQ has not submitted batch jobs since the middle of October.
It seems like the second allocation queue has indeed stopped submitting allocations, even though there are some waiting tasks. I have a theory about what causes the problem. The only remaining queue (queue 2) has a backlog of 1, therefore it won't consider adding new allocations while at least one allocation is queued. It seems that the 1795705 allocation was last reported as being queued, but then, when HQ tried to figure out its state, SLURM errored with this:
Could not get status of allocation 1795705: scontrol execution failed
Caused by:
Exit code: 1
Stderr: slurm_load_jobs error: Invalid job id specified
for some reason. Since SLURM never changes its output for this allocation (it also did this for a few other allocations, I'm not really sure why), HQ gets stuck: it never gets a valid response from SLURM, so it will forever think that this allocation is queued/running. We submitted this allocation, but no worker ever connected from it, and SLURM always returns an error for it... weird behavior, but we should be resilient against it anyway.
I will try to improve the logic so that when SLURM returns an error when getting the state of an allocation, we will consider that allocation to be lost.
Thanks, but would you also implement some kind of state caching to avoid losses on fluctuations? I mean, if the slurm commands were just briefly timing out, it would lose all jobs, wouldn't it?
Indeed, issues might be transient. I will try to experiment with using something like "count the errors, if it errored X times in a row, consider the allocation to be dead".
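The proposed "count the errors, consider the allocation dead after X errors in a row" idea can be sketched as follows (a hypothetical illustration, not HyperQueue's actual code; the names and threshold are made up):

```rust
/// Hypothetical sketch: only consider an allocation lost after its status
/// query has failed several consecutive times, so that transient scontrol
/// timeouts do not kill healthy allocations.
#[derive(Default)]
struct AllocationStatusTracker {
    consecutive_errors: u32,
}

/// Illustrative threshold; the real value would need tuning.
const MAX_CONSECUTIVE_ERRORS: u32 = 3;

enum StatusQuery {
    Ok,
    Error,
}

impl AllocationStatusTracker {
    /// Record the outcome of one status query and report whether the
    /// allocation should now be considered lost.
    fn record(&mut self, result: StatusQuery) -> bool {
        match result {
            StatusQuery::Ok => {
                // A successful query resets the error streak.
                self.consecutive_errors = 0;
                false
            }
            StatusQuery::Error => {
                self.consecutive_errors += 1;
                self.consecutive_errors >= MAX_CONSECUTIVE_ERRORS
            }
        }
    }
}

fn main() {
    let mut tracker = AllocationStatusTracker::default();
    // Two transient errors followed by a success: the allocation survives.
    assert!(!tracker.record(StatusQuery::Error));
    assert!(!tracker.record(StatusQuery::Error));
    assert!(!tracker.record(StatusQuery::Ok));
    // Three errors in a row: the allocation is considered lost.
    assert!(!tracker.record(StatusQuery::Error));
    assert!(!tracker.record(StatusQuery::Error));
    assert!(tracker.record(StatusQuery::Error));
    println!("tracker behaves as expected");
}
```

The key property is that a single "Invalid job id specified" response never drops an allocation on its own, while a persistent error (as in the logs above) eventually frees the backlog slot.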
Thanks. I assume this is now in the nightly builds. As I have a couple of other issues resolved, I would like to ask how far we are from version 0.21 (so I could get them all at once)?
My plan is to make the rc1 release tomorrow, and if no problems are reported, v0.21 will be out at the end of the week.
Version v0.21.0 released: https://github.com/It4innovations/hyperqueue/releases/tag/v0.21.0
Well, I need to ensure that jobs run only once. The crash limit is the only way I know how to do that, but if there is another way, I could switch to that.
#917 adds a new option, --crash-limit=never-restart (or just --crash-limit=n), that instructs HQ to never restart the task. --crash-limit=1 is similar, but it may still restart the task in cases where the crash counter is not increased (e.g. when you explicitly stop the worker, or when the worker's time limit is reached).
Excellent, thanks. Now, my usual question: should I use the nightly or wait for a new release?
The PR has been merged just now, so it will be available in tomorrow's nightly. At this moment we are not aware of any regressions, so it should be relatively safe to use the nightly.