
HyperQueue submission gets stuck if workers fail

Open svatosFZU opened this issue 1 year ago • 11 comments

I have been observing this situation for some time now. HyperQueue works fine for a while, and then something happens: jobs arrive at HyperQueue and are buffered there, but no workers are submitted, even though the allocation queue has its backlog set to 10. The situation from this morning:

  • there are no jobs in the batch system:
$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  • the list of workers confirms that:
$ hq-v0.19.0-linux-x64/hq worker list
+----+-------+----------+-----------+---------+----------------+
| ID | State | Hostname | Resources | Manager | Manager Job ID |
+----+-------+----------+-----------+---------+----------------+
  • there are enough jobs buffered:
$ hq-v0.19.0-linux-x64/hq job list
+------+------+---------+-------+
| ID   | Name | State   | Tasks |
+------+------+---------+-------+
| 4565 | bash | WAITING | 1     |
| 4567 | bash | WAITING | 1     |
| 4573 | bash | WAITING | 1     |
| 4584 | bash | WAITING | 1     |
| 4596 | bash | WAITING | 1     |
| 4597 | bash | WAITING | 1     |
| 4600 | bash | WAITING | 1     |
| 4609 | bash | WAITING | 1     |
| 4618 | bash | WAITING | 1     |
| 4626 | bash | WAITING | 1     |
| 4634 | bash | WAITING | 1     |
| 4635 | bash | WAITING | 1     |
| 4636 | bash | WAITING | 1     |
| 4637 | bash | WAITING | 1     |
| 4638 | bash | WAITING | 1     |
| 4639 | bash | WAITING | 1     |
| 4640 | bash | WAITING | 1     |
| 4641 | bash | WAITING | 1     |
| 4642 | bash | WAITING | 1     |
| 4644 | bash | WAITING | 1     |
| 4645 | bash | WAITING | 1     |
| 4648 | bash | WAITING | 1     |
| 4651 | bash | WAITING | 1     |
| 4652 | bash | WAITING | 1     |
| 4653 | bash | WAITING | 1     |
| 4654 | bash | WAITING | 1     |
| 4655 | bash | WAITING | 1     |
| 4656 | bash | WAITING | 1     |
| 4657 | bash | WAITING | 1     |
| 4659 | bash | WAITING | 1     |
| 4660 | bash | WAITING | 1     |
| 4663 | bash | WAITING | 1     |
| 4664 | bash | WAITING | 1     |
| 4666 | bash | WAITING | 1     |
| 4668 | bash | WAITING | 1     |
| 4670 | bash | WAITING | 1     |
| 4674 | bash | WAITING | 1     |
| 4677 | bash | WAITING | 1     |
| 4679 | bash | WAITING | 1     |
| 4680 | bash | WAITING | 1     |
| 4682 | bash | WAITING | 1     |
| 4683 | bash | WAITING | 1     |
| 4684 | bash | WAITING | 1     |
| 4685 | bash | WAITING | 1     |
| 4688 | bash | WAITING | 1     |
| 4689 | bash | WAITING | 1     |
| 4690 | bash | WAITING | 1     |
| 4691 | bash | WAITING | 1     |
| 4692 | bash | WAITING | 1     |
| 4693 | bash | WAITING | 1     |
| 4694 | bash | WAITING | 1     |
| 4695 | bash | WAITING | 1     |
| 4696 | bash | WAITING | 1     |
| 4697 | bash | WAITING | 1     |
| 4698 | bash | WAITING | 1     |
| 4699 | bash | WAITING | 1     |
| 4700 | bash | WAITING | 1     |
| 4701 | bash | WAITING | 1     |
| 4702 | bash | WAITING | 1     |
| 4703 | bash | WAITING | 1     |
| 4704 | bash | WAITING | 1     |
| 4705 | bash | WAITING | 1     |
| 4706 | bash | WAITING | 1     |
| 4707 | bash | WAITING | 1     |
| 4708 | bash | WAITING | 1     |
| 4709 | bash | WAITING | 1     |
| 4710 | bash | WAITING | 1     |
+------+------+---------+-------+
There are 4710 jobs in total. Use `--all` to display all jobs.
  • the allocation queue is set up with a backlog of 10 and it is running:
$ hq-v0.19.0-linux-x64/hq alloc list
+----+---------+--------------+-------------------+------------------+-----------+---------+------+----------------------------------+
| ID | State   | Backlog size | Workers per alloc | Max worker count | Timelimit | Manager | Name | Args                             |
+----+---------+--------------+-------------------+------------------+-----------+---------+------+----------------------------------+
| 1  | RUNNING | 10           | 1                 | 0                | 12h       | SLURM   |      | -AOPEN-29-6,-pqcpu_preempt,-c128 |
+----+---------+--------------+-------------------+------------------+-----------+---------+------+----------------------------------+
  • the debug log shows that HyperQueue cannot find the allocations it submitted:
[2024-10-01T06:07:58.724Z DEBUG hyperqueue::server::autoalloc::process] Attempt to refresh allocations of queue 1
[2024-10-01T06:07:58.724Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739716`
[2024-10-01T06:07:58.734Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739715`
[2024-10-01T06:07:58.743Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739718`
[2024-10-01T06:07:58.753Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739717`
[2024-10-01T06:07:58.762Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739711`
[2024-10-01T06:07:58.771Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739702`
[2024-10-01T06:07:58.779Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739704`
[2024-10-01T06:07:58.787Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739710`
[2024-10-01T06:07:58.796Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739700`
[2024-10-01T06:07:58.804Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739712`
[2024-10-01T06:07:58.812Z DEBUG hyperqueue::server::autoalloc::process] Allocations of 1 have been refreshed: Ok(Map({"1739710": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739717": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739711": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739700": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739715": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739704": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739718": Err(scontrol execution failed
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739702": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739712": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739716": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: )}))
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739716: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739715: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739718: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739717: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739711: scontrol execution failed
 
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739702: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739704: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739710: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739700: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739712: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z DEBUG hyperqueue::server::autoalloc::process] Task state: ServerTaskState { jobs: Map({JobId(4663): 1, JobId(4596): 1, JobId(4657): 1, JobId(4651): 1, JobId(4584): 1, JobId(4645): 1, JobId(4706): 1, JobId(4639): 1, JobId(4700): 1, JobId(4694): 1, JobId(4688): 1, JobId(4682): 1, JobId(4609): 1, JobId(4670): 1, JobId(4664): 1, JobId(4597): 1, JobId(4652): 1, JobId(4707): 1, JobId(4640): 1, JobId(4573): 1, JobId(4634): 1, JobId(4567): 1, JobId(4701): 1, JobId(4695): 1, JobId(4689): 1, JobId(4683): 1, JobId(4677): 1, JobId(4659): 1, JobId(4653): 1, JobId(4708): 1, JobId(4641): 1, JobId(4702): 1, JobId(4635): 1, JobId(4696): 1, JobId(4690): 1, JobId(4684): 1, JobId(4666): 1, JobId(4660): 1, JobId(4654): 1, JobId(4648): 1, JobId(4709): 1, JobId(4642): 1, JobId(4703): 1, JobId(4636): 1, JobId(4697): 1, JobId(4691): 1, JobId(4685): 1, JobId(4618): 1, JobId(4679): 1, JobId(4600): 1, JobId(4655): 1, JobId(4710): 1, JobId(4704): 1, JobId(4637): 1, JobId(4698): 1, JobId(4692): 1, JobId(4680): 1, JobId(4674): 1, JobId(4668): 1, JobId(4656): 1, JobId(4644): 1, JobId(4705): 1, JobId(4638): 1, JobId(4699): 1, JobId(4565): 1, JobId(4626): 1, JobId(4693): 1}) }, max. workers to spawn: 18446744073709551615
  • checking the details of one job (1739716), I see it failed quickly:
$ sacct -j 1739716 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,Reqmem,nnodes,ncpus,nodelist
     User JobID           JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize     ReqMem   NNodes      NCPUS        NodeList 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- ---------- -------- ---------- --------------- 
  svatosm 1739716       hq-1-2221 qcpu_pree+     FAILED   12:00:00 2024-09-30T07:10:19 2024-09-30T07:10:20   00:00:01                             230G        1        128           cn123 
          1739716.bat+      batch                FAILED            2024-09-30T07:10:19 2024-09-30T07:10:20   00:00:01          0      2136K                   1        128           cn123 
          1739716.ext+     extern             COMPLETED            2024-09-30T07:10:19 2024-09-30T07:10:20   00:00:01          0        52K                   1        128           cn123
  • According to the stderr, the job failed so quickly because of Permission Denied errors on /tmp (i.e. something was wrong with the node). But this should not stop/block the submission forever. Creating a new allocation queue makes the submission start instantly.

svatosFZU avatar Oct 01 '24 07:10 svatosFZU

Hi, sorry for the late reply, I'm quite busy at the moment. The automatic allocator has an internal rate limiter that can stop creating allocations if too many failures occur. I'm not sure if that's what is causing this issue, though. Are you still using HQ_AUTOALLOC_MAX_ALLOCATION_FAILS?

Kobzol avatar Oct 09 '24 19:10 Kobzol

Yes, the HQ server is started by

RUST_LOG=hyperqueue=debug HQ_AUTOALLOC_MAX_ALLOCATION_FAILS=100000 /home/svatosm/hq-v0.19.0-linux-x64/hq server start --journal /home/svatosm/hqJournal 2> /home/svatosm/hq-debug-output.log &

svatosFZU avatar Oct 09 '24 19:10 svatosFZU

Could you please send me the full log?

Kobzol avatar Oct 17 '24 06:10 Kobzol

Sure, log from that time period is at https://www.fzu.cz/~svatosm/hq-debug-output.log

svatosFZU avatar Oct 17 '24 06:10 svatosFZU

Thank you. It indeed looks like at least one of the issues is rate limiting. In addition to the crash count, there is a second rate limiter that kicks in after too many allocation or submission failures in a row. This rate limiter then blocks further submissions until one hour has elapsed. This is clearly not ideal for the case of pre-emptible allocations.
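The behavior described above could be sketched roughly like this (hypothetical types and constants, not HyperQueue's actual internals):

```rust
use std::time::{Duration, Instant};

/// Sketch of a "pause submissions after N consecutive failures" limiter.
struct RateLimiter {
    consecutive_failures: u32,
    max_failures: u32,
    paused_until: Option<Instant>,
}

impl RateLimiter {
    /// Submissions are allowed unless we are inside the pause window.
    fn can_submit(&self, now: Instant) -> bool {
        match self.paused_until {
            Some(until) => now >= until,
            None => true,
        }
    }

    /// Record the outcome of one allocation/submission attempt.
    fn on_result(&mut self, success: bool, now: Instant) {
        if success {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= self.max_failures {
                // Block further submissions for one hour.
                self.paused_until = Some(now + Duration::from_secs(3600));
            }
        }
    }
}

fn main() {
    let now = Instant::now();
    let mut limiter = RateLimiter {
        consecutive_failures: 0,
        max_failures: 3,
        paused_until: None,
    };
    // Three failures in a row trigger the one-hour pause.
    for _ in 0..3 {
        limiter.on_result(false, now);
    }
    assert!(!limiter.can_submit(now));
    // Once the hour has elapsed, submissions resume.
    assert!(limiter.can_submit(now + Duration::from_secs(3601)));
}
```

With pre-emptible allocations, failures in a row are expected rather than exceptional, which is why this scheme can starve the queue.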

One additional issue that I noticed in the log is that some of your tasks are hitting the crash limit. This is used to prevent re-running tasks that might crash their workers too many times. In your case, if workers fail often, the crash counter of tasks will be increased quite a lot, which can then fail these tasks.

Could you try setting the crash limit of your tasks to 0, to see if it helps with the issue?
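For illustration, the crash-limit mechanism described above behaves roughly like this simplified sketch (hypothetical names, not HyperQueue's actual code):

```rust
/// Simplified model of a task's crash accounting.
struct Task {
    crash_counter: u32,
    /// Value of --crash-limit=N; 0 disables the check entirely.
    crash_limit: u32,
}

/// Called when the worker executing this task disappears.
/// Returns true if the task should be marked FAILED instead of rescheduled.
fn on_worker_lost(task: &mut Task) -> bool {
    task.crash_counter += 1;
    task.crash_limit != 0 && task.crash_counter >= task.crash_limit
}

fn main() {
    // With --crash-limit=1, the very first worker failure fails the task.
    let mut strict = Task { crash_counter: 0, crash_limit: 1 };
    assert!(on_worker_lost(&mut strict));

    // With the limit disabled (0), the task is always rescheduled.
    let mut lenient = Task { crash_counter: 0, crash_limit: 0 };
    assert!(!on_worker_lost(&mut lenient));
}
```

This is why frequent worker failures (e.g. pre-empted allocations) can push otherwise healthy tasks over the limit.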

Kobzol avatar Oct 19 '24 07:10 Kobzol

Hi, the second limiter is fine (at least for me) - one hour is no problem. The situation I have observed is that nothing is submitted at all for days. As for the crash limit, all my jobs are submitted with --crash-limit=1, since re-running jobs causes us problems (data corruption, etc.). So I would like to avoid disabling this limit; for us, it is one of the main advantages of running jobs in HyperQueue.

svatosFZU avatar Oct 20 '24 17:10 svatosFZU

Btw: v0.20.0 does not increase the crash counter when a worker is explicitly stopped via `hq worker stop`, so a task may be restarted even with --crash-limit=1, and we were also thinking about introducing other situations in which the crash counter is not increased. Would it be useful for you to have a way to ensure that no restart occurs, for whatever reason?

spirali avatar Oct 20 '24 20:10 spirali

Well, I need to ensure that jobs run only once. The crash limit is the only way I know how to do that, but if there is another way, I could switch to that.

svatosFZU avatar Oct 21 '24 05:10 svatosFZU

The crash counter is now the only way, and if you are not using `hq worker stop`, then it is also the correct one. However, our idea was that the crash counter is designed for detecting tasks that crash workers, and it may have slightly different semantics in the future. So I am collecting use cases, and in the future we will probably introduce something like --no-restart to support your use case. Thank you, and sorry for hijacking the discussion.

spirali avatar Oct 21 '24 06:10 spirali

Could you please specify the time period in which the allocations weren't being created? I tried going through the log, and it seems that the allocations were being submitted at a steady rate, e.g.:

[2024-10-03T11:42:27.541Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1363/hq-submit.sh`
[2024-10-03T11:42:27.553Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1364/hq-submit.sh`
[2024-10-03T11:42:27.817Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1365/hq-submit.sh`
[2024-10-03T11:42:27.830Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1366/hq-submit.sh`

It also seems that new workers were being added and tasks were being started periodically.

Kobzol avatar Oct 21 '24 11:10 Kobzol

Well, when I submitted this ticket, I was referring to the time period between roughly 3:00 and 9:00 (when I made a new allocation queue) on 1 October 2024.

svatosFZU avatar Oct 21 '24 19:10 svatosFZU

If it helps, I can provide another example. In this log:

https://www.fzu.cz/~svatosm/hq-debug-output.log.1

HQ has not submitted any batch jobs since the middle of October.

svatosFZU avatar Nov 08 '24 13:11 svatosFZU

It seems that the second allocation queue has indeed stopped submitting allocations, even though there are some waiting tasks. I have a theory about what causes the problem. The only remaining queue (queue 2) has a backlog of 1, so it won't consider adding new allocations while at least one allocation is queued. It seems that allocation 1795705 was last reported as being queued, but then, when HQ tried to figure out its state, SLURM kept failing with:

Could not get status of allocation 1795705: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified

for some reason. Since SLURM's output for this allocation never changes (it also did this for a few other allocations, not really sure why), HQ gets stuck: it never gets a usable response from SLURM, so it will forever think that this allocation is queued/running. We submitted this allocation, but no worker ever connected from it, and SLURM always returns an error for it. Weird behavior, but we should be resilient against it anyway.

I will try to improve the logic so that when SLURM returns an error when getting the state of an allocation, we will consider that allocation to be lost.
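The backlog interaction described above can be sketched as follows (hypothetical types, not HyperQueue's actual implementation): with a backlog of 1, a single allocation stuck in the "queued" state blocks all further submissions.

```rust
#[derive(PartialEq)]
#[allow(dead_code)]
enum AllocStatus {
    Queued,
    Running,
    Finished,
    Failed,
}

/// How many new allocations the autoallocator should submit:
/// the backlog minus allocations it still believes are queued.
fn allocations_to_submit(backlog: usize, allocations: &[AllocStatus]) -> usize {
    let queued = allocations
        .iter()
        .filter(|s| **s == AllocStatus::Queued)
        .count();
    backlog.saturating_sub(queued)
}

fn main() {
    use AllocStatus::*;
    // backlog = 1 and one allocation stuck in Queued: nothing is submitted.
    assert_eq!(allocations_to_submit(1, &[Queued, Failed]), 0);
    // Once the stuck allocation is considered lost, submission resumes.
    assert_eq!(allocations_to_submit(1, &[Failed, Failed]), 1);
}
```

Treating a permanently erroring allocation as lost (instead of queued) frees the backlog slot again.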

Kobzol avatar Jan 10 '25 11:01 Kobzol

Thanks, but would you also implement some kind of state caching to avoid losses on fluctuations? I mean, if the SLURM commands were just briefly timing out, it would lose all jobs, wouldn't it?

svatosFZU avatar Jan 10 '25 11:01 svatosFZU

Indeed, the issues might be transient. I will try to experiment with something like "count the errors, and if it errored X times in a row, consider the allocation to be dead".
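The error-counting idea could be sketched like this (hypothetical names and threshold, not the actual HyperQueue code):

```rust
/// Hypothetical threshold: how many consecutive failed status queries
/// are tolerated before an allocation is considered lost.
const MAX_CONSECUTIVE_ERRORS: u32 = 3;

struct AllocationWatch {
    consecutive_errors: u32,
}

impl AllocationWatch {
    /// Feed in the result of one `scontrol show job` refresh.
    /// Returns true once the allocation should be considered lost.
    fn on_refresh(&mut self, status_ok: bool) -> bool {
        if status_ok {
            // A successful query resets the counter, so transient
            // SLURM hiccups do not kill the allocation.
            self.consecutive_errors = 0;
            false
        } else {
            self.consecutive_errors += 1;
            self.consecutive_errors >= MAX_CONSECUTIVE_ERRORS
        }
    }
}

fn main() {
    let mut watch = AllocationWatch { consecutive_errors: 0 };
    assert!(!watch.on_refresh(false)); // 1st error: keep waiting
    assert!(!watch.on_refresh(true));  // success resets the counter
    assert!(!watch.on_refresh(false));
    assert!(!watch.on_refresh(false));
    assert!(watch.on_refresh(false));  // 3rd error in a row: lost
}
```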

Kobzol avatar Jan 10 '25 11:01 Kobzol

Thanks. I assume this is now in the nightly builds. As I have a couple of other issues resolved there, I would like to ask how far we are from version 0.21 (so I could get them all at once)?

svatosFZU avatar Jan 13 '25 09:01 svatosFZU

My plan is to make the rc1 release tomorrow, and if no problem is reported, then v0.21 will be out at the end of the week.

spirali avatar Jan 13 '25 10:01 spirali

Version v0.21.0 released: https://github.com/It4innovations/hyperqueue/releases/tag/v0.21.0

spirali avatar Jan 16 '25 09:01 spirali

> Well, I need to ensure that jobs run only once. The crash limit is the only way I know how to do that but if there is other way, I could switch to that.

#917 adds a new option --crash-limit=never-restart (or just --crash-limit=n) that instructs HQ to never restart the task. --crash-limit=1 is similar, but it may restart the task in cases where the crash counter is not increased (e.g. when you explicitly stop the worker, or when the worker's time limit is reached).

spirali avatar May 30 '25 13:05 spirali

Excellent, thanks. Now, my usual question: should I use the nightly build or wait for a new release?

svatosFZU avatar Jun 02 '25 08:06 svatosFZU

The PR has been merged just now, so it will be available in tomorrow's nightly. At this moment we are not aware of any regressions, so it should be relatively safe to use nightly.

spirali avatar Jun 02 '25 12:06 spirali