
Automatic allocation behavior clarification / feature request

Open Nortamo opened this issue 3 years ago • 3 comments

Running hyperqueue 0.12.0 (prebuilt binary downloaded from GitHub) with Slurm 21.08.7 on "Red Hat Enterprise Linux Server 7.9".

Is it intended behavior that an HQ job submission will trigger a new allocation even though there are sufficient resources (running workers) already available, or is this some sort of bug?

I.e. if I create an allocation queue with --workers-per-alloc 2 --backlog 1 --max-worker-count 100 and submit a job, I will get a new allocation each time I call hq submit. It's possible that I might have misconfigured something; I'm a bit uncertain about the backlog flag, and e.g. setting it to 0 leads to an allocation which does not start any jobs.

It might be a bit out of scope of what HQ is trying to do, but it would be extremely useful if it were possible for new workers to only start when there are more jobs/tasks than currently "active" resources. This way HQ could be used as an extremely dynamic scheduling engine sitting between the system batch scheduler and a very dynamic workflow manager. It would be kind of a perfect middle ground between workflow managers launching every task as a separate batch job and running HQ + a workflow manager inside a single batch allocation.

Nortamo avatar Aug 30 '22 13:08 Nortamo

Hi :) So there are a few things to unpack here.

It's possible that I might have misconfigured something; I'm a bit uncertain about the backlog flag, and e.g. setting it to 0 leads to an allocation which does not start any jobs.

Using a backlog of 0 should be a hard error; I will fix that, thanks.

I will get an allocation each time I call hq submit

This happens because submitting new tasks "wakes" the automatic allocator. But even if there were no submit, new allocations would probably be created after some time anyway.

Kind of a perfect middle ground between workflow managers launching every task as separate batch job and running HQ + workflow manager inside a single batch allocation.

This is definitely not out of scope for HQ, in fact this is one of the primary goals of HyperQueue, to be this kind of a tool. Automatic allocation is not required for that, but it's of course a big part of the ergonomics of using it.

However, being perfect here is very hard, since predicting the allocation requirements is basically an NP-hard problem where we have to somehow predict the future. The current implementation is actually quite basic, and doesn't do a lot of prediction, which is why it can sometimes allocate needlessly. On the other hand, there are situations where it's not really possible to do a better job.

Consider that we submit 100 tasks to an allocation queue that spawns 1-hour allocations, each containing a single worker with 10 cores. Now an allocation starts and I have to decide what to do next: a single worker is currently computing 10 tasks and there are 90 tasks left. What is the best thing to do here? If the tasks take 30 seconds, we can just leave the worker to compute them and not create any new allocations. But what if each task takes 10 minutes? Then we should create a new allocation, so that we can spawn new workers and finish the workload as fast as possible. Without knowing the task durations, this is basically unsolvable optimally.
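To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python (my own illustration, not part of HQ; the 5-minute queue wait for the second allocation is an assumed number, since in practice it is unknown) comparing the expected makespan of keeping the single worker versus requesting a second one:

import math

def makespan_single(tasks, cores, task_min):
    # Makespan with only the existing worker: tasks run in waves of `cores`.
    return math.ceil(tasks / cores) * task_min

def makespan_with_extra(tasks, cores, task_min, queue_wait_min):
    # Rough estimate with one extra worker that appears after `queue_wait_min`
    # minutes spent waiting in the Slurm queue (an assumed value).
    done_early = min(tasks, math.floor(queue_wait_min / task_min) * cores)
    remaining = tasks - done_early
    if remaining == 0:
        return makespan_single(tasks, cores, task_min)
    return queue_wait_min + math.ceil(remaining / (2 * cores)) * task_min

for task_min in (0.5, 10):  # 30-second tasks vs 10-minute tasks
    print(f"{task_min} min/task: "
          f"{makespan_single(100, 10, task_min)} min with one worker, "
          f"{makespan_with_extra(100, 10, task_min, queue_wait_min=5)} min with a second one")

With 30-second tasks the extra allocation buys nothing, while with 10-minute tasks it roughly halves the makespan, and the allocator has neither number available up front.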

That was just an example, but in general it's quite difficult to solve this problem optimally. Knowing the time limits and resource requirements of individual tasks would help, but there is still a lot of uncertainty in the job manager (PBS/Slurm) itself, as we do not know how long an allocation will sit in the queue. Maybe it will take several hours, and by then new tasks will have appeared, so perhaps it was a good idea to submit all those allocations in advance after all!

Our goal is to take more information about tasks into account (like task limits, resource requirements, etc.) and improve the heuristics of the automatic allocator. But there will always be cases where it will do the Bad Thing™. The current implementation is quite eager and tries to allocate a lot, to avoid starvation. If the allocations don't have anything to do, they will turn themselves off after 5 minutes.
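For intuition, below is a minimal sketch of what such an eager, backlog-driven policy could look like. This is my own simplified, hypothetical illustration, not HyperQueue's actual implementation; the field names are made up, and it deliberately ignores how much spare capacity the already-running workers have.

from dataclasses import dataclass

@dataclass
class QueueState:
    waiting_tasks: int        # tasks not yet running on any worker
    queued_allocations: int   # allocations submitted to Slurm but not yet started
    running_workers: int      # workers currently connected
    backlog: int              # desired number of allocations waiting in the queue
    workers_per_alloc: int
    max_worker_count: int

def allocations_to_submit(state: QueueState) -> int:
    # Eager policy: as long as any task is waiting, keep `backlog` allocations
    # queued, bounded by the worker limit. Surplus workers are cleaned up later
    # by the idle timeout rather than by this decision.
    if state.waiting_tasks == 0:
        return 0
    missing = max(0, state.backlog - state.queued_allocations)
    worker_budget = (state.max_worker_count
                     - state.running_workers
                     - state.queued_allocations * state.workers_per_alloc)
    return max(0, min(missing, worker_budget // state.workers_per_alloc))

Because this sketch never checks whether the running workers could absorb the waiting tasks themselves, a single freshly submitted task is enough to trigger a new allocation; taking idle capacity into account is exactly the kind of refinement that would avoid the behaviour reported below.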

We can try to debug your specific situation if you can send us more information about your use case: how many tasks you submitted, how many workers (and with what resources) were online when the unexpected allocations were submitted, what the other parameters of the allocation queue were (e.g. time limit), etc.

Kobzol avatar Aug 30 '22 14:08 Kobzol

Thanks for the very detailed clarification.

Create an allocation queue with hq alloc add slurm --cpus=2x20 --time-limit 4h --workers-per-alloc 2 --backlog 1 --max-worker-count 10 -- -A <slurm_account> -c 40 -p large

Then submit a single job with hq submit --log=/dev/null --time-limit=10min --cpus=1 --resource mem=1 ls. This will start 2 Slurm jobs / 4 workers, with the state after job completion being:

$ ./hq job list --all
+----+------+----------+-------+
| ID | Name | State    | Tasks |
+----+------+----------+-------+
|  1 | ls   | FINISHED | 1     |
+----+------+----------+-------+
$ ./hq worker list
+----+---------+--------------+---------------------------+---------+----------------+
| ID | State   | Hostname     | Resources                 | Manager | Manager Job ID |
+----+---------+--------------+---------------------------+---------+----------------+
|  1 | RUNNING | r05c06.bullx | 2x20 cpus; mem 187.06 GiB | SLURM   | 13051337       |
|  2 | RUNNING | r05c07.bullx | 2x20 cpus; mem 187.06 GiB | SLURM   | 13051337       |
|  3 | RUNNING | r13c44.bullx | 2x20 cpus; mem 187.06 GiB | SLURM   | 13051338       |
|  4 | RUNNING | r13c45.bullx | 2x20 cpus; mem 187.06 GiB | SLURM   | 13051338       |
+----+---------+--------------+---------------------------+---------+----------------+

Submitting an identical job again with hq submit --log=/dev/null --time-limit=10min --cpus=1 --resource mem=1 ls spawns an additional allocation, and each subsequent submission spawns a new allocation even if there are no jobs running and all the workers are free.

Job submitted successfully, job ID: 2
$ 2022-08-31T10:25:29Z INFO Queued 2 worker(s) into queue 1: 13051419
2022-08-31T10:25:30Z INFO Worker 5 registered from 10.140.3.105:55106
2022-08-31T10:25:30Z INFO Worker 6 registered from 10.140.3.106:58848
$ ./hq worker list
+----+---------+--------------+---------------------------+---------+----------------+
| ID | State   | Hostname     | Resources                 | Manager | Manager Job ID |
+----+---------+--------------+---------------------------+---------+----------------+
|  1 | RUNNING | r05c06.bullx | 2x20 cpus; mem 187.06 GiB | SLURM   | 13051337       |
|  2 | RUNNING | r05c07.bullx | 2x20 cpus; mem 187.06 GiB | SLURM   | 13051337       |
|  3 | RUNNING | r13c44.bullx | 2x20 cpus; mem 187.06 GiB | SLURM   | 13051338       |
|  4 | RUNNING | r13c45.bullx | 2x20 cpus; mem 187.06 GiB | SLURM   | 13051338       |
|  5 | RUNNING | r15c36.bullx | 2x20 cpus; mem 187.06 GiB | SLURM   | 13051339       |
|  6 | RUNNING | r15c37.bullx | 2x20 cpus; mem 187.06 GiB | SLURM   | 13051339       |
+----+---------+--------------+---------------------------+---------+----------------+

Nortamo avatar Aug 31 '22 10:08 Nortamo

Ok, this definitely sounds suspicious and unintended. I will try to simulate this situation and see if I can fix it.

Kobzol avatar Aug 31 '22 11:08 Kobzol

I implemented a change that should fix the behaviour in this situation. However, we will need to make larger changes to the autoallocator to make it more robust and "smarter", to resolve similar situations. These changes should hopefully appear in 0.13.

Kobzol avatar Sep 07 '22 09:09 Kobzol