
Support multi-node tasks

giovannipizzi opened this issue on Dec 11, 2021 · 5 comments

(Note: maybe I'm completely misunderstanding how multiple allocs are supposed to be used - so please tell me if this can be solved in a different way!)

I have two use cases and I didn't understand whether they are already supported (or whether they already work in some way I'm not aware of).

  1. On some of our supercomputers at the Swiss supercomputing centre CSCS, we have two partitions, one CPU-only and one with GPUs. Therefore, I'd like to start two allocs, one for each (e.g. -C mc vs. -C gpu), give each of them a name, and when I submit a job to hq, specify on which of the two it should run. Is this already possible? (The docs state that the name is only used for debugging purposes.)

  2. If I understand correctly, when I start the alloc I specify all parameters, including how many nodes each worker should have. Is there a way to automatically start workers with different numbers of nodes? E.g. on a machine with 128-core nodes, I might want jobs with 16, 32, 64 or 128 cores to run in workers that use 1 node; but I might also have jobs that I want to run on 2 or 4 nodes, and for these I'd like to also have some workers that span 2 or 4 nodes. At the same time, I don't want the jobs with <= 128 cores to run on these workers, but only on the 1-node workers. Can this already be achieved by creating 3 allocs with 1, 2 and 4 nodes respectively? I tried this but saw some strange results that I couldn't debug properly or test extensively yet: I think I had only the 1-node alloc at first, I submitted a 2-node job to hq, and hq started submitting workers; when they entered the SLURM queue it realized they weren't suitable for the job and it started submitting more. I then created a 2-node alloc as well, but it didn't enter the queue for quite some time because of the machine's priorities, and in the meantime I kept having 1-node workers being submitted, starting, and dying after the 5-minute idle timeout, with new ones getting launched again. For this use case, I would either need something like point 1 above (so I can decide to send a job only to one specific alloc and avoid that the incompatible ones keep starting, timing out and restarting), or some automatic way for hq to understand how many nodes a job requires and send it to the correct alloc.

giovannipizzi commented on Dec 11, 2021

  1. This is definitely a valid use case and we intend to support it. Currently, this can be achieved in a semi-manual way.

    You can assign resources to specific workers and jobs, e.g. you can say that worker A has the GPU resource and that job 1 needs 1 GPU in order to execute. Currently only numeric resources are implemented, while a boolean one (has / doesn't have a GPU) might be more natural here. We will introduce a CLI shortcut for such boolean resources.

    However, currently you can only specify these resources when starting a worker manually (see the sketch after this list). We also intend to let the user choose the resources when using automatic allocation.

  2. I'm not sure whether you're talking about multi-node PBS/Slurm allocations or multi-node MPI computations here. HyperQueue currently does not support multi-node tasks, which means that a single task (currently a single executed binary) will always run on a single worker on a single node. It could in theory start a multi-node MPI computation, but this will not work as expected: even if the allocation had multiple nodes, the workers on the other nodes would not know about this computation, and it would most probably result in over-subscription.

    On the other hand, you can tell the auto allocator to create PBS/Slurm allocations with multiple nodes, but all it will do is start a single independent worker on each allocated node. It's basically just a tuning parameter that can be used e.g. if you know that it will improve queue wait times or otherwise work around some PBS/Slurm limits. Otherwise HQ will just queue N single-node allocations in order to start the allocations as soon as possible.
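To make the semi-manual route from point 1 concrete, here is a rough sketch written as a small Python wrapper around the CLI. The script name and resource amounts are placeholders, and the exact `--resource` / `--workers-per-alloc` syntax may differ between HQ versions, so treat this as an illustration rather than a canonical invocation:

```python
# Rough sketch, not a canonical invocation: flag spellings follow the HQ docs
# as I remember them and may differ between versions.
import subprocess

# 1) A worker started manually inside a GPU allocation, advertising a "gpus"
#    resource.  This call blocks for the lifetime of the worker, so in
#    practice it would live inside the allocation's batch script.
worker = subprocess.Popen(
    ["hq", "worker", "start", "--resource", "gpus=range(1-2)"]
)

# 2) A job that will only be scheduled on workers providing at least one GPU.
#    "./my_gpu_job.sh" is a placeholder script name.
subprocess.run(
    ["hq", "submit", "--resource", "gpus=1", "./my_gpu_job.sh"],
    check=True,
)

# 3) Multi-node *allocations* (not multi-node tasks) via the auto allocator:
#    each allocated node still gets its own independent single-node worker.
#    Trailing arguments after "--" are passed to sbatch (here, the GPU
#    constraint mentioned above).
subprocess.run(
    ["hq", "alloc", "add", "slurm",
     "--time-limit", "1h",
     "--workers-per-alloc", "2",
     "--", "--constraint=gpu"],
    check=True,
)
```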

Kobzol commented on Dec 13, 2021

@giovannipizzi In general, multi-node ~~allocations~~ tasks are on our roadmap and we definitely want to include them in HQ, but they do not have top priority for us right now.

spirali commented on Dec 13, 2021

@giovannipizzi I am sorry, I used the wrong term in my previous comment. We do support multi-node (PBS/SLURM) allocations. What we do not support right now is multi-node tasks.

spirali commented on Dec 13, 2021

Ah, I see! I haven't managed to test multi-node allocations yet (I had queued a few but then stopped them because they didn't enter the queue as quickly as I had wanted ;-) ). So I didn't realise that a 2-node alloc starts two 1-node workers in it.

Do I therefore understand correctly that, at the moment, the best approach is to have a filter before submitting: if it's a 1-node job (or smaller than one node), submit it to hq; if it's larger, submit it directly to the scheduler (but then it clearly won't benefit from running multiple shorter jobs within the same SLURM job)?
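Something like the following is what I have in mind as a filter (a rough sketch with placeholder names; I'm assuming `hq submit --cpus` is the right way to request cores, and using the 128-cores-per-node figure from above):

```python
#!/usr/bin/env python3
# Sketch of a submission filter: jobs that fit on a single node go to HQ,
# larger jobs go straight to Slurm.  The job script name is a placeholder;
# adapt the threshold to your machine.
import subprocess
import sys

CORES_PER_NODE = 128  # e.g. the multicore partition mentioned above


def submit(job_script: str, num_cores: int) -> None:
    if num_cores <= CORES_PER_NODE:
        # Fits on one node: let HyperQueue pack it onto an existing worker.
        cmd = ["hq", "submit", "--cpus", str(num_cores), job_script]
    else:
        # Needs more than one node: bypass HQ and submit to Slurm directly.
        num_nodes = -(-num_cores // CORES_PER_NODE)  # ceiling division
        cmd = ["sbatch", f"--nodes={num_nodes}", job_script]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    submit(sys.argv[1], int(sys.argv[2]))
```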

For now I think this is enough, but I'd be happy to see this use case covered in the future too, if it's of interest to you and others - so we can just submit everything to hq.

giovannipizzi commented on Dec 13, 2021

We want to support "everything in hq", but so far you are right: up to 1-node tasks go to HQ, and multi-node tasks go directly into SLURM/PBS.

spirali commented on Dec 14, 2021