
balance load over nodes

Open macmanes opened this issue 1 year ago • 4 comments

Using Toil from within Cactus with a Slurm scheduler.

I have 10 nodes available to me, each with 40 cores and 500 GB of RAM. If I submit 100 jobs, Toil sends them to only 3 nodes. In my case this is causing oom-kill issues. Is there a way to balance the load - to spread the 100 jobs evenly over the 10 available nodes?

Thanks in advance for any help available.

Issue is synchronized with this Jira Story. Issue Number: TOIL-1638

macmanes avatar Aug 29 '24 12:08 macmanes

Is there anything particular about the nodes not being assigned? For example, are they partitioned differently, or are they fulfilling a GPU requirement? Or are all nodes created equal? Toil does try to detect the overhead on the machines, but it might not be aware of some intensive background task either.

I'll look into some of the options for Slurm scaling. That sounds odd to me.

DailyDreaming avatar Sep 03 '24 17:09 DailyDreaming

Yes, all nodes and weights are created equally.

I'm wondering if LLN=YES as an argument to Slurm is what I want: https://slurm.schedmd.com/slurm.conf.html#OPT_LLN

macmanes avatar Sep 03 '24 17:09 macmanes

You can use the TOIL_SLURM_ARGS environment variable to add extra command line options to Toil's Slurm calls, but Toil doesn't specifically tell Slurm to pack jobs as tightly as it can into the fewest nodes. I think that might be Slurm's default behavior, since it is designed under the assumption that it is pretty common to want to reserve an entire node for a Slurm job.
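For reference, TOIL_SLURM_ARGS is just an environment variable read when the workflow is launched; the options in it get appended to Toil's sbatch submissions. A minimal sketch (the specific option values and paths below are placeholders, not a recommendation for this problem):

```
# Set before launching the workflow; these options are appended to Toil's sbatch calls.
export TOIL_SLURM_ARGS="--partition=general --nice=100"
cactus ./jobStore ./seqFile.txt ./out.hal --batchSystem slurm
```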

If your Slurm jobs are getting OOM-killed, are you sure the memory limits assigned to your jobs in Cactus are accurate? If they are too low, I think Slurm should detect that you are trying to go over them and OOM-kill your jobs, even if there is free memory on the node that is not allocated to any job.
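One way to check whether a job actually exceeded what it asked for, assuming Slurm accounting is enabled on your cluster (the job ID here is a placeholder):

```
# Compare requested vs. peak memory for a finished job; an OUT_OF_MEMORY state with
# MaxRSS close to ReqMem points at an under-request rather than a crowded node.
sacct -j 12345678 --format=JobID,JobName%20,State,ReqMem,MaxRSS,Elapsed
```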

adamnovak avatar Sep 11 '24 21:09 adamnovak

It looks like LLN is something a Slurm administrator would need to configure for a whole partition, not an option you can pass to sbatch. Toil's jobs aren't sent to Slurm as an array or as a single Slurm-level batch either, so options meant to spread e.g. different instances of an array job across different nodes won't help here.
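Roughly, the admin-side knob looks like this; a sketch of slurm.conf settings that only a cluster administrator can change, with made-up partition and node names:

```
# Cluster-wide: prefer the least-loaded node when placing a job.
SelectTypeParameters=CR_Core_Memory,CR_LLN
# Or per-partition:
PartitionName=general Nodes=node[01-10] LLN=YES Default=YES
```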

If your Toil jobs are large enough, you can add the --exclusive option to TOIL_SLURM_ARGS, so that each job will request an entire node to itself. But I don't think Cactus will run well like that; it likes to run a lot of small jobs.
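Concretely, that would be something like the following (whether it helps depends entirely on your job sizes):

```
# Each Toil job then asks Slurm for a whole node to itself.
export TOIL_SLURM_ARGS="--exclusive"
```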

adamnovak avatar Sep 11 '24 21:09 adamnovak