
`tasks_per_node=1` does not keep the number of tasks to 1 for the `LocalExecutor`

Open · ihowell opened this issue 2 years ago · 4 comments

The expected behavior of this parameter setting when using the LocalExecutor (or, in my case, the AutoExecutor on a non-Slurm node) would be to keep the number of spawned processes to 1. I use executor.batch() to perform a delayed batch submission, which then spawns a process for each job and quickly overwhelms my computer.

The issue seems to be that a controller process is spawned per job: https://github.com/facebookincubator/submitit/blob/main/submitit/local/local.py#L163 Each controller process immediately spawns and runs its job instead of checking whether the number of running controllers is below the allowed number of tasks.

ihowell · Apr 28 '22

Hi @ihowell, I have the same issue. Did you find any solution? Regards

Edit: I was running the tasks from within another repo, assuming it would pass the right parameters. Running the tasks as shown in the examples (see the snippet below) solved my struggle... -.-

import submitit

executor = submitit.AutoExecutor(folder="submitit_logs")  # example setup; pick your own log folder

jobs = []
with executor.batch():  # submissions are collected and sent when the block exits
    for arg in whatever:  # `whatever`: your iterable of arguments
        job = executor.submit(myfunc, arg)  # `myfunc`: your callable
        jobs.append(job)
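
Note that with executor.batch() nothing is actually submitted until the with block exits; the jobs are then launched together, and each returned job object can later be polled with job.done() or waited on with job.result().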

chirico85 · May 03 '22

In general the LocalExecutor has fewer features than the SlurmExecutor, and indeed if you start 100 jobs with the LocalExecutor they will all run at once, regardless of the hardware requested or the hardware available on your machine. In short, we haven't implemented a queue for the LocalExecutor. This is a major footgun, but also not something easily fixable; I will need to think about how to implement it. For example, I feel we would like to spawn the subprocess as soon as possible so we can return a process id that serves as a job id, yet still make sure the jobs actually start one after the other.

Personally I often use the DebugExecutor, which runs exactly one job at a time in the current process.
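
Until a proper local queue exists, one possible workaround is to throttle submissions yourself. A minimal sketch (not part of submitit; myfunc, the folder name, and the workload are placeholders) that submits in small chunks and waits for each chunk before launching the next:

import submitit

MAX_CONCURRENT = 4  # hypothetical cap, e.g. the number of cores on the machine
executor = submitit.AutoExecutor(folder="submitit_logs", cluster="local")  # force the local executor

args = list(range(100))  # placeholder workload
results = []
for start in range(0, len(args), MAX_CONCURRENT):
    chunk = args[start:start + MAX_CONCURRENT]
    with executor.batch():
        jobs = [executor.submit(myfunc, a) for a in chunk]
    # block until this chunk has finished before spawning the next one
    results.extend(job.result() for job in jobs)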

gwenzek · May 03 '22

Thanks for the tip. I would, however, like to be able to run, say, 4 jobs at once (the number of cores on my machine). Maybe we could use the multiprocessing library instead of subprocesses? I believe that would let us use a semaphore to cap concurrency while still returning a job construct with a process id.
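
Roughly, the idea would be to start every job's process immediately, so a process id is available right away, while a shared semaphore lets at most N job bodies run at once. A standalone sketch of that pattern (illustration only, not submitit code; all names are made up):

import multiprocessing as mp

def _gated(sem, fn, arg):
    with sem:  # at most `max_workers` job bodies execute at a time
        fn(arg)

def submit_all(fn, args, max_workers=4):
    sem = mp.Semaphore(max_workers)
    procs = []
    for arg in args:
        p = mp.Process(target=_gated, args=(sem, fn, arg))
        p.start()  # the process (and its pid) exists immediately
        procs.append(p)
    for p in procs:
        p.join()
    return procs

Results would still need to be shipped back separately (e.g. via pickled files, which is how submitit already returns results from its jobs).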

ihowell · May 04 '22

Hi! Are there any updates on this? Is it solved? The same thing happens for me when using the Slurm launcher in Hydra on clusters!

alirezakazemipour · Nov 07 '23