
How to specify GPUs when executing locally?


I've successfully used submitit to submit jobs to our SLURM cluster, and overall the library works great.

However, I often need to work locally as well, and in those situations I would like to control which GPUs are visible to the local executor, similar to setting CUDA_VISIBLE_DEVICES on the command line.

I took a look at the source code for LocalExecutor and found the visible_gpus parameter. However, when I create a local executor via AutoExecutor and try to set visible_gpus through update_parameters, I encounter an error:

In [39]: ex = st.AutoExecutor(folder='/tmp/testfolder', cluster='local')

In [40]: ex.update_parameters(visible_gpus=[0,1])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
[...]

NameError: Unknown executor 'visible' in parameter 'visible_gpus'.
Known executors: slurm, local, debug
As a reminder, shared/generic (non-prefixed) parameters are: {'name': <class 'str'>, 'timeout_min': <class 'int'>, 'mem_gb': <class 'float'>, 'nodes': <class 'int'>, 'cpus_per_task': <class 'int'>, 'gpus_per_node': <class 'int'>, 'tasks_per_node': <class 'int'>, 'stderr_to_stdout': <class 'bool'>}.

Prefixing the parameter with local_ doesn't help either:

In [41]: ex.update_parameters(local_visible_gpus=[0,1])
[...]
NameError: Unknown argument 'visible_gpus' for executor 'local' in parameter 'local_visible_gpus'. Valid arguments: 
Known executors: slurm, local, debug
As a reminder, shared/generic (non-prefixed) parameters are: {'name': <class 'str'>, 'timeout_min': <class 'int'>, 'mem_gb': <class 'float'>, 'nodes': <class 'int'>, 'cpus_per_task': <class 'int'>, 'gpus_per_node': <class 'int'>, 'tasks_per_node': <class 'int'>, 'stderr_to_stdout': <class 'bool'>}.
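
For context, the local_ prefix follows the same convention that works fine for Slurm-specific parameters. A minimal sketch of that pattern (the partition name dev is a placeholder):

import submitit as st

ex = st.AutoExecutor(folder='/tmp/testfolder')
# A slurm_-prefixed parameter is forwarded to the SlurmExecutor only
# and ignored by the other executors, which is what I expected
# local_visible_gpus to do for the LocalExecutor.
ex.update_parameters(slurm_partition="dev")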

Note that the list of valid arguments for the local executor in that error message is empty. However, if I instead create my executor directly as a LocalExecutor, there is no error:

In [43]: loc_ex = st.LocalExecutor(folder="/tmp/testfolder")
In [44]: loc_ex.update_parameters(visible_gpus=[0,1])
In [45]: print('Success')
Success

Is this the intended behavior and I'm misunderstanding something, or could there be a bug in how AutoExecutor handles this parameter update?

Thanks very much in advance.

j0ma · Jun 25 '22 07:06

Managed to grab 2 GPUs by setting them visible through CUDA_VISIBLE_DEVICES and passing both visible_gpus and gpus_per_node to LocalExecutor. It is necessary to specify both; otherwise this test will pass, as gpus_requested defaults to 0.
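
For reference, the working combination looks roughly like this (a minimal sketch; the folder path and GPU ids are placeholders, and the same GPUs must be exposed to the parent process, e.g. by launching it with CUDA_VISIBLE_DEVICES=0,1):

import submitit as st

loc_ex = st.LocalExecutor(folder="/tmp/testfolder")
loc_ex.update_parameters(
    visible_gpus=[0, 1],  # which devices the jobs may see
    gpus_per_node=2,      # also required: gpus_requested defaults to 0 otherwise
)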

j0ma · Jun 26 '22 17:06

Actually, it seems that just using AutoExecutor and passing in gpus_per_node works:

import os
import submitit as st

# log_folder and slurm_timeout_min are defined elsewhere in my script.
executor = st.AutoExecutor(folder=str(log_folder), cluster="local")

# Derive the GPU count from CUDA_VISIBLE_DEVICES, if it is set.
try:
    env_cuda_visible_devices = os.environ["CUDA_VISIBLE_DEVICES"]
    _visible_gpus = [int(gpu_id) for gpu_id in env_cuda_visible_devices.split(",")]
except KeyError:
    _visible_gpus = []

executor.update_parameters(
    timeout_min=slurm_timeout_min,
    gpus_per_node=len(_visible_gpus),
)

When I pass in something larger, though, AutoExecutor seems to silently alter CUDA_VISIBLE_DEVICES:

In [1]: import submitit as st

In [2]: def check_gpus():
   ...:     import torch
   ...:     has_cuda = torch.cuda.is_available()
   ...:     import os
   ...:     try:
   ...:         env_cuda_visible_devices = os.environ["CUDA_VISIBLE_DEVICES"]
   ...:         _visible_gpus = env_cuda_visible_devices.split(",")
   ...:         _visible_gpus = [int(gpuid) for gpuid in _visible_gpus]
   ...:     except KeyError:
   ...:         _visible_gpus = []
   ...:     print(f"Torch has CUDA? {has_cuda}")
   ...:     print(f"Visible GPUs: {_visible_gpus}")
   ...: 

In [3]: ex = st.AutoExecutor(folder='/tmp/testfolder', cluster='local')

In [4]: ex.update_parameters(gpus_per_node=15) # in reality i have none

In [5]: import os

In [6]: print(os.environ.get("CUDA_VISIBLE_DEVICES"))
None

In [7]: j = ex.submit(check_gpus)

In [8]: print(j.stdout())
submitit INFO (2022-06-26 11:06:21,384) - Starting with JobEnvironment(job_id=9827, hostname=jonne-pad, local_rank=0(1), node=0(1), global_rank=0(1))
submitit INFO (2022-06-26 11:06:21,384) - Loading pickle: /tmp/testfolder/9827_submitted.pkl
Torch has CUDA? False
Visible GPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
submitit INFO (2022-06-26 11:06:21,738) - Job completed successfully

As we can see, the list [0, 1, ..., 14] corresponds to list(range(gpus_per_node)), which is clearly wrong in my case, since I have no GPUs at all. I can also imagine a case where I only want a subset of my GPUs made available to the executor. Is there a way to specify which GPUs I'd like to use when executing locally?

j0ma · Jun 26 '22 18:06

Any ideas on this, or am I doing/understanding something incorrectly?

j0ma · Aug 03 '22 17:08

I find that if I set other parameters like tasks_per_node, I can't even get the local executor to respect visible_gpus/gpus_per_node.

relh · Sep 25 '22 13:09

LocalExecutor is not as sophisticated as SLURM: it doesn't limit the number of concurrent jobs on one machine, nor does it allocate GPUs to jobs. You can try using CUDA_VISIBLE_DEVICES to control how each job executes, but this will be hard and won't scale if you have more jobs than GPUs.
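
For anyone who wants to try that anyway, a rough sketch of the idea (untested; it assumes a locally spawned job inherits the parent environment as long as no GPU parameters are set on the executor, and check_env is just an illustrative payload):

import os
import submitit as st

def check_env():
    # Each job reports which device it ended up with.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

executor = st.AutoExecutor(folder="/tmp/testfolder", cluster="local")
# Deliberately leave gpus_per_node unset so submitit does not
# rewrite CUDA_VISIBLE_DEVICES to list(range(gpus_per_node)).

jobs = []
for gpu_id in [0, 1]:  # hand out one device per job
    # Pin a GPU by mutating the environment just before submission;
    # the job process spawned by the local executor inherits it.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    jobs.append(executor.submit(check_env))

As said above, this only goes so far: once there are more jobs than GPUs, something has to queue them.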

gwenzek · Mar 02 '23 14:03