How to specify GPUs when executing locally?
I've successfully used submitit to submit jobs to our SLURM cluster, and overall the library works great.
However, I often need to work locally as well, and in those situations I would like to control which GPUs are visible to the local executor, similar to what can be done by setting CUDA_VISIBLE_DEVICES on the command line.
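For reference, the kind of control I mean is this (a minimal sketch; the GPU ids are placeholders):

import os

# Must be set before CUDA initializes (e.g. before importing torch),
# so that only GPUs 0 and 1 are visible to this process and its children.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print(torch.cuda.device_count())  # prints 2 on a machine with >= 2 GPUs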
I took a look at the source code for LocalExecutor and was able to find the visible_gpus parameter. However, when I create a local executor using AutoExecutor and try to use update_parameters to set visible_gpus to some value, I encounter an error:
In [39]: ex = st.AutoExecutor(folder='/tmp/testfolder', cluster='local')
In [40]: ex.update_parameters(visible_gpus=[0,1])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
[...]
NameError: Unknown executor 'visible' in parameter 'visible_gpus'.
Known executors: slurm, local, debug
As a reminder, shared/generic (non-prefixed) parameters are: {'name': <class 'str'>, 'timeout_min': <class 'int'>, 'mem_gb': <class 'float'>, 'nodes': <class 'int'>, 'cpus_per_task': <class 'int'>, 'gpus_per_node': <class 'int'>, 'tasks_per_node': <class 'int'>, 'stderr_to_stdout': <class 'bool'>}.
Prefixing the parameter with local_ doesn't help either:
In [41]: ex.update_parameters(local_visible_gpus=[0,1])
[...]
NameError: Unknown argument 'visible_gpus' for executor 'local' in parameter 'local_visible_gpus'. Valid arguments:
Known executors: slurm, local, debug
As a reminder, shared/generic (non-prefixed) parameters are: {'name': <class 'str'>, 'timeout_min': <class 'int'>, 'mem_gb': <class 'float'>, 'nodes': <class 'int'>, 'cpus_per_task': <class 'int'>, 'gpus_per_node': <class 'int'>, 'tasks_per_node': <class 'int'>, 'stderr_to_stdout': <class 'bool'>}.
However, if I create my executor directly as a LocalExecutor, there is no error:
In [43]: loc_ex = st.LocalExecutor(folder="/tmp/testfolder")
In [44]: loc_ex.update_parameters(visible_gpus=[0,1])
In [45]: print('Success')
Success
Is this intended behavior and I'm misunderstanding something, or is there a bug in how AutoExecutor handles this parameter update?
Thanks very much in advance.
Managed to grab 2 GPUs by setting them visible through CUDA_VISIBLE_DEVICES, and passing both visible_gpus and gpus_per_node to LocalExecutor. It is necessary to specify both; otherwise this check passes because gpus_requested defaults to 0.
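Concretely, what worked for me looks roughly like this (a sketch; the folder path and GPU ids are placeholders):

import os
import submitit as st

# Make GPUs 0 and 1 visible to this process and anything it spawns.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

loc_ex = st.LocalExecutor(folder="/tmp/testfolder")
# Both parameters are needed: visible_gpus alone is ignored because
# gpus_requested defaults to 0.
loc_ex.update_parameters(visible_gpus=[0, 1], gpus_per_node=2)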
Actually, it seems that just using AutoExecutor and passing in gpus_per_node works:
executor = st.AutoExecutor(folder=str(log_folder), cluster="local")
try:
    env_cuda_visible_devices = os.environ["CUDA_VISIBLE_DEVICES"]
    _visible_gpus = env_cuda_visible_devices.split(",")
    _visible_gpus = [int(gpuid) for gpuid in _visible_gpus]
except KeyError:
    _visible_gpus = []
executor.update_parameters(
    timeout_min=slurm_timeout_min,
    gpus_per_node=len(_visible_gpus),
)
When I pass in a larger value, though, AutoExecutor silently alters CUDA_VISIBLE_DEVICES:
In [1]: import submitit as st
In [2]: def check_gpus():
   ...:     import torch
   ...:     has_cuda = torch.cuda.is_available()
   ...:     import os
   ...:     try:
   ...:         env_cuda_visible_devices = os.environ["CUDA_VISIBLE_DEVICES"]
   ...:         _visible_gpus = env_cuda_visible_devices.split(",")
   ...:         _visible_gpus = [int(gpuid) for gpuid in _visible_gpus]
   ...:     except KeyError:
   ...:         _visible_gpus = []
   ...:     print(f"Torch has CUDA? {has_cuda}")
   ...:     print(f"Visible GPUs: {_visible_gpus}")
   ...:
In [3]: ex = st.AutoExecutor(folder='/tmp/testfolder', cluster='local')
In [4]: ex.update_parameters(gpus_per_node=15)  # in reality I have none
In [5]: import os
In [6]: print(os.environ.get("CUDA_VISIBLE_DEVICES"))
None
In [7]: j = ex.submit(check_gpus)
In [8]: print(j.stdout())
submitit INFO (2022-06-26 11:06:21,384) - Starting with JobEnvironment(job_id=9827, hostname=jonne-pad, local_rank=0(1), node=0(1), global_rank=0(1))
submitit INFO (2022-06-26 11:06:21,384) - Loading pickle: /tmp/testfolder/9827_submitted.pkl
Torch has CUDA? False
Visible GPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
submitit INFO (2022-06-26 11:06:21,738) - Job completed successfully
As we can see, the list [0, 1, ..., 14] corresponds to list(range(gpus_per_node)), which is clearly wrong in my case.
I can also imagine a case where I only want a subset of my GPUs to be made available to the executor.
Is there a way to specify which GPUs I'd like to use when executing locally?
Any ideas on this, or am I doing/understanding something incorrectly?
I find that if I set other parameters like tasks_per_node, I can't even get the local executor to respect visible_gpus/gpus_per_node.
LocalExecutor is not as sophisticated as SLURM: it doesn't limit the number of concurrent jobs on one machine, nor does it allocate GPUs to jobs. You can try using CUDA_VISIBLE_DEVICES to control how each job executes, but this will be tedious and won't scale if you have more jobs than GPUs.
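If you go the CUDA_VISIBLE_DEVICES route, a rough sketch of the idea (my own illustration, not an official submitit feature; run_on_gpu and my_task are hypothetical helpers) could look like this:

import os
import submitit

def my_task():
    # Placeholder workload; just report which GPU the job sees.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

def run_on_gpu(gpu_id, fn):
    # Set CUDA_VISIBLE_DEVICES inside the job, before any CUDA
    # initialization, so the job only sees its assigned GPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return fn()

executor = submitit.LocalExecutor(folder="/tmp/testfolder")
# One job per GPU; with more jobs than GPUs you would need your own
# scheduling on top of this, which is the scaling problem noted above.
jobs = [executor.submit(run_on_gpu, gpu_id, my_task) for gpu_id in (0, 1)]
print([j.result() for j in jobs])

This only hands out GPUs correctly as long as you submit no more jobs than you have GPUs; beyond that you would need your own bookkeeping.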