
CUDA_VISIBLE_DEVICES isn't correctly inherited on a SLURM system

devinrouthuzh opened this issue on Aug 27 '21 · 8 comments

Describe the bug This issue occurs on a SLURM cluster where worker nodes equipped with multiple GPUs are shared among users. GPUs are given slot number assignments (for example, 0-7 on a node with 8 GPUs), and the SLURM scheduler may assign a user any subset of a node's GPUs. For example, a SLURM assignment could set CUDA_VISIBLE_DEVICES to 4,5.

Per this bug report, I tried using the --include flag of the deepspeed command to pass the specific GPU ids assigned by SLURM (i.e., the values in CUDA_VISIBLE_DEVICES). When I ran the following command:

deepspeed --include localhost:4,5 mycode.py --deepspeed ds_config.json

I received the following error:

[2021-08-27 14:16:05,380] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Traceback (most recent call last):
  File "/envs/huggingface_deepspeed_python/bin/deepspeed", line 6, in <module>
    main()
  File "/envs/huggingface_deepspeed_python/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 280, in main
    active_resources = parse_inclusion_exclusion(resource_pool,
  File "/envs/huggingface_deepspeed_python/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 248, in parse_inclusion_exclusion
    return parse_resource_filter(active_resources,
  File "envs/huggingface_deepspeed_python/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 198, in parse_resource_filter
    raise ValueError("No slot '{}' specified on host '{}'".format(
ValueError: No slot '4' specified on host 'localhost'

Note: the value of echo $CUDA_VISIBLE_DEVICES was 4,5 in this example.

When I instead tried the following command:

deepspeed --include localhost:0,1 mycode.py --deepspeed ds_config.json

I successfully ran the mycode.py script with the following output preceding it:

[2021-08-27 14:16:11,581] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-08-27 14:16:11,717] [INFO] [runner.py:360:main] cmd = /envs/huggingface_deepspeed_python/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 mycode.py --deepspeed ds_config.json
[2021-08-27 14:16:12,341] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2021-08-27 14:16:12,341] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=2, node_rank=0
[2021-08-27 14:16:12,341] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2021-08-27 14:16:12,341] [INFO] [launch.py:102:main] dist_world_size=2
[2021-08-27 14:16:12,341] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0,1

But as I received GPUs that were already in use, my processes ran out of memory:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 31.75 GiB total capacity; 550.30 MiB already allocated; 2.75 MiB free; 574.00 MiB reserved in total by PyTorch)

After reading the bug report I referenced above, I noticed that the world_info dictionary submitted to the launch script as a base64 string uses torch.cuda.device_count() to build the list of GPUs to use. Would it instead be possible to inherit the pre-existing CUDA_VISIBLE_DEVICES assignments?

As a preliminary work-around, I've been creating a custom world_info dictionary and calling the launch script directly, like so:

WID=$(echo "{\"localhost\": [$CUDA_VISIBLE_DEVICES]}" | base64)
python -u -m deepspeed.launcher.launch --world_info=$WID --master_addr=127.0.0.1 --master_port=29500 mycode.py --deepspeed ds_config.json
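
For reference, here is a rough Python equivalent of that shell workaround. It is a sketch only: it assumes the launcher accepts a urlsafe base64-encoded JSON mapping of hostname to GPU ids (the same payload the shell snippet builds), and it reuses mycode.py and ds_config.json from the commands above.

# Sketch of the same workaround in Python: build world_info from the
# SLURM-provided CUDA_VISIBLE_DEVICES and call deepspeed.launcher.launch
# directly. Assumes the launcher decodes a urlsafe base64 JSON payload of the
# form {"localhost": [gpu ids]}, as the shell snippet above relies on.
import base64
import json
import os
import subprocess
import sys

slots = [int(s) for s in os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",") if s]
world_info = base64.urlsafe_b64encode(
    json.dumps({"localhost": slots}).encode("utf-8")
).decode("utf-8")

subprocess.run(
    [
        sys.executable, "-u", "-m", "deepspeed.launcher.launch",
        f"--world_info={world_info}",
        "--master_addr=127.0.0.1",
        "--master_port=29500",
        "mycode.py", "--deepspeed", "ds_config.json",
    ],
    check=True,
)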

Hostfile attempt

I also tried with a specified hostfile. This input:

$ cat myhostfile
workerhostname slots=4,5

resulted in the following output:

[2021-08-27 15:13:12,639] [ERROR] [runner.py:139:fetch_hostfile] Hostfile is not formatted correctly, unable to proceed with training.
Traceback (most recent call last):
  File "/envs/huggingface_deepspeed_python/bin/deepspeed", line 6, in <module>
    main()
  File "/envs/huggingface_deepspeed_python/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 267, in main
    resource_pool = fetch_hostfile(args.hostfile)
  File "/envs/huggingface_deepspeed_python/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 141, in fetch_hostfile
    raise err
  File "/envs/huggingface_deepspeed_python/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 137, in fetch_hostfile
    slot_count = int(slot_count)
ValueError: invalid literal for int() with base 10: '4,5'

Called via:

deepspeed --hostfile myhostfile mycode.py --deepspeed ds_config.json

I presumed I was simply misunderstanding the syntax, so as an experiment I changed the hostfile to:

workerhostname slots=5

I had thought that meant specifying up to 5 GPUs (whereas I'm only assigned/using 2 GPUs in this example). Calling deepspeed as above resulted in the node prompting me for its password, which of course was also not desired behavior.

Expected behavior I expected deepspeed to inherit the specific GPU numeric assignments from CUDA_VISIBLE_DEVICES. It seems as though DeepSpeed always re-indexes the GPU assignments to start from 0. I believe this code snippet shows how the values in the world_info dictionary are created (specifically line 306: list(range(args.num_gpus))).
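
A minimal toy sketch of the difference being described here, using the CUDA_VISIBLE_DEVICES=4,5 assignment from this report (an illustration only, not DeepSpeed's actual code):

# Toy illustration, not DeepSpeed's code: with the SLURM assignment from this
# report, a 0-based range and the inherited device ids give different slots.
visible = "4,5"                                   # value SLURM exported
num_gpus = len(visible.split(","))                # what torch.cuda.device_count() reports

reindexed = list(range(num_gpus))                 # [0, 1] -> current launcher behaviour
inherited = [int(s) for s in visible.split(",")]  # [4, 5] -> what this report asks for

print({"localhost": reindexed})   # what gets encoded into world_info today
print({"localhost": inherited})   # what inheriting CUDA_VISIBLE_DEVICES would give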

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/envs/huggingface_deepspeed_python/lib/python3.9/site-packages/torch']
torch version .................... 1.9.0
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/envs/huggingface_deepspeed_python/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.5.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 10.2

Software Version Notes I'm running Python 3.9.6 within a Conda environment where I've installed deepspeed via pip (and all other libraries via conda install).

Additional Notes and Future Usage Currently I'm only running these workflows on a single node, which has anywhere from 8 to 16 GPUs. However, I am interested in applying these workflows across multiple nodes that have high-speed interconnects.

Please let me know if you require any further information.

devinrouthuzh · Aug 27 '21

I wanted to quickly follow up on this bug report to see if there has been any discussion of a fix. Since the underlying issue is that the CUDA_VISIBLE_DEVICES variable isn't correctly inherited (i.e., the launcher re-indexes the devices from 0 up to the number of GPUs listed in the variable), the problem may not be limited to SLURM systems.

Per my report, if I'm misunderstanding any of DeepSpeed's functionality, please let me know. Thanks a bunch!

devinrouthuzh · Sep 17 '21

I am having the same issue. Has this been solved?

Update: I realized that by setting --num_gpus (or --include, etc.), we are basically overwriting the CUDA_VISIBLE_DEVICES environment variable. On a SLURM system, the node only sees what it gets, so set the number of GPUs via SLURM and do not set --num_gpus for DeepSpeed. By default, DeepSpeed uses all the GPUs it sees when --num_gpus is not set.
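
A minimal sketch for checking what the job actually sees from inside a SLURM allocation (assuming torch is available in the job's environment):

# Sketch: run inside the SLURM allocation to see which GPUs the job was given.
# CUDA re-indexes the visible devices from 0 within the process; per the
# comment above, deepspeed without --num_gpus/--include defaults to all of them.
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))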

habvt · Jan 24 '23

same issue here. Any solution?

leesky1c · Oct 08 '23

> same issue here. Any solution?

You can add --include localhost:0,1 to the deepspeed command. In my case, my GPUs are 3,4,5,6 on the SLURM node, so I need to use --include localhost:0,1,2,3.
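
A hypothetical helper in that spirit (not part of DeepSpeed): count the devices SLURM granted and build a 0-based --include list. Note that, as the original report above shows, the re-indexed ids can still land on GPUs that other jobs are using on clusters that expose every device of the node.

# Hypothetical helper, not part of DeepSpeed: translate the SLURM-provided
# CUDA_VISIBLE_DEVICES (e.g. "3,4,5,6") into the 0-based list that --include
# expects (e.g. "localhost:0,1,2,3"), then launch. mycode.py and ds_config.json
# are the files from the original report.
import os
import subprocess

visible = [s for s in os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",") if s]
include = "localhost:" + ",".join(str(i) for i in range(len(visible)))

subprocess.run(
    ["deepspeed", "--include", include, "mycode.py", "--deepspeed", "ds_config.json"],
    check=True,
)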

CaesarWWK · Oct 17 '23

deepspeed --include localhost:4,5 --master_port=29501 inference.py works for me

kushshrivastava · Dec 19 '23

I had the same issue but on an SGE cluster, sharing what worked for me in case it helps anyone.

My understanding is that when you submit a job with, say, 4 GPUs (4,5,6,7), the scheduler sets CUDA_VISIBLE_DEVICES to those values. When you run deepspeed --include localhost:4,5,6,7 train.py, deepspeed raises an error saying that it can't find GPU 4 on this machine.

What works is to reset CUDA_VISIBLE_DEVICES and let deepspeed figure it out, i.e. CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" deepspeed --include localhost:4,5,6,7 train.py. Internally, deepspeed will set CUDA_VISIBLE_DEVICES back to 4,5,6,7, so you don't need to worry about using GPUs that are not allocated to you.

This worked for me at least. In your case, the command would be CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" deepspeed --include localhost:4,5 mycode.py --deepspeed ds_config.json

To be honest, I don't have a solid explanation for this behaviour, but a naive one would be that when CUDA_VISIBLE_DEVICES=4,5,6,7, deepspeed somehow understands that this node has only 4 GPUs, and since it always indexes starting from 0, it can only find GPUs up to 3. For example, if by chance the scheduler allocated GPUs 0,1,2,3 to you, the command will work fine; but if you were allocated any GPU > 3, it will raise an error saying the node doesn't have that GPU.
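
A toy reconstruction of that guess (not DeepSpeed's actual code): if the launcher's resource pool is simply range(number of visible GPUs), then any requested slot id at or beyond that count cannot be matched.

# Toy reconstruction of the guess above, not DeepSpeed's actual code: if the
# resource pool is built as range(number of visible GPUs), requested slot ids
# beyond that count cannot be found.
def check_slots(requested, num_visible_gpus):
    pool = list(range(num_visible_gpus))  # e.g. 4 visible GPUs -> [0, 1, 2, 3]
    for slot in requested:
        if slot not in pool:
            raise ValueError(f"No slot '{slot}' specified on host 'localhost'")
    return requested

print(check_slots([0, 1, 2, 3], 4))  # works: all requested slots exist in the pool
try:
    check_slots([4, 5, 6, 7], 4)
except ValueError as err:
    print(err)                       # No slot '4' specified on host 'localhost'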

Please let me know if you have a better explanation. I hope this helps.

ahmedhshahin · Jan 04 '24

Is there any news on this?

I have the same issue, and the solution provided by @ahmedhshahin, while it may work, goes against the SLURM best practice of not modifying the environment variables set by SLURM itself:

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" deepspeed --include localhost:4,5 mycode.py --deepspeed ds_config.json

FGiuliari · Apr 09 '24

I agree, it is not a good practice and a better solution is needed to correctly handle this case.

ahmedhshahin · Apr 13 '24