DeepSpeed
[BUG] export CUDA_VISIBLE_DEVICES=0,1,6,7 does not work
To Reproduce
Steps to reproduce the behavior:
$ export CUDA_VISIBLE_DEVICES=0,1,6,7
$ python ./deepy.py ./train.py ./configs/125M.yml ./configs/local_setup.yml
[2023-03-08 12:00:27,863] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-03-08 12:00:27,863] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-03-08 12:00:27,863] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-03-08 12:00:27,863] [INFO] [launch.py:104:main] dist_world_size=4
[2023-03-08 12:00:27,863] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
Expected behavior
[2023-03-08 12:00:27,863] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,6,7
Buggy code
https://github.com/microsoft/DeepSpeed/blob/58a4a4d4c19bda86d489ac171fa10f3ddb27c9d6/deepspeed/launcher/runner.py#L335-L338
Hi @xu-song, I'll try to repro this and take a look.
At first glance, it looks like we should be handling that properly here. Could you try setting it this way and let me know if that works for you?
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,6,7"
It still raises an error with os.environ["CUDA_VISIBLE_DEVICES"]="0,1,6,7" set. The exception comes from this check:
https://github.com/microsoft/DeepSpeed/blob/58a4a4d4c19bda86d489ac171fa10f3ddb27c9d6/deepspeed/launcher/runner.py#L290-L292
- slots: [0,1,6,7]
- host_info: {'localhost': [0, 1, 2, 3]}
It works if I remove the raise statement in the code linked above.
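To make the mismatch concrete, here is a minimal sketch of the kind of check that appears to fail. It is not DeepSpeed's actual code: `check_slots` is a hypothetical helper and the error text is illustrative. It only assumes that the requested slots are compared against the slot list recorded for the host, using the two values printed above.

```python
def check_slots(host_info, hostname, slots):
    """Hypothetical stand-in for the launcher's validation (not DeepSpeed's code)."""
    available = host_info[hostname]
    # Any requested slot that the host does not list is treated as an error.
    missing = [s for s in slots if s not in available]
    if missing:
        raise ValueError(f"No slot(s) {missing} specified on host '{hostname}'")
    return slots


# Values taken from the debug output above.
host_info = {"localhost": [0, 1, 2, 3]}
slots = [0, 1, 6, 7]

try:
    check_slots(host_info, "localhost", slots)
except ValueError as err:
    print(err)
```

Under that assumption, slots 6 and 7 are rejected because the host was recorded as having slots 0 to 3, i.e. four devices renumbered from zero.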
@xu-song - thanks, I was able to repro and I'm taking a look at this.
@xu-song - if you are using the DeepSpeed launcher, this isn't supported, but you can specify the devices to use this way:
https://www.deepspeed.ai/getting-started/#resource-configuration-single-node
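As a rough illustration of that approach, the sketch below turns a comma-separated device list into the `host:slots` string that the launcher's `--include` option takes. The helper name `include_arg` and the script name `train.py` are placeholders, and how the option is passed through a wrapper such as `deepy.py` is not covered here.

```python
import os


def include_arg(visible_devices, hostname="localhost"):
    """Build a 'host:id,id,...' string from a comma-separated device list."""
    ids = [d.strip() for d in visible_devices.split(",") if d.strip()]
    return f"{hostname}:{','.join(ids)}"


# Fall back to the device list from this issue if the variable is unset.
devices = os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,6,7")

# 'train.py' is a placeholder for your own training script and arguments.
cmd = ["deepspeed", "--include", include_arg(devices), "train.py"]
print(" ".join(cmd))
```

Running the printed command by hand (or passing `cmd` to `subprocess.run`) selects the devices through the launcher rather than through the environment variable.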
