[BUG] export CUDA_VISIBLE_DEVICES=0,1,6,7 does not work

Open xu-song opened this issue 2 years ago • 3 comments

To Reproduce

Steps to reproduce the behavior:

$ export CUDA_VISIBLE_DEVICES=0,1,6,7
$ python ./deepy.py ./train.py ./configs/125M.yml ./configs/local_setup.yml


[2023-03-08 12:00:27,863] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-03-08 12:00:27,863] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-03-08 12:00:27,863] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-03-08 12:00:27,863] [INFO] [launch.py:104:main] dist_world_size=4
[2023-03-08 12:00:27,863] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3

Expected behavior

[2023-03-08 12:00:27,863] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,6,7

Buggy code

https://github.com/microsoft/DeepSpeed/blob/58a4a4d4c19bda86d489ac171fa10f3ddb27c9d6/deepspeed/launcher/runner.py#L335-L338

xu-song avatar Mar 09 '23 02:03 xu-song

Hi @xu-song, I'll try to repro this and take a look.

loadams avatar Mar 10 '23 15:03 loadams

At first glance, it looks like we should be handling that properly here. Could you try setting it this way and let me know if that works for you?

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,6,7"

loadams avatar Mar 10 '23 16:03 loadams

It still raises an error with this setting: os.environ["CUDA_VISIBLE_DEVICES"]="0,1,6,7"

https://github.com/microsoft/DeepSpeed/blob/58a4a4d4c19bda86d489ac171fa10f3ddb27c9d6/deepspeed/launcher/runner.py#L290-L292

  • slots: [0,1,6,7]
  • host_info: {'localhost': [0, 1, 2, 3]}

It works if I remove the raise statement linked above.
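
For readers following along, the mismatch between the two values above can be sketched in a few lines of Python. This is only an illustration of the kind of membership check that appears to fail, not the actual code in runner.py. The function name `check_slots` and the error message are hypothetical, and the assumption (based on the log output earlier in this issue) is that the launcher numbers the visible devices 0..N-1 instead of keeping the original IDs.

```python
def check_slots(requested, available):
    """Return the requested slots if every one exists on the host.

    Hypothetical helper: raises ValueError when a requested slot
    is not in the host's list of available slots.
    """
    missing = [s for s in requested if s not in available]
    if missing:
        raise ValueError(f"slots {missing} not found on host")
    return list(requested)


# Values reported in this issue.
slots = [0, 1, 6, 7]
host_info = {"localhost": [0, 1, 2, 3]}

try:
    check_slots(slots, host_info["localhost"])
except ValueError as err:
    # Slots 6 and 7 are not in [0, 1, 2, 3], so the check fails.
    print(f"rejected: {err}")
```

Under that assumption, four visible GPUs are always listed as [0, 1, 2, 3], so any CUDA_VISIBLE_DEVICES value containing an ID of 4 or higher would be rejected by such a check.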

xu-song avatar Mar 13 '23 10:03 xu-song

@xu-song - thanks, I was able to repro and I'm taking a look at this.

loadams avatar Mar 17 '23 18:03 loadams

@xu-song - if you are using the DeepSpeed launcher, this isn't supported, but you can specify the nodes this way:

https://www.deepspeed.ai/getting-started/#resource-configuration-single-node
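
As a rough sketch of what the linked section describes, the DeepSpeed launcher accepts an `--include` argument of the form `host:gpu,gpu,...` to pick specific GPUs on a node. The helpers below only build that command line. The function names are hypothetical, `train.py` is a placeholder script, and wrappers such as deepy.py may pass launcher arguments differently, so treat this as an assumption to check against your own setup.

```python
import shlex


def build_include_arg(host, gpu_ids):
    """Format a host and its GPU IDs as 'host:id,id,...' (hypothetical helper)."""
    return f"{host}:{','.join(str(g) for g in gpu_ids)}"


def build_launch_command(script, host="localhost", gpu_ids=(0, 1, 6, 7)):
    """Build an argv list for the deepspeed launcher using --include."""
    return ["deepspeed", "--include", build_include_arg(host, gpu_ids), script]


cmd = build_launch_command("train.py")  # "train.py" is a placeholder
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine where the `deepspeed` command is installed. With this approach CUDA_VISIBLE_DEVICES is left unset and the launcher selects the devices itself.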

loadams avatar Apr 14 '23 15:04 loadams