Change request regarding the use of CUDA_VISIBLE_DEVICES in deepspeed/launcher/runner.py

Open JY-Ren opened this issue 2 years ago • 1 comments

I tried to specify gpu_ids to train a model on one node, and it kept failing. I then found that CUDA_VISIBLE_DEVICES is reset when training on a single node.

I don't think it is necessary to reset CUDA_VISIBLE_DEVICES when training on a single node, since doing so prevents users from specifying gpu_ids to train their models. Could you change this part to make single-node training easier?

Thank you for your attention to this matter.

JY-Ren avatar Mar 09 '23 09:03 JY-Ren

deepspeed --include localhost:1,2 uses cuda:1 and cuda:2. (Note that you can't specify --num_gpus for deepspeed when specifying --include.)

cf. deepspeed --help
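
To make the --include format above concrete, here is a minimal Python sketch of how a string such as localhost:1,2 maps to GPU indices. It is only an illustration: parse_include is a hypothetical helper, not DeepSpeed's actual parser, and the "@" separator between nodes is assumed from the format described in deepspeed --help.

```python
from typing import Dict, List


def parse_include(include: str) -> Dict[str, List[int]]:
    """Parse an include string like 'localhost:1,2' into {host: [gpu ids]}.

    Hypothetical helper for illustration only; not DeepSpeed's own code.
    """
    resources: Dict[str, List[int]] = {}
    # Assumption: multiple nodes are separated by '@', e.g. 'a:0,1@b:2'.
    for node_spec in include.split("@"):
        host, sep, slots = node_spec.partition(":")
        # In this sketch an empty list means "no slots given for this host".
        resources[host] = [int(s) for s in slots.split(",")] if sep else []
    return resources


if __name__ == "__main__":
    print(parse_include("localhost:1,2"))  # {'localhost': [1, 2]}
```

With localhost:1,2 the selected devices are 1 and 2, which matches the cuda:1 and cuda:2 described above.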

richarddwang avatar Mar 16 '23 08:03 richarddwang

Hi @JY-Ren - I believe the suggestion from @richarddwang will solve this for you. We always recommend using a hostfile or the --include flag where possible. Please let us know and re-open the issue if you don't find this sufficient.

loadams avatar Aug 18 '23 17:08 loadams