DeepSpeed
Change request regarding the use of CUDA_VISIBLE_DEVICES in deepspeed/launcher/runner.py
I tried to specify gpu_ids to train a model on one node, and it kept failing. I then found that CUDA_VISIBLE_DEVICES is reset when training on a single node.
I don't think it is necessary to reset CUDA_VISIBLE_DEVICES when training on a single node, since doing so prevents users from choosing which GPUs to train on. Could you change this part to make single-node training easier?
Thank you for your attention to this matter.
deepspeed --include localhost:1,2 uses cuda:1 and cuda:2. (Note that you can't pass --num_gpus to deepspeed when you pass --include.)
cf. deepspeed --help
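For anyone scripting this, here is a minimal Python sketch of how an --include value such as localhost:1,2 breaks down into a host and its GPU ids, and how the matching command line could be assembled. It is an illustration only, not DeepSpeed's actual parsing code. The helpers parse_include and build_command and the script name train.py are hypothetical, and the host:slot,slot format with @ between hosts is assumed from the deepspeed --help text.

```python
from typing import Dict, List, Optional


def parse_include(include: str) -> Dict[str, Optional[List[int]]]:
    """Split an --include style string into {host: [gpu ids]}.

    Hypothetical helper for illustration; assumes the format
    NAME[:SLOT[,SLOT ...]] with multiple hosts joined by '@'.
    A host given without slots maps to None, meaning "all GPUs".
    """
    resources: Dict[str, Optional[List[int]]] = {}
    for node_spec in include.split("@"):
        host, _, slots = node_spec.partition(":")
        resources[host] = [int(s) for s in slots.split(",")] if slots else None
    return resources


def build_command(include: str, script: str, *script_args: str) -> List[str]:
    """Assemble a deepspeed command line that selects GPUs via --include.

    --num_gpus is deliberately left out, since it cannot be combined
    with --include.
    """
    return ["deepspeed", "--include", include, script, *script_args]


if __name__ == "__main__":
    print(parse_include("localhost:1,2"))  # {'localhost': [1, 2]}
    # train.py is a placeholder for your own training script.
    print(" ".join(build_command("localhost:1,2", "train.py")))
```

The command list can be handed to subprocess.run if you want to launch training from Python instead of the shell.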
Hi @JY-Ren - I believe the suggestion from @richarddwang will solve this for you. We always recommend using a hostfile if possible. Please let us know, and re-open the issue, if you don't find this sufficient.