DeepSpeed
[BUG] Launcher does not honor CUDA_VISIBLE_DEVICES
Describe the bug
The documentation here implies that CUDA_VISIBLE_DEVICES is not supported, but the launcher script does attempt to handle that case. Given how commonly CUDA_VISIBLE_DEVICES is used, I think this still qualifies as a bug.
On a single node with CUDA_VISIBLE_DEVICES=0,2, for example, launching deepspeed fails with this error:
ValueError: No slot '2' specified on host 'localhost'
There is a simple, if clunky, workaround.
INCLUDE_STR="localhost:$CUDA_VISIBLE_DEVICES"
unset CUDA_VISIBLE_DEVICES
deepspeed --include $INCLUDE_STR ...
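For anyone launching from Python rather than a shell, the same workaround can be sketched as a small helper. This is only an illustration: `build_include_workaround` is a hypothetical name, not part of DeepSpeed, and the only launcher feature it relies on is the `--include` flag shown above.

```python
from typing import Dict, List, Tuple


def build_include_workaround(
    env: Dict[str, str], host: str = "localhost"
) -> Tuple[List[str], Dict[str, str]]:
    """Hypothetical helper mirroring the shell workaround.

    Returns the start of a deepspeed command line plus a copy of the
    environment with CUDA_VISIBLE_DEVICES removed.
    """
    cleaned = dict(env)
    # Equivalent of `unset CUDA_VISIBLE_DEVICES`, keeping the old value.
    visible = cleaned.pop("CUDA_VISIBLE_DEVICES", "")
    cmd = ["deepspeed"]
    ids = [part.strip() for part in visible.split(",") if part.strip()]
    if ids:
        # Equivalent of INCLUDE_STR="localhost:$CUDA_VISIBLE_DEVICES".
        cmd += ["--include", f"{host}:{','.join(ids)}"]
    return cmd, cleaned
```

The returned command prefix and environment could then be passed to something like `subprocess.run(cmd + script_args, env=cleaned)`, where `script_args` stands for whatever follows `--include` in the original command.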
To Reproduce
Steps to reproduce the behavior:
- Simple inference script to reproduce
import collections
import os

import pytest


class FakeAccelerator:
    def __init__(self, num_devices: int = 2):
        self.num_devices = num_devices

    def device_count(self) -> int:
        return self.num_devices


def test_main_with_cuda_visible_devices(monkeypatch):
    fake_acc = FakeAccelerator(2)
    from deepspeed.launcher import runner

    monkeypatch.setattr(runner, "get_accelerator", lambda: fake_acc)
    cvd = "0,1"
    os.environ["CUDA_VISIBLE_DEVICES"] = cvd
    runner.main()
    cvd = "0,2"
    os.environ["CUDA_VISIBLE_DEVICES"] = cvd
    with pytest.raises(ValueError):
        runner.main()


def test_parse_resource_filter():
    from deepspeed.launcher.runner import parse_resource_filter

    resource_pool = collections.OrderedDict({"localhost": list(range(2))})
    parse_resource_filter(resource_pool, include_str="localhost:0,1", exclude_str="")
    with pytest.raises(ValueError):
        parse_resource_filter(
            resource_pool, include_str="localhost:0,2", exclude_str=""
        )


def test_parse_inclusion_exclusion():
    from deepspeed.launcher.runner import parse_inclusion_exclusion

    resource_pool = collections.OrderedDict({"localhost": 2})
    parse_inclusion_exclusion(resource_pool, inclusion="localhost:0,1", exclusion="")
    with pytest.raises(ValueError):
        parse_inclusion_exclusion(
            resource_pool, inclusion="localhost:0,2", exclusion=""
        )
- What packages are required and their versions
pytest
- How to run the script
pytest tests/unit/launcher/test_cuda_visible_devices.py
Expected behavior
DeepSpeed should launch, setting the include string to "localhost:0,2".
Additional context
As I mentioned in #4248, the code modifies the include_str to match CUDA_VISIBLE_DEVICES, but then relies on the accelerator to determine the total number of devices. For the cuda accelerator, device_count takes CUDA_VISIBLE_DEVICES into account if it is set. The parse_inclusion_exclusion function then assumes that the devices are numbered consecutively, starting from zero. This leads to a mismatch when reconciling the include_str with the introspected resources whenever the index of a visible device is greater than or equal to the total number of devices. With CUDA_VISIBLE_DEVICES=0,2, for example, the device count is 2, so the assumed slots are 0 and 1 and slot 2 is rejected. Note that this would not be a problem if the accelerator always returned the count of all physical devices, but the cuda accelerator uses torch.cuda.device_count, which uses a cached value if possible. So even though runner.main unsets the CUDA_VISIBLE_DEVICES env var, torch has likely already grabbed the value.
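The mismatch described above can be reproduced without DeepSpeed or a GPU. The sketch below is a simplified model, not the launcher's actual code: `pool_from_device_count`, `check_include`, and `pool_from_visible_devices` are hypothetical names, and the last one only illustrates one possible direction for a fix (building the slot list from the visible device IDs instead of from a count).

```python
from collections import OrderedDict
from typing import Dict, List


def pool_from_device_count(count: int, host: str = "localhost") -> Dict[str, List[int]]:
    # Models the assumption that slots are numbered 0..count-1.
    return OrderedDict({host: list(range(count))})


def pool_from_visible_devices(visible: str, host: str = "localhost") -> Dict[str, List[int]]:
    # Hypothetical alternative: use the actual visible device IDs as slots.
    return OrderedDict({host: [int(s) for s in visible.split(",") if s.strip()]})


def check_include(pool: Dict[str, List[int]], include_str: str) -> List[int]:
    # Simplified stand-in for the launcher's include-string validation.
    host, _, slots = include_str.partition(":")
    if host not in pool:
        raise ValueError(f"Hostname '{host}' not found in resource pool")
    requested = [int(s) for s in slots.split(",") if s.strip()]
    for slot in requested:
        if slot not in pool[host]:
            raise ValueError(f"No slot '{slot}' specified on host '{host}'")
    return requested


if __name__ == "__main__":
    visible = "0,2"
    include_str = f"localhost:{visible}"

    # Two visible devices, so a count-based pool holds slots [0, 1].
    count_pool = pool_from_device_count(len(visible.split(",")))
    try:
        check_include(count_pool, include_str)
    except ValueError as err:
        print("count-based pool:", err)

    # A pool built from the IDs themselves accepts the same include string.
    id_pool = pool_from_visible_devices(visible)
    print("id-based pool:", check_include(id_pool, include_str))
```

In this model the count-based pool rejects slot 2 in the same way as the error quoted at the top of the report, while the ID-based pool accepts it. Whether the real launcher should build its pool this way is a design question for the maintainers.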