[REQUEST] Support for compile without NCCL dependency
Is your feature request related to a problem? Please describe. On my hardware platform, the Jetson AGX Orin, the system does not support the NCCL libraries, so I cannot compile DeepSpeed on my device.
Describe the solution you'd like DeepSpeed should support compilation without NCCL and expose build flags like PyTorch's setup.py does, so that I can set export USE_NCCL=0 and export USE_MPI=1 (see the sketch below).
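For illustration only, here is a minimal sketch of what such an environment-variable gate could look like in setup.py; the USE_NCCL variable and the nccl_requested helper are assumptions on my side, not existing DeepSpeed options:
import os
import sys
import torch

def nccl_requested() -> bool:
    # Hypothetical flag mirroring PyTorch's USE_NCCL; default to enabled.
    return os.environ.get("USE_NCCL", "1") not in ("0", "OFF", "off")

nccl_version = "0.0"
if nccl_requested() and sys.platform != "win32":
    # Only touch torch.cuda.nccl if the installed PyTorch actually exposes it.
    if hasattr(torch.cuda, "nccl") and hasattr(torch.cuda.nccl, "version"):
        ver = torch.cuda.nccl.version()
        if isinstance(ver, (tuple, list)):
            nccl_version = ".".join(map(str, ver[:2]))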
Describe alternatives you've considered
Additional context My environment:
- OS: Ubuntu 22.04
- GPU: Jetson AGX Orin
- CPU: ARM64v8
- Python: 3.10.12
- CUDA: 11.6
related: #4104 /cc @tjruwase
@studyingflying - can you share the error that you hit when installing DeepSpeed?
@loadams thanks for the reply. Here is the error. It is caused by my PyTorch being installed without NCCL support, since my device doesn't support NCCL. I really want to find an alternative way to compile DeepSpeed.
$: pip install deepspeed
Collecting deepspeed
Using cached deepspeed-0.16.5.tar.gz (1.5 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [13 lines of output]
[2025-04-04 00:31:59,134] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-04 00:31:59,383] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
fatal: not a git repository (or any of the parent directories): .git
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-htrihlss/deepspeed_74c674a32cbe40df83a701dcbbe065b0/setup.py", line 262, in <module>
if isinstance(torch.cuda.nccl.version(), int):
File "/mnt/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/cuda/nccl.py", line 45, in version
ver = torch._C._nccl_version()
AttributeError: module 'torch._C' has no attribute '_nccl_version'
DS_BUILD_OPS=0
Install Ops={'async_io': False, 'fused_adam': False, 'cpu_adam': False, 'cpu_adagrad': False, 'cpu_lion': False, 'evoformer_attn': False, 'fp_quantizer': False, 'fused_lamb': False, 'fused_lion': False, 'gds': False, 'transformer_inference': False, 'inference_core_ops': False, 'cutlass_ops': False, 'quantizer': False, 'ragged_device_ops': False, 'ragged_ops': False, 'random_ltd': False, 'sparse_attn': False, 'spatial_inference': False, 'transformer': False, 'stochastic_transformer': False}
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
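For reference, the installed PyTorch can be checked directly; this is a minimal sketch, assuming the build at least includes torch.distributed:
import torch
import torch.distributed as dist

print(torch.__version__)
print("NCCL available:", dist.is_nccl_available())  # False on a PyTorch build without NCCL

try:
    # The same call that fails inside DeepSpeed's setup.py on this build.
    print("NCCL version:", torch.cuda.nccl.version())
except AttributeError as err:
    print("No NCCL in this PyTorch build:", err)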
I met the same problem, have you solved it?
I solved it by editing the setup.py file in the DeepSpeed repository and building it from source.
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
Around line 260, find the following content:
if torch_available and torch.version.cuda is not None:
cuda_version = ".".join(torch.version.cuda.split('.')[:2])
if sys.platform != "win32":
if isinstance(torch.cuda.nccl.version(), int):
# This will break if minor version > 9.
nccl_version = ".".join(str(torch.cuda.nccl.version())[:2])
else:
nccl_version = ".".join(map(str, torch.cuda.nccl.version()[:2]))
if hasattr(torch.cuda, 'is_bf16_supported') and torch.cuda.is_available():
bf16_support = torch.cuda.is_bf16_supported()
Change it to:
if torch_available and torch.version.cuda is not None:
cuda_version = ".".join(torch.version.cuda.split('.')[:2])
nccl_version = "0.0" # default value
try:
if sys.platform != "win32" and hasattr(torch.cuda, "nccl") and hasattr(torch.cuda.nccl, "version"):
ver = torch.cuda.nccl.version()
if isinstance(ver, int):
nccl_version = ".".join(str(ver)[:2])
else:
nccl_version = ".".join(map(str, ver[:2]))
except Exception as e:
print(f"[DeepSpeed setup] Warning: NCCL version detection failed ({e}), skipping NCCL setup.")
if hasattr(torch.cuda, 'is_bf16_supported') and torch.cuda.is_available():
bf16_support = torch.cuda.is_bf16_supported()
Then install it with:
DS_BUILD_OPS=0 pip install . --no-build-isolation
This skips NCCL version detection when torch.cuda.nccl.version() is unavailable and lets DeepSpeed install successfully.
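As a quick sanity check afterwards (a minimal sketch; it only verifies the import and the detected accelerator, nothing NCCL-specific):
import deepspeed
from deepspeed.accelerator import get_accelerator

print("DeepSpeed:", deepspeed.__version__)
print("Accelerator:", get_accelerator().device_name())  # expected to report "cuda" on the Orin
Running ds_report from the shell also prints which ops are compatible with the environment.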