[REQUEST] Support for compile without NCCL dependency

Open studyingflying opened this issue 8 months ago • 5 comments

Is your feature request related to a problem? Please describe. My hardware platform, Jetson AGX Orin, does not support the NCCL libraries, so I cannot compile DeepSpeed on my device.

Describe the solution you'd like DeepSpeed should support compilation without NCCL and offer alternative build parameters, as PyTorch's setup.py does, where I can set export USE_NCCL=0 and export USE_MPI=1.

Describe alternatives you've considered

Additional context My environment:

  • OS: Ubuntu 22.04
  • GPU: Jetson AGX Orin
  • CPU: ARM64v8
  • Python: 3.10.12
  • CUDA: 11.6

studyingflying avatar Apr 03 '25 09:04 studyingflying

related: #4104 /cc @tjruwase

studyingflying avatar Apr 03 '25 09:04 studyingflying

@studyingflying - can you share the error that you hit when installing DeepSpeed?

loadams avatar Apr 03 '25 15:04 loadams

@loadams thanks for the reply. Here is the error. It occurs because my PyTorch was installed without NCCL support: my device doesn't support NCCL, so I would really like to find an alternative way to compile DeepSpeed.

$: pip install deepspeed
Collecting deepspeed
  Using cached deepspeed-0.16.5.tar.gz (1.5 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [13 lines of output]
      [2025-04-04 00:31:59,134] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      [2025-04-04 00:31:59,383] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      fatal: not a git repository (or any of the parent directories): .git
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-htrihlss/deepspeed_74c674a32cbe40df83a701dcbbe065b0/setup.py", line 262, in <module>
          if isinstance(torch.cuda.nccl.version(), int):
        File "/mnt/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/cuda/nccl.py", line 45, in version
          ver = torch._C._nccl_version()
      AttributeError: module 'torch._C' has no attribute '_nccl_version'
      DS_BUILD_OPS=0
      Install Ops={'async_io': False, 'fused_adam': False, 'cpu_adam': False, 'cpu_adagrad': False, 'cpu_lion': False, 'evoformer_attn': False, 'fp_quantizer': False, 'fused_lamb': False, 'fused_lion': False, 'gds': False, 'transformer_inference': False, 'inference_core_ops': False, 'cutlass_ops': False, 'quantizer': False, 'ragged_device_ops': False, 'ragged_ops': False, 'random_ltd': False, 'sparse_attn': False, 'spatial_inference': False, 'transformer': False, 'stochastic_transformer': False}
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
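
The AttributeError in the traceback is what one would expect from a PyTorch build compiled without NCCL. One way to confirm this before attempting the install is the minimal sketch below. The helper name nccl_build_report is made up for illustration; it only calls torch.version.cuda, torch.distributed.is_available() and torch.distributed.is_nccl_available(), and it returns a plain dict instead of raising when PyTorch is not installed.

```python
def nccl_build_report():
    """Collect facts about the local PyTorch build without raising.

    Hypothetical helper for illustration; not part of DeepSpeed or PyTorch.
    """
    report = {"torch_installed": False, "cuda_build": None, "nccl_available": None}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return report

    report["torch_installed"] = True
    # CUDA version PyTorch was built against, or None for CPU-only builds.
    report["cuda_build"] = torch.version.cuda
    try:
        import torch.distributed as dist
        # is_nccl_available() is only queried when the distributed package exists.
        report["nccl_available"] = bool(dist.is_available() and dist.is_nccl_available())
    except Exception:
        report["nccl_available"] = False
    return report


if __name__ == "__main__":
    print(nccl_build_report())
```

If nccl_available comes back False while cuda_build is set, the setup.py check on torch.cuda.nccl.version() is likely to fail in the same way as in the log above.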

studyingflying avatar Apr 03 '25 16:04 studyingflying

I ran into the same problem. Have you solved it?

Logos515 avatar Nov 09 '25 02:11 Logos515

I solved it by editing the setup.py file in the DeepSpeed repository and building it from source.

git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed

Around line 260 of setup.py, find the following block:

if torch_available and torch.version.cuda is not None:
    cuda_version = ".".join(torch.version.cuda.split('.')[:2])
    if sys.platform != "win32":
        if isinstance(torch.cuda.nccl.version(), int):
            # This will break if minor version > 9.
            nccl_version = ".".join(str(torch.cuda.nccl.version())[:2])
        else:
            nccl_version = ".".join(map(str, torch.cuda.nccl.version()[:2]))
    if hasattr(torch.cuda, 'is_bf16_supported') and torch.cuda.is_available():
        bf16_support = torch.cuda.is_bf16_supported()

Change it to:

if torch_available and torch.version.cuda is not None:
    cuda_version = ".".join(torch.version.cuda.split('.')[:2])
    nccl_version = "0.0"  # fallback when NCCL is missing or cannot be queried
    try:
        if sys.platform != "win32" and hasattr(torch.cuda, "nccl") and hasattr(torch.cuda.nccl, "version"):
            # On a PyTorch build without NCCL this call raises AttributeError,
            # which is caught below instead of aborting setup.py.
            ver = torch.cuda.nccl.version()
            if isinstance(ver, int):
                # This will break if minor version > 9.
                nccl_version = ".".join(str(ver)[:2])
            else:
                nccl_version = ".".join(map(str, ver[:2]))
    except Exception as e:
        print(f"[DeepSpeed setup] Warning: NCCL version detection failed ({e}), skipping NCCL setup.")
    if hasattr(torch.cuda, 'is_bf16_supported') and torch.cuda.is_available():
        bf16_support = torch.cuda.is_bf16_supported()

Then install it from the source tree with:

DS_BUILD_OPS=0 pip install . --no-build-isolation

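With this change, a failing torch.cuda.nccl.version() call no longer aborts setup.py: the exception is caught, nccl_version keeps its default of "0.0", and DeepSpeed installs successfully.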
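
To check the fallback logic without a Jetson at hand, the same idea can be pulled into a small function and run against stand-in objects. This is only a sketch: detect_nccl_version and the SimpleNamespace stand-ins are hypothetical names used for illustration and are not part of DeepSpeed's setup.py. The int branch keeps the string slicing from the block above, so it shares the same limitation for minor versions greater than 9.

```python
import sys
from types import SimpleNamespace


def detect_nccl_version(torch_module, platform=sys.platform, default="0.0"):
    """Return NCCL 'major.minor', or `default` if it cannot be determined.

    Hypothetical helper mirroring the patched setup.py logic.
    """
    if platform == "win32":
        return default
    try:
        ver = torch_module.cuda.nccl.version()
    except Exception as exc:  # e.g. AttributeError on builds without NCCL
        print(f"NCCL version detection failed ({exc}); using {default}")
        return default
    if isinstance(ver, int):
        # Same slicing as in setup.py; breaks if minor version > 9.
        return ".".join(str(ver)[:2])
    return ".".join(map(str, ver[:2]))


def _raise_missing_nccl():
    # Reproduces the error from the traceback in this thread.
    raise AttributeError("module 'torch._C' has no attribute '_nccl_version'")


def _fake_torch(version_fn):
    # Stand-in exposing only torch.cuda.nccl.version(); an assumption for testing.
    return SimpleNamespace(cuda=SimpleNamespace(nccl=SimpleNamespace(version=version_fn)))


torch_without_nccl = _fake_torch(_raise_missing_nccl)
torch_with_tuple = _fake_torch(lambda: (2, 19, 3))
torch_with_int = _fake_torch(lambda: 2708)

print(detect_nccl_version(torch_without_nccl, platform="linux"))  # 0.0
print(detect_nccl_version(torch_with_tuple, platform="linux"))    # 2.19
print(detect_nccl_version(torch_with_int, platform="linux"))      # 2.7
```

Catching the exception around the single call, rather than relying only on hasattr, matters here: on the build from the traceback, torch.cuda.nccl.version exists as an attribute and fails only when it is called.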

Logos515 avatar Nov 09 '25 02:11 Logos515