Installation instructions for ROCm absent
The installation instructions state that installing DeepSpeed is as simple as pip install deepspeed. However, this installs the NVIDIA (CUDA) build of PyTorch.
If I install the ROCm build of PyTorch with:
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.4
and then run
pip install deepspeed
the command ds_report fails with the following error message:
[2025-09-19 13:31:26,997] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
File "/home/user/git/openfold/.venv/bin/ds_report", line 3, in <module>
from deepspeed.env_report import cli_main
File "/home/user/git/openfold/.venv/lib/python3.12/site-packages/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/home/user/git/openfold/.venv/lib/python3.12/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
from ..git_version_info import compatible_ops as __compatible_ops__
File "/home/user/git/openfold/.venv/lib/python3.12/site-packages/deepspeed/git_version_info.py", line 29, in <module>
op_compatible = builder.is_compatible()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/git/openfold/.venv/lib/python3.12/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 35, in is_compatible
sys_cuda_major, _ = installed_cuda_version()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/git/openfold/.venv/lib/python3.12/site-packages/deepspeed/ops/op_builder/builder.py", line 51, in installed_cuda_version
raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
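To rule out the PyTorch wheel itself, it may help to confirm which backend the installed build targets before DeepSpeed is imported at all. This is a minimal sketch, not part of DeepSpeed: the helper name torch_backend is made up for illustration, and it relies only on torch.version.hip being set on ROCm builds and torch.version.cuda being set on CUDA builds.

```python
def torch_backend(version_ns):
    """Classify a torch.version-like object as 'rocm', 'cuda', or 'cpu'.

    Hypothetical helper: it only inspects the `hip` and `cuda` attributes.
    """
    if getattr(version_ns, "hip", None):
        return "rocm"   # ROCm wheels set torch.version.hip to a version string
    if getattr(version_ns, "cuda", None):
        return "cuda"   # CUDA wheels set torch.version.cuda instead
    return "cpu"        # neither attribute is set on CPU-only wheels


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch", torch.__version__, "->", torch_backend(torch.version))
```

If this reports rocm inside the venv, the PyTorch install is presumably fine and the failure lies in how DeepSpeed probes for CUDA.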
@BSchilperoort - we can work on getting updated instructions in the Readme.
The latest 0.17.5 release is broken for ROCm and we need to push a new update. I'll get to that shortly.
In the meantime, could you build from the latest source with this change applied and see whether it builds locally?
https://github.com/deepspeedai/DeepSpeed/pull/7521
git clone https://github.com/deepspeedai/DeepSpeed.git
cd DeepSpeed/
python3 -m venv .venv
source .venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.4
pip install .
ds_report
ds_report still returns the same error.
I don't see any instructions for building for ROCm from source, so I used the command from https://github.com/deepspeedai/DeepSpeed/issues/7565
pip install build
DS_ACCELERATOR=cuda LD_LIBRARY_PATH=/opt/rocm-6.4.3/lib PATH=$PATH:/opt/rocm-6.4.3/bin DS_BUILD_SPARSE_ATTN=0 NCCL_DEBUG=INFO DS_BUILD_OPS=1 DS_BUILD_STRING="+rocm" ./install.sh
This returns ModuleNotFoundError: No module named 'dskernels'
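A quick way to see which of the relevant modules are visible to the current interpreter, without importing deepspeed and hitting the error above, is sketched below. The helper name missing_modules and the list of names checked are assumptions for illustration; which pip package provides dskernels on ROCm is not stated in this thread.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located.

    find_spec only locates a top-level module; it does not execute it,
    so a package that fails at import time is still reported as present.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Illustrative list: the modules mentioned in this thread.
    wanted = ["torch", "deepspeed", "dskernels"]
    missing = missing_modules(wanted)
    print("missing:", ", ".join(missing) if missing else "none")
```

Running it inside the same venv used for ./install.sh shows whether dskernels is absent from that environment or only from the build step.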
Please see my comment https://github.com/deepspeedai/DeepSpeed/issues/7565#issuecomment-3306953101 for a working docker setup.
Using that Dockerfile (with the GPU arch replaced by gfx1101) and then running this:
sudo docker run --rm -ti --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --ipc=host deepspeed bash
root@d779a3a67ae8:~# pip install deepspeed
root@d779a3a67ae8:~# ds_report
Makes ds_report pass, correctly detecting ROCm and the available GPU memory.
Now I need to get this to work outside of Docker, as I need to work on a package that uses openfold, which depends on deepspeed.