[BUG] `import deepspeed` crashes on `deepspeed==0.16.3` with `triton==3.2.0` on CPU machine
Describe the bug
- deepspeed uses the @triton.autotune decorator, which causes the autotuner to be initialized when `import deepspeed` happens.
- In triton 3.2.0, logic was added to the autotuner that leads to a check for `torch.cuda.is_available()` in the autotuner constructor.

Before this triton update, it was safe to import deepspeed on a CPU machine.
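For context, the failure can be reproduced without deepspeed at all. The sketch below is hypothetical (the kernel does nothing and its name is made up), but on triton==3.2.0 on a machine with no GPU driver, the decoration alone should raise the same RuntimeError, because the Autotuner constructor asks triton for an active driver:

```python
# A minimal sketch, assuming triton==3.2.0 on a CPU-only machine:
# applying @triton.autotune at module scope constructs an Autotuner,
# which asks the runtime for an active driver and raises
# "0 active drivers ([]). There should only be one."
import triton
import triton.language as tl

@triton.autotune(configs=[triton.Config({"BLOCK": 64})], key=["n"])
@triton.jit
def _dummy_kernel(x_ptr, n, BLOCK: tl.constexpr):  # hypothetical kernel
    pass
```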
To Reproduce
Running `import deepspeed` on a CPU machine leads to the following error message:
>>> import deepspeed
[2025-02-12 18:28:06,516] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-12 18:28:06,530] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 11, in <module>
from . import transformer
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/__init__.py", line 7, in <module>
from .inference.config import DeepSpeedInferenceConfig
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/__init__.py", line 7, in <module>
from ....model_implementations.transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/model_implementations/__init__.py", line 6, in <module>
from .transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 18, in <module>
from deepspeed.ops.transformer.inference.triton.mlp import TritonMLP
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/__init__.py", line 10, in <module>
from .ops import *
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/ops.py", line 6, in <module>
import deepspeed.ops.transformer.inference.triton.matmul_ext as matmul_ext
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 10, in <module>
import deepspeed.ops.transformer.inference.triton.triton_matmul_kernel as triton_matmul_kernel
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/triton_matmul_kernel.py", line 120, in <module>
def _fp_matmul(
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 368, in decorator
return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 130, in __init__
self.do_bench = driver.active.get_benchmarker()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 8, in _create_driver
raise RuntimeError(f"{len(actives)} active drivers ({actives}). There should only be one.")
RuntimeError: 0 active drivers ([]). There should only be one.
Expected behavior
`import deepspeed` on a CPU machine should not crash.
ds_report output
@hongpeng-guo, thanks for reporting this. However, I think something else is going on, since:
- I am unable to repro this issue using DeepSpeed master. Please see below.
- There is a guard to avoid triton import on cpu accelerator
Please share the result of your ds_report
[2025-02-12 22:57:55,224] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-12 22:57:55,231] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/tjruwase/py_venv/torch_2_cpu/lib/python3.12/site-packages/torch']
torch version .................... 2.4.1+cpu
deepspeed install path ........... ['/home/tjruwase/py_venv/torch_2_cpu/lib/python3.12/site-packages/deepspeed']
deepspeed info ................... 0.16.4+079de6bd, 079de6bd, master
deepspeed wheel compiled w. ...... torch 0.0
shared memory (/dev/shm) size .... 125.77 GB
(torch_2_cpu) tjruwase@IronLambda:~/projects/DeepSpeed/public/master$ python
Python 3.12.3 (main, Jan 17 2025, 18:03:48) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import deepspeed
[2025-02-12 22:58:03,304] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-12 22:58:03,312] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2025-02-12 22:58:03,403] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-12 22:58:03,405] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
>>> from deepspeed import get_accelerator
>>> get_accelerator()._name
'cpu'
>>>
@tjruwase Thanks for your prompt reply. I think it is a joint issue between DeepSpeed and triton. The issue happens with triton==3.2.0, the latest version of triton. Could you upgrade triton in your environment and try again? Let me see how to run ds_report on my side. I think the existing guard looks good; adding another one on the .ops import path might solve the problem.
@tjruwase btw, here is what I saw when running ds_report on a pure CPU node. It can be reproduced with triton==3.2.0. I don't think this was originally a bug in deepspeed; rather, a recent change in triton makes the deepspeed import fail on a pure CPU node.
(base) ray@ip-10-0-6-136:~/default$ ds_report
[2025-02-13 18:01:48,138] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-13 18:01:48,156] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ds_report", line 3, in <module>
from deepspeed.env_report import cli_main
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 11, in <module>
from . import transformer
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/__init__.py", line 7, in <module>
from .inference.config import DeepSpeedInferenceConfig
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/__init__.py", line 7, in <module>
from ....model_implementations.transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/model_implementations/__init__.py", line 6, in <module>
from .transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 18, in <module>
from deepspeed.ops.transformer.inference.triton.mlp import TritonMLP
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/__init__.py", line 10, in <module>
from .ops import *
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/ops.py", line 6, in <module>
import deepspeed.ops.transformer.inference.triton.matmul_ext as matmul_ext
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 10, in <module>
import deepspeed.ops.transformer.inference.triton.triton_matmul_kernel as triton_matmul_kernel
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/triton_matmul_kernel.py", line 120, in <module>
def _fp_matmul(
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 368, in decorator
return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 130, in __init__
self.do_bench = driver.active.get_benchmarker()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 8, in _create_driver
raise RuntimeError(f"{len(actives)} active drivers ({actives}). There should only be one.")
RuntimeError: 0 active drivers ([]). There should only be one.
@hongpeng-guo - I was not able to repro this on a GPU node:
annotated-types 0.7.0
deepspeed 0.16.4
einops 0.8.1
filelock 3.13.1
fsspec 2024.6.1
hjson 3.1.0
Jinja2 3.1.3
MarkupSafe 2.1.5
mpmath 1.3.0
msgpack 1.1.0
networkx 3.3
ninja 1.11.1.4
numpy 2.1.2
nvidia-cublas-cu12 12.6.4.1
nvidia-cuda-cupti-cu12 12.6.80
nvidia-cuda-nvrtc-cu12 12.6.77
nvidia-cuda-runtime-cu12 12.6.77
nvidia-cudnn-cu12 9.5.1.17
nvidia-cufft-cu12 11.3.0.4
nvidia-curand-cu12 10.3.7.77
nvidia-cusolver-cu12 11.7.1.2
nvidia-cusparse-cu12 12.5.4.2
nvidia-cusparselt-cu12 0.6.3
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.6.85
nvidia-nvtx-cu12 12.6.77
packaging 24.2
pillow 11.0.0
pip 23.0.1
psutil 7.0.0
py-cpuinfo 9.0.0
pydantic 2.11.0
pydantic_core 2.33.0
setuptools 65.5.0
sympy 1.13.1
torch 2.6.0+cu126
torchaudio 2.6.0+cu126
torchvision 0.21.0+cu126
tqdm 4.67.1
triton 3.2.0
typing_extensions 4.12.2
typing-inspection 0.4.0
I installed torch first, then triton, then deepspeed. I get no errors when installing or running ds_report.
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] FP Quantizer is using an untested triton version (3.2.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
gcc -pthread -B /opt/conda/envs/ptca/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/envs/ptca/include -fPIC -O2 -isystem /opt/conda/envs/ptca/include -fPIC -c /tmp/tmphvvl2dz4/test.c -o /tmp/tmphvvl2dz4/test.o
gcc -pthread -B /opt/conda/envs/ptca/compiler_compat /tmp/tmphvvl2dz4/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmphvvl2dz4/a.out
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/scratch/azureml/cr/j/8b1e8f53b76b44119b75adb0324fadb8/exe/wd/triton/lib/python3.10/site-packages/torch']
torch version .................... 2.6.0+cu126
deepspeed install path ........... ['/scratch/azureml/cr/j/8b1e8f53b76b44119b75adb0324fadb8/exe/wd/triton/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.16.4, unknown, unknown
torch cuda version ............... 12.6
torch hip version ................ None
nvcc version ..................... 12.4
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 723.00 GB
Could you share any other ways to repro this, or does it only happen on a CPU node? I tried specifying DS_ACCELERATOR=cpu but was not able to repro it that way either.
I get no error from ds_report either, but when I try to run some scripts with the verl library and Ulysses parallelism, I hit the same error.
Hi all, thanks for all the information in this thread. Here are my findings after investigating the issue:
TL;DR:
- Upgrade to deepspeed>=0.16.4 (see the version check sketch after this list).
- If that's not possible, just pip uninstall triton manually (if you don't use it).
- If that's not possible, move your deepspeed imports inside your training function.
- If that's not possible, we're shipping a mitigation in Ray Train V2 for the next release (enable it via RAY_TRAIN_V2_ENABLED=1).
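If you're unsure which of these applies, the snippet below (a small sketch using only the standard library) checks the installed versions without importing deepspeed or triton, since the import itself is what crashes on affected setups:

```python
# Check installed versions without importing deepspeed/triton,
# because "import deepspeed" is exactly what fails on affected setups.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("deepspeed", "triton"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```

If this reports deepspeed<=0.16.3 alongside triton==3.2.0, one of the mitigations above applies.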
Issue Summary
Deepspeed has a triton dependency. Triton has a CUDA dependency. Trying to import deepspeed on a process with CUDA_VISIBLE_DEVICES unset and where triton is installed (it is by default) will raise an error because deepspeed<=0.16.3 imports triton as long as it's installed.
Minimal repro on a CPU node:
>>> import deepspeed
[2025-04-28 14:27:02,257] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-28 14:27:02,266] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/__init__.py", line 11, in <module>
from . import transformer
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/__init__.py", line 7, in <module>
from .inference.config import DeepSpeedInferenceConfig
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/__init__.py", line 7, in <module>
from ....model_implementations.transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/model_implementations/__init__.py", line 6, in <module>
from .transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 18, in <module>
from deepspeed.ops.transformer.inference.triton.mlp import TritonMLP
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/triton/__init__.py", line 10, in <module>
from .ops import *
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/triton/ops.py", line 6, in <module>
import deepspeed.ops.transformer.inference.triton.matmul_ext as matmul_ext
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 10, in <module>
import deepspeed.ops.transformer.inference.triton.triton_matmul_kernel as triton_matmul_kernel
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/triton/triton_matmul_kernel.py", line 51, in <module>
@triton.autotune(
^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 368, in decorator
return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 130, in __init__
self.do_bench = driver.active.get_benchmarker()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/driver.py", line 8, in _create_driver
raise RuntimeError(f"{len(actives)} active drivers ({actives}). There should only be one.")
RuntimeError: 0 active drivers ([]). There should only be one.
The way Ray Train gets involved: the user-defined train_fn, which captures the deepspeed import by referencing it somewhere in its body, gets deserialized on a CPU Ray actor (the Controller in Ray Train V2 / the Tune Trainable in V1) that doesn't have CUDA_VISIBLE_DEVICES set. This CPU Ray actor is the internal driver process that launches and monitors workers.
Even though the task is being run on a GPU node, the CPU Ray actor results in the same error as importing on a CPU node.
⚠️ Workaround: Move all deepspeed imports into the train function.
This mitigates the issue because deepspeed never needs to be imported on the CPU Controller actor, since it’s not captured in the pickle scope.
# import deepspeed  # Importing at module scope results in the error above
from ray.train.torch import TorchTrainer

def train_fn(config):
    import deepspeed  # Deferred import: only runs on the training workers
    ...

TorchTrainer(train_fn, ...)
✅ Deepspeed actually fixed the cpu import issue in 0.16.4
deepspeed>=0.16.4 gates the Triton import so it happens only when triton is installed and the accelerator supports it. So deepspeed won't try to import triton if `get_accelerator()` reports CPU. This fixes the import error when the detected device is CPU. See https://github.com/deepspeedai/DeepSpeed/pull/6989
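For reference, the shape of the guard is roughly as follows. This is only an illustrative sketch, not the actual DeepSpeed source (see PR #6989 for the real change); in particular, the HAS_TRITON name and the device_name() comparison are assumptions here:

```python
# Illustrative sketch of the kind of gating deepspeed>=0.16.4 performs
# (not the actual DeepSpeed code; see PR #6989 for the real change).
import importlib.util

from deepspeed.accelerator import get_accelerator

# Treat triton as usable only when it is installed AND the detected
# accelerator is CUDA, so CPU-only machines never import triton-decorated
# modules at deepspeed import time.
HAS_TRITON = (
    importlib.util.find_spec("triton") is not None
    and get_accelerator().device_name() == "cuda"
)

if HAS_TRITON:
    # Safe to import modules whose @triton.autotune decorators run at
    # import time and probe for an active GPU driver.
    from deepspeed.ops.transformer.inference.triton import ops  # noqa: F401
```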