[BUG] `import deepspeed` crashes on `deepspeed==0.16.3` with `triton==3.2.0` on CPU machine
Describe the bug
- deepspeed uses the @triton.autotune decorator, which causes the autotuner to be initialized when `import deepspeed` happens.
- In triton 3.2.0, logic was added to the autotuner that leads to a check for `torch.cuda.is_available()` in the autotuner constructor.

Before this triton update, it was safe to import deepspeed on a CPU machine.
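For context, the failure can be reproduced without deepspeed at all. The sketch below is hypothetical (the kernel does nothing and its name is made up), but on triton==3.2.0 on a machine with no GPU driver, the decoration alone should raise the same RuntimeError, because the Autotuner constructor asks triton for an active driver:

```python
# A minimal sketch, assuming triton==3.2.0 on a CPU-only machine:
# applying @triton.autotune at module scope constructs an Autotuner,
# which asks the runtime for an active driver and raises
# "0 active drivers ([]). There should only be one."
import triton
import triton.language as tl

@triton.autotune(configs=[triton.Config({"BLOCK": 64})], key=["n"])
@triton.jit
def _dummy_kernel(x_ptr, n, BLOCK: tl.constexpr):  # hypothetical kernel
    pass
```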
To Reproduce
Running `import deepspeed` on a CPU machine leads to the following error message:
>>> import deepspeed
[2025-02-12 18:28:06,516] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-12 18:28:06,530] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 11, in <module>
from . import transformer
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/__init__.py", line 7, in <module>
from .inference.config import DeepSpeedInferenceConfig
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/__init__.py", line 7, in <module>
from ....model_implementations.transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/model_implementations/__init__.py", line 6, in <module>
from .transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 18, in <module>
from deepspeed.ops.transformer.inference.triton.mlp import TritonMLP
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/__init__.py", line 10, in <module>
from .ops import *
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/ops.py", line 6, in <module>
import deepspeed.ops.transformer.inference.triton.matmul_ext as matmul_ext
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 10, in <module>
import deepspeed.ops.transformer.inference.triton.triton_matmul_kernel as triton_matmul_kernel
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/triton_matmul_kernel.py", line 120, in <module>
def _fp_matmul(
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 368, in decorator
return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 130, in __init__
self.do_bench = driver.active.get_benchmarker()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 8, in _create_driver
raise RuntimeError(f"{len(actives)} active drivers ({actives}). There should only be one.")
RuntimeError: 0 active drivers ([]). There should only be one.
Expected behavior
`import deepspeed` on a CPU machine should not crash.
ds_report output
@hongpeng-guo, thanks for reporting this. However, I think something else is going on, since:
- I am unable to repro this issue using DeepSpeed master. Please see below.
- There is a guard to avoid triton import on cpu accelerator
Please share the result of your ds_report
[2025-02-12 22:57:55,224] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-12 22:57:55,231] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/tjruwase/py_venv/torch_2_cpu/lib/python3.12/site-packages/torch']
torch version .................... 2.4.1+cpu
deepspeed install path ........... ['/home/tjruwase/py_venv/torch_2_cpu/lib/python3.12/site-packages/deepspeed']
deepspeed info ................... 0.16.4+079de6bd, 079de6bd, master
deepspeed wheel compiled w. ...... torch 0.0
shared memory (/dev/shm) size .... 125.77 GB
(torch_2_cpu) tjruwase@IronLambda:~/projects/DeepSpeed/public/master$ python
Python 3.12.3 (main, Jan 17 2025, 18:03:48) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import deepspeed
[2025-02-12 22:58:03,304] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-12 22:58:03,312] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2025-02-12 22:58:03,403] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-12 22:58:03,405] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
>>> from deepspeed import get_accelerator
>>> get_accelerator()._name
'cpu'
>>>
@tjruwase Thanks for your prompt reply. I think it is a joint issue between DeepSpeed and triton. The issue happens with triton==3.2.0, the latest version of triton. Could you upgrade triton in your environment and try again? Let me see how to run ds_report on my side. I think the existing guard looks good; adding another one on the .ops import path might solve the problem.
@tjruwase btw, here is what I saw when running ds_report on a pure CPU node. It can be reproduced with triton==3.2.0. I don't think this was originally a bug in deepspeed; rather, a recent change in triton makes the deepspeed import fail on a pure CPU node.
(base) ray@ip-10-0-6-136:~/default$ ds_report
[2025-02-13 18:01:48,138] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-13 18:01:48,156] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ds_report", line 3, in <module>
from deepspeed.env_report import cli_main
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 11, in <module>
from . import transformer
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/__init__.py", line 7, in <module>
from .inference.config import DeepSpeedInferenceConfig
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/__init__.py", line 7, in <module>
from ....model_implementations.transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/model_implementations/__init__.py", line 6, in <module>
from .transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 18, in <module>
from deepspeed.ops.transformer.inference.triton.mlp import TritonMLP
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/__init__.py", line 10, in <module>
from .ops import *
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/ops.py", line 6, in <module>
import deepspeed.ops.transformer.inference.triton.matmul_ext as matmul_ext
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 10, in <module>
import deepspeed.ops.transformer.inference.triton.triton_matmul_kernel as triton_matmul_kernel
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/triton_matmul_kernel.py", line 120, in <module>
def _fp_matmul(
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 368, in decorator
return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 130, in __init__
self.do_bench = driver.active.get_benchmarker()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/home/ray/anaconda3/lib/python3.9/site-packages/triton/runtime/driver.py", line 8, in _create_driver
raise RuntimeError(f"{len(actives)} active drivers ({actives}). There should only be one.")
RuntimeError: 0 active drivers ([]). There should only be one.
@hongpeng-guo - I was not able to repro this on a GPU node:
annotated-types 0.7.0
deepspeed 0.16.4
einops 0.8.1
filelock 3.13.1
fsspec 2024.6.1
hjson 3.1.0
Jinja2 3.1.3
MarkupSafe 2.1.5
mpmath 1.3.0
msgpack 1.1.0
networkx 3.3
ninja 1.11.1.4
numpy 2.1.2
nvidia-cublas-cu12 12.6.4.1
nvidia-cuda-cupti-cu12 12.6.80
nvidia-cuda-nvrtc-cu12 12.6.77
nvidia-cuda-runtime-cu12 12.6.77
nvidia-cudnn-cu12 9.5.1.17
nvidia-cufft-cu12 11.3.0.4
nvidia-curand-cu12 10.3.7.77
nvidia-cusolver-cu12 11.7.1.2
nvidia-cusparse-cu12 12.5.4.2
nvidia-cusparselt-cu12 0.6.3
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.6.85
nvidia-nvtx-cu12 12.6.77
packaging 24.2
pillow 11.0.0
pip 23.0.1
psutil 7.0.0
py-cpuinfo 9.0.0
pydantic 2.11.0
pydantic_core 2.33.0
setuptools 65.5.0
sympy 1.13.1
torch 2.6.0+cu126
torchaudio 2.6.0+cu126
torchvision 0.21.0+cu126
tqdm 4.67.1
triton 3.2.0
typing_extensions 4.12.2
typing-inspection 0.4.0
I installed torch first, then triton, then deepspeed. I get no errors when installing or running ds_report.
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] FP Quantizer is using an untested triton version (3.2.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
gcc -pthread -B /opt/conda/envs/ptca/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/envs/ptca/include -fPIC -O2 -isystem /opt/conda/envs/ptca/include -fPIC -c /tmp/tmphvvl2dz4/test.c -o /tmp/tmphvvl2dz4/test.o
gcc -pthread -B /opt/conda/envs/ptca/compiler_compat /tmp/tmphvvl2dz4/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmphvvl2dz4/a.out
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/scratch/azureml/cr/j/8b1e8f53b76b44119b75adb0324fadb8/exe/wd/triton/lib/python3.10/site-packages/torch']
torch version .................... 2.6.0+cu126
deepspeed install path ........... ['/scratch/azureml/cr/j/8b1e8f53b76b44119b75adb0324fadb8/exe/wd/triton/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.16.4, unknown, unknown
torch cuda version ............... 12.6
torch hip version ................ None
nvcc version ..................... 12.4
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 723.00 GB
Could you share any other ways to repro this, or does it only happen on a CPU node? I tried specifying DS_ACCELERATOR=cpu but was not able to repro it that way either.
I get no error from ds_report either, but when I try to run some scripts with the verl library and Ulysses parallelism, I hit the same error.
Hi all, thanks for all the information in this thread. Here are my findings after investigating the issue:
TL;DR:
- Upgrade to deepspeed>=0.16.4 (see the version check sketch after this list).
- If that's not possible, just pip uninstall triton manually (if you don't use it).
- If that's not possible, move your deepspeed imports inside your training function.
- If that's not possible, we're shipping a mitigation in Ray Train V2 for the next release (enable it via RAY_TRAIN_V2_ENABLED=1).
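If you're unsure which of these applies, the snippet below (a small sketch using only the standard library) checks the installed versions without importing deepspeed or triton, since the import itself is what crashes on affected setups:

```python
# Check installed versions without importing deepspeed/triton,
# because "import deepspeed" is exactly what fails on affected setups.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("deepspeed", "triton"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```

If this reports deepspeed<=0.16.3 alongside triton==3.2.0, one of the mitigations above applies.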
Issue Summary
Deepspeed has a triton dependency. Triton has a CUDA dependency. Trying to import deepspeed on a process with CUDA_VISIBLE_DEVICES unset and where triton is installed (it is by default) will raise an error because deepspeed<=0.16.3 imports triton as long as it's installed.
Minimal repro on a CPU node:
>>> import deepspeed
[2025-04-28 14:27:02,257] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-28 14:27:02,266] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/__init__.py", line 11, in <module>
from . import transformer
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/__init__.py", line 7, in <module>
from .inference.config import DeepSpeedInferenceConfig
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/__init__.py", line 7, in <module>
from ....model_implementations.transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/model_implementations/__init__.py", line 6, in <module>
from .transformers.ds_transformer import DeepSpeedTransformerInference
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 18, in <module>
from deepspeed.ops.transformer.inference.triton.mlp import TritonMLP
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/triton/__init__.py", line 10, in <module>
from .ops import *
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/triton/ops.py", line 6, in <module>
import deepspeed.ops.transformer.inference.triton.matmul_ext as matmul_ext
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 10, in <module>
import deepspeed.ops.transformer.inference.triton.triton_matmul_kernel as triton_matmul_kernel
File "/home/ray/anaconda3/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/triton/triton_matmul_kernel.py", line 51, in <module>
@triton.autotune(
^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 368, in decorator
return Autotuner(fn, fn.arg_names, configs, key, reset_to_zero, restore_value, pre_hook=pre_hook,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 130, in __init__
self.do_bench = driver.active.get_benchmarker()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/driver.py", line 8, in _create_driver
raise RuntimeError(f"{len(actives)} active drivers ({actives}). There should only be one.")
RuntimeError: 0 active drivers ([]). There should only be one.
The way Ray Train gets involved: the user-defined train_fn, which captures the deepspeed import by referencing it somewhere in its body, gets deserialized on a CPU Ray actor (the Controller in Ray Train V2 / the Tune Trainable in V1) that doesn't have CUDA_VISIBLE_DEVICES set. This CPU Ray actor is the internal driver process that launches and monitors workers.
Even though the task is being run on a GPU node, the CPU Ray actor results in the same error as importing on a CPU node.
⚠️ Workaround: Move all deepspeed imports into the train function.
This mitigates the issue because deepspeed never needs to be imported on the CPU Controller actor, since it’s not captured in the pickle scope.
# import deepspeed  # Importing at module scope results in the error above
from ray.train.torch import TorchTrainer

def train_fn(config):
    import deepspeed  # Deferred import: only runs on the training workers
    ...

TorchTrainer(train_fn, ...)
✅ Deepspeed actually fixed the cpu import issue in 0.16.4
deepspeed>=0.16.4 gates the Triton import so it happens only when triton is installed and the accelerator supports it. So deepspeed won't try to import triton if `get_accelerator()` reports CPU. This fixes the import error when the detected device is CPU. See https://github.com/deepspeedai/DeepSpeed/pull/6989
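For reference, the shape of the guard is roughly as follows. This is only an illustrative sketch, not the actual DeepSpeed source (see PR #6989 for the real change); in particular, the HAS_TRITON name and the device_name() comparison are assumptions here:

```python
# Illustrative sketch of the kind of gating deepspeed>=0.16.4 performs
# (not the actual DeepSpeed code; see PR #6989 for the real change).
import importlib.util

from deepspeed.accelerator import get_accelerator

# Treat triton as usable only when it is installed AND the detected
# accelerator is CUDA, so CPU-only machines never import triton-decorated
# modules at deepspeed import time.
HAS_TRITON = (
    importlib.util.find_spec("triton") is not None
    and get_accelerator().device_name() == "cuda"
)

if HAS_TRITON:
    # Safe to import modules whose @triton.autotune decorators run at
    # import time and probe for an active GPU driver.
    from deepspeed.ops.transformer.inference.triton import ops  # noqa: F401
```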