DeepSpeed [BUG] RuntimeError: Error building extension 'inference_core_ops'

Describe the bug

I am trying to run the non-persistent example for mistralai/Mistral-7B-Instruct-v0.3 on an RTX A6000 GPU (on a server), so the compute capability requirement is met. The OS is Ubuntu 22.04 and the system CUDA toolkit is 11.5. I am not a sudoer on the server, so I cannot upgrade the toolkit; instead I created a conda environment and installed CUDA toolkit 11.8 in it. Running python3 pipeline.py fails with: RuntimeError: Error building extension 'inference_core_ops'

To Reproduce Steps to reproduce the behavior:

conda create -n my_env python=3.12.4 cudatoolkit=11.8
pip install deepspeed-mii (in the conda environment with CUDA toolkit 11.8)
https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/mii/non-persistent/pipeline.py
What packages are required and their versions: NVIDIA GPU(s) with compute capability of: 8.0, 8.6, 8.9, 9.0. CUDA 11.6+ Ubuntu 20+
python3 pipeline.py or deepspeed --num_gpus 1 --no_local_rank pipeline.py

ds_report output

(deep) (base) cpatil@meherangarh:/data1/cpatil/simplismart$ ds_report
[2024-09-10 13:49:31,073] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
[WARNING] FP Quantizer is using an untested triton version (2.2.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c /tmp/tmphfej6tlu/test.c -o /tmp/tmphfej6tlu/test.o
x86_64-linux-gnu-gcc /tmp/tmphfej6tlu/test.o -L/usr -lcufile -o /tmp/tmphfej6tlu/a.out
/usr/bin/ld: cannot find -lcufile: No such file or directory
collect2: error: ld returned 1 exit status
gds .................... [NO] ....... [NO]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/cpatil/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2+cu121
deepspeed install path ........... ['/home/cpatil/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.15.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 503.87 GB
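The ds_report output above lists torch and deepspeed under /home/cpatil/.local/lib/python3.10, while the failing build below runs from the conda env's python3.12 and calls /usr/bin/nvcc. A minimal sketch for checking which interpreter and which nvcc the current shell would pick up is shown here. The helper names `nvcc_inside_env` and `describe_toolchain` are hypothetical, and a PATH lookup is only one of the ways a build may locate nvcc, so treat the result as a hint rather than a diagnosis.

```python
import os
import shutil
import sys
from pathlib import Path


def nvcc_inside_env(nvcc_path, env_prefix):
    """Return True if nvcc_path lies under env_prefix (pure path comparison)."""
    if not nvcc_path or not env_prefix:
        return False
    try:
        Path(nvcc_path).relative_to(Path(env_prefix))
        return True
    except ValueError:
        # relative_to raises ValueError when nvcc_path is outside env_prefix.
        return False


def describe_toolchain():
    """Collect the interpreter and the nvcc found on PATH for this shell."""
    nvcc = shutil.which("nvcc")
    prefix = os.environ.get("CONDA_PREFIX")
    return {
        "python": sys.executable,
        "python_version": "%d.%d" % sys.version_info[:2],
        "conda_prefix": prefix,
        "nvcc": nvcc,
        "nvcc_in_conda_env": nvcc_inside_env(nvcc, prefix),
    }


if __name__ == "__main__":
    for key, value in describe_toolchain().items():
        print(f"{key}: {value}")
```

Running this with the same interpreter that runs pipeline.py, and again with the one that runs ds_report, would show whether the two commands see the same environment.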

Output on running the command (deep) (base) cpatil@meherangarh:/data1/cpatil/simplismart$ python3 pipeline.py [2024-09-10 13:43:54,824] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-09-10 13:43:56,883] [INFO] [comm.py:652:init_distributed] cdb=None [2024-09-10 13:43:56,884] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl Fetching 11 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 79410.23it/s] [2024-09-10 13:43:57,612] [INFO] [engine_v2.py:82:init] Building model... Using /home/cpatil/.cache/torch_extensions/py312_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/cpatil/.cache/torch_extensions/py312_cu121/inference_core_ops/build.ninja... /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module inference_core_ops... Allowing ninja to set a default number of workers... 
(overridable by setting the environment variable MAX_JOBS=N) [1/2] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output linear_kernels_cuda.cuda.o.d -DTORCH_EXTENSION_NAME=inference_core_ops -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/bias_activations -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/blas_kernels -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/gated_activations -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/includes -isystem /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/include -isystem /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/include/TH -isystem /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/include/THC -isystem /home/cpatil/miniconda3/envs/deep/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ 
--threads=8 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu -o linear_kernels_cuda.cuda.o FAILED: linear_kernels_cuda.cuda.o /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output linear_kernels_cuda.cuda.o.d -DTORCH_EXTENSION_NAME=inference_core_ops -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/bias_activations -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/blas_kernels -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/gated_activations -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear -I/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/includes -isystem /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/include -isystem /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/include/TH -isystem /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/include/THC -isystem /home/cpatil/miniconda3/envs/deep/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS 
-D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu -o linear_kernels_cuda.cuda.o /home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_mma.cuh(59): warning #174-D: expression has no effect

/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_mma.cuh(135): warning #174-D: expression has no effect

/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh(33): warning #174-D: expression has no effect

/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh(44): warning #174-D: expression has no effect

/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh(56): warning #174-D: expression has no effect

/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh(70): warning #174-D: expression has no effect

/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/kernel_matmul.cuh(268): warning #174-D: expression has no effect

/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 | function(_Functor&& __f)
      | ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 | operator=(_Functor&& __f)
      | ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
ninja: build stopped: subcommand failed.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
[rank0]:     subprocess.run(
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/subprocess.py", line 571, in run
[rank0]:     raise CalledProcessError(retcode, process.args,
[rank0]: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/data1/cpatil/simplismart/pipeline.py", line 12, in <module>
[rank0]:     pipe = mii.pipeline(args.model)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/mii/api.py", line 231, in pipeline
[rank0]:     inference_engine = load_model(model_config)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/mii/modeling/models.py", line 17, in load_model
[rank0]:     inference_engine = build_hf_engine(
[rank0]:                        ^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/engine_factory.py", line 135, in build_hf_engine
[rank0]:     return InferenceEngineV2(policy, engine_config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/engine_v2.py", line 83, in __init__
[rank0]:     self._model = self._policy.build_model(self._config, self._base_mp_group)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/model_implementations/inference_policy_base.py", line 156, in build_model
[rank0]:     self.model = self.instantiate_model(engine_config, mp_group)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/model_implementations/mistral/policy.py", line 17, in instantiate_model
[rank0]:     return MistralInferenceModel(config=self._model_config, engine_config=engine_config, base_mp_group=mp_group)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 215, in __init__
[rank0]:     self.make_norm_layer()
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 518, in make_norm_layer
[rank0]:     self.norm = heuristics.instantiate_pre_norm(norm_config, self._engine_config)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/modules/heuristics.py", line 176, in instantiate_pre_norm
[rank0]:     return DSPreNormRegistry.instantiate_config(config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/modules/module_registry.py", line 36, in instantiate_config
[rank0]:     if not target_implementation.supports_config(config_bundle.config):
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/modules/implementations/pre_norm/cuda_pre_rms.py", line 36, in supports_config
[rank0]:     _ = CUDARMSPreNorm(config.channels, config.residual_dtype)
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm/rms_norm_base.py", line 36, in __init__
[rank0]:     self.inf_module = InferenceCoreBuilder().load()
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
[rank0]:     return self.jit_load(verbose)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
[rank0]:     op_module = load(name=self.name,
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1312, in load
[rank0]:     return _jit_compile(
[rank0]:            ^^^^^^^^^^^^^
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
[rank0]:     _write_ninja_file_and_build_library(
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1834, in _write_ninja_file_and_build_library
[rank0]:     _run_ninja_build(
[rank0]:   File "/home/cpatil/miniconda3/envs/deep/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
[rank0]:     raise RuntimeError(message) from e
[rank0]: RuntimeError: Error building extension 'inference_core_ops'
[rank0]:[W910 13:45:07.069688599 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

System info (please complete the following information):

OS: Ubuntu 22.04
GPU count and types: Single GPU, RTX A6000, 48GB (Compute Capability 8.6)
DeepSpeed-MII version: 0.3.0
Python version: 3.12.4

Additional context

I am running the pipeline.py script on a server whose system CUDA toolkit is version 11.5. Since I am not a sudoer, I created a conda env with toolkit version 11.8 instead.

Sep 10 '24 08:09 Chetan3200

Hi @Chetan3200 - thanks for the report. A question, if you try to just run the following in your conda env:

DS_BUILD_OPS=1 pip install deepspeed

Do you get this same error?

Same question for

DS_BUILD_INFERENCE_CORE_OPS=1 pip install deepspeed

I suspect both will fail, but it would be good to know. Either way, the error appears to be "/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:". We do have an A6000 to test on, so I will try there, but this looks more likely to be a CUDA/C++ version issue than a DeepSpeed one.

Oct 09 '24 17:10 loadams
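One way to gather evidence for the CUDA/C++ version hypothesis is to record which nvcc release and which GCC major version the build actually sees. In the log above, /usr/bin/nvcc compiles against headers in /usr/include/c++/11, and ds_report lists nvcc 11.5, which is below the "CUDA 11.6+" requirement quoted in the report. Whether that pairing is the cause is not confirmed anywhere in this thread. The sketch below only parses version strings; the function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_gcc_major(text):
    """Extract the major version from `gcc -dumpversion` output, or None."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None


def run_or_none(cmd):
    """Run cmd and return its stdout, or None if the executable is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    nvcc_out = run_or_none(["nvcc", "--version"])
    gcc_out = run_or_none(["gcc", "-dumpversion"])
    print("nvcc release:", parse_nvcc_release(nvcc_out) if nvcc_out else "not found")
    print("gcc major:", parse_gcc_major(gcc_out) if gcc_out else "not found")
```

Posting both values alongside the build log would let a maintainer compare them with the setup on which the issue could not be reproduced.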

Hi @Chetan3200 - curious if you are still hitting this issue?

Oct 31 '24 17:10 loadams

Hi @Chetan3200 - closing this issue for now. I'm not able to repro this on our hardware.

Nov 15 '24 22:11 loadams

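For the DeepSpeed error [Unable to build extension "transformer_inference"], there should be only two causes:
1. The conda or Python environment was installed by the root user under the /root directory. Change the environment configuration to set TORCH_EXTENSIONS_DIR=/tmp, because /tmp has less strict permission requirements than /root.
2. DeepSpeed installed in a conda virtual environment has a bug: the CUDA include path and library path are not set, so they have to be specified manually:
export CPATH=$CONDA_PREFIX/targets/x86_64-linux/include:$CPATH
export LD_LIBRARY_PATH=$CONDA_PREFIX/targets/x86_64-linux/lib:$LD_LIBRARY_PATH

I have run into this twice and fixed it both times; these were the two causes.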

Oct 08 '25 03:10 tiger3927
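The two workarounds from the comment above can also be applied from Python when launching the script, rather than exported in the shell. This is a sketch under assumptions: the function name `conda_build_env` is hypothetical, the variable names and paths are taken from the comment, and nothing here is verified to fix the 'inference_core_ops' failure reported at the top of the thread (the comment refers to "transformer_inference").

```python
import os
import subprocess
import sys


def conda_build_env(base_env, conda_prefix, ext_dir="/tmp"):
    """Return a copy of base_env with the variables from the comment applied."""
    env = dict(base_env)
    target = f"{conda_prefix}/targets/x86_64-linux"

    def prepend(new_path, existing):
        # Unlike the plain shell export, avoid a trailing ':' when unset.
        return f"{new_path}:{existing}" if existing else new_path

    env["TORCH_EXTENSIONS_DIR"] = ext_dir
    env["CPATH"] = prepend(f"{target}/include", env.get("CPATH"))
    env["LD_LIBRARY_PATH"] = prepend(f"{target}/lib", env.get("LD_LIBRARY_PATH"))
    return env


def run_pipeline(script="pipeline.py"):
    """Launch the script in a child process so the new variables take effect."""
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix is None:
        raise RuntimeError("CONDA_PREFIX is not set; activate the conda env first")
    env = conda_build_env(os.environ, prefix)
    return subprocess.run([sys.executable, script], env=env, check=False).returncode
```

A child process is used because changing LD_LIBRARY_PATH inside an already running interpreter does not change how that process resolves shared libraries. Calling `run_pipeline()` from the activated env would be the equivalent of exporting the variables and then running python3 pipeline.py.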
