mmdetection H100 GPU RuntimeError: CUDA error: no kernel image is available for execution on the device

Checklist

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

I tried to run inference with RTMDet, but after setting up the environment on my workstation with an H100 GPU, I get this error at inference time:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Reproduction

I followed these steps to create the environment:

conda create -n rtmdet-env python=3.9
conda activate rtmdet-env

pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0"
mim install mmdet
mim download mmdet --config rtmdet-ins_x_8xb16-300e_coco --dest .

Run the following in a Python terminal to reproduce the error:

from mmdet.apis import init_detector, inference_detector

config_file = 'rtmdet_tiny_8xb32-300e_coco.py'
checkpoint_file = 'rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth'
model = init_detector(config_file, checkpoint_file, device='cuda:0')  # the CUDA error only appears on a GPU device, not with device='cpu'
inference_detector(model, 'demo/demo.jpg')
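
The error message recommends CUDA_LAUNCH_BLOCKING=1 so that the traceback points at the op that actually failed. A minimal sketch of setting it from Python follows; enable_sync_cuda_errors is a hypothetical helper, and the only assumption is that it runs before anything initializes CUDA (doing it before importing torch or mmdet is the safe choice).

```python
import os


def enable_sync_cuda_errors(environ=None):
    """Set CUDA_LAUNCH_BLOCKING=1 so kernel errors surface at the launching call."""
    environ = os.environ if environ is None else environ
    # The variable is read when CUDA is initialized, so set it first.
    environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return environ


enable_sync_cuda_errors()
# Import torch / mmdet only after this point, then rerun the snippet above:
# from mmdet.apis import init_detector, inference_detector
```

With the variable set, the failing frame in the traceback should name the library whose kernel is missing for this GPU.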

Did you make any modifications on the code or config? Did you understand what you have modified?
No, I did not make any modifications to the code or config.

What dataset did you use?
I only wanted to run inference on some demo images, so I did not use any dataset.

Environment

Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.9.18 (main, Sep 11 2023, 13:41:44) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA H100 PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.0.0+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.15.1+cu118
OpenCV: 4.8.1
MMEngine: 0.9.0
MMDetection: 3.2.0+fe3f809

You may add additional information that may be helpful for locating the problem, such as
- How you installed PyTorch [e.g., pip, conda, source]: pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
- Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.): I have already shared how I created the full environment.

Bug fix

I could run the same model, with the same environment creation process, on a 4090 GPU workstation. The page https://mmdetection.readthedocs.io/en/latest/get_started.html#cuda-versions has instructions for Ampere-based and older GPUs but does not mention Hopper-based GPUs. Since the bug occurs on a Hopper-based GPU, I think the most probable cause is that the new architecture is not yet supported by mmcv. Can you look into this?
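
One way to test this hypothesis is to look at which GPU architectures the PyTorch wheel itself was built for. The sketch below parses the "NVCC architecture flags" line from the collect_env output in this post. The helper names built_archs and supports are made up for illustration, and matching on an exact sm_XX target is a simplification that ignores PTX forward compatibility.

```python
import re


def built_archs(flags):
    """Return the set of sm_XX targets named in an 'NVCC architecture flags' string."""
    return set(re.findall(r"code=(sm_\d+)", flags))


def supports(flags, capability):
    """True if the flags contain a binary target for (major, minor)."""
    major, minor = capability
    return f"sm_{major}{minor}" in built_archs(flags)


# Copied from the collect_env output above (PyTorch 2.0.0+cu118).
TORCH_FLAGS = (
    "-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;"
    "-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;"
    "-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;"
    "-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90"
)

print(supports(TORCH_FLAGS, (9, 0)))  # H100 is compute capability 9.0 -> True
```

For these flags the check reports that sm_90 is present in the PyTorch build, which suggests the missing kernel image belongs to a separately compiled extension rather than to PyTorch itself.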

Oct 16 '23 07:10 mdk19015

@hiperdyne19015 Sorry, I don't have an H100, and I can't perform the test either.

Oct 16 '23 10:10 hhaAndroid

@hhaAndroid I installed PyTorch and it works properly on the H100 GPU. Is there any source code or dependency that MMDetection uses to access the GPU other than PyTorch? Do the libraries call CUDA functions directly? Can I fix it myself, or at least try to make it run on the H100 GPU? TIA
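
A quick way to separate PyTorch from the compiled extensions is to run one small mmcv CUDA op directly. This is a minimal sketch, assuming mmcv is installed with its compiled ops; smoke_test_mmcv_nms is a hypothetical helper name, and mmcv.ops.nms is used only because it is a small op backed by mmcv's own CUDA code.

```python
def smoke_test_mmcv_nms():
    """Run one tiny mmcv CUDA op and describe the outcome as a string."""
    try:
        import torch
        from mmcv.ops import nms
    except ImportError as exc:
        return f"skipped: {exc}"
    if not torch.cuda.is_available():
        return "skipped: no CUDA device visible"

    # Two overlapping boxes in (x1, y1, x2, y2) format with their scores.
    boxes = torch.tensor(
        [[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0]], device="cuda"
    )
    scores = torch.tensor([0.9, 0.8], device="cuda")
    try:
        _, keep = nms(boxes, scores, iou_threshold=0.5)
        torch.cuda.synchronize()  # force any asynchronous kernel error to surface here
    except RuntimeError as exc:
        return f"failed: {exc}"
    return f"ok: kept {keep.numel()} box(es)"


print(smoke_test_mmcv_nms())
```

If plain PyTorch tensor operations work on the GPU but this call fails with "no kernel image is available", the mmcv build is the likely culprit.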

Oct 17 '23 04:10 mdk19015

I have a similar error, but only when attempting multi-GPU distributed runs. This is my env:

sys.platform: linux
Python: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: Tesla V100-SXM2-16GB
CUDA_HOME: /home/xxxx/anaconda3/envs/universal
NVCC: Cuda compilation tools, release 11.7, V11.7.64
GCC: gcc (GCC) 10.2.0
PyTorch: 2.0.1+cu117
PyTorch compiling details: PyTorch built with:

GCC 9.3
C++ Version: 201703
Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.7
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
CuDNN 8.5
Magma 2.6.1
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.15.2+cu117
OpenCV: 4.8.1
MMEngine: 0.9.0
MMDetection: 3.2.0+fe3f809

I am running grounding_dino with the r50 and swin-b architectures. It runs fine with tools/train.py and the gpu=1 option, but when I try tools/dist_train.sh with 4 or 8 GPUs, I get these CUDA errors:

error in ms_deformable_col2im_cuda: no kernel image is available for execution on the device
error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device

Nov 08 '23 18:11 sheethalb

Hi @mdk19015 ,

Has there been any update on this? I am also trying to get BEVFusion going on H100 and am stuck.

Thanks!

Nov 30 '23 20:11 flstahl

@flstahl Sorry but there is no update on this issue from my side. I am also stuck currently.

Dec 04 '23 00:12 mdk19015

Hi @mdk19015 ,

Thanks for getting back to me. I was able to resolve the issue on my end. I am using AWS EC2 instances (p5.48xlarge), which have H100 GPUs, so the solution might be slightly more complex for you (depending on your setup).

The steps that I took to resolve it:

- Launch an EC2 instance with the Ubuntu 20.04 Deep Learning Base AMI, https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/ (this does not have any PyTorch installed)
- Switch the CUDA version from 12.1 (enabled by default with this AMI) to 11.8
- Install Anaconda
- Create a conda virtual env with Python 3.10 (I found that python>3.10 did not work)
- Build PyTorch 1.13 from source with CUDA support (following their explanation at https://github.com/pytorch/pytorch#from-source)
- Build mmcv, mmengine, mmdetection and mmdetection3d from source
- In the setup.py file inside the BEVFusion directory in mmdetection3d/projects/BEVFusion, delete rows 26-29 and instead add a new row with sm_90 to support Nvidia Hopper
- Build the BEVFusion components as described in their README

This works for me. You will still get warnings from PyTorch that the H100 is not supported, but you can ignore them. This is probably because there is no official PyTorch 1.13 build with CUDA 11.8 that supports Nvidia Hopper.

However, according to the PyTorch documentation, the Nvidia documentation and my experience, CUDA 11.8 (and above) works fine with torch 1.13.

I assume you should also be able to do this with torch 2.0 or 2.1, but I have not tried it myself yet.

Let me know if you have any questions and if this works for you!

Dec 04 '23 12:12 flstahl

Thank you @flstahl! Following your ideas, I solved this problem on my side.

Problem: error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device
Environment: H100, PyTorch==2.0.1 with CUDA 11.8, torchvision==0.15.2
Solution: download mmcv and, in setup.py, modify the line below in get_extensions():

extra_compile_args = {
            # 'nvcc': [cuda_args, '-std=c++14'] if cuda_args else ['-std=c++14'],
            'nvcc': [cuda_args, '-std=c++14', '-arch=sm_90'] if cuda_args else ['-std=c++14'],
            'cxx': ['-std=c++14'],
        }

Then compile with pip install -v -e .

Dec 04 '23 19:12 helq2612

Hi @helq2612 ,

Happy to hear that you were able to resolve the issue and that you may have found some useful hints in my experience. Good team work!

Dec 04 '23 22:12 flstahl

@helq2612 Hi, thanks for your solution. I followed your steps but failed to compile mmcv. Which mmcv version did you use?

Oct 20 '24 05:10 lanqz7766

@helq2612 Very effective, thank you!

Nov 17 '24 16:11 L1NINE

Thank you for the great information. I hope the community releases new versions of mmcv and mmdet for sm_90 GPUs.

Nov 20 '24 07:11 toshi-k

Same issue here on the H100.

Nov 24 '24 20:11 michaelmohamed

Hi @flstahl,

Thanks for sharing your great solution!

I’ve been trying to reproduce some results using the following setup:

cuda==11.8
pytorch==1.13.0
mmcv==1.7.1
mmdet3d==0.17.1

While everything works fine with a single GPU, I ran into what seems to be a deadlock issue when trying to use multiple GPUs. I suspect it might be related to GPU synchronization, but I’m not entirely sure.

Could you share the specific versions of the following libraries that you used in your setup? It would really help me align my environment with yours:

mmcv
mmengine
mmdetection
mmdetection3d

Thanks a lot for your time and help!

Nov 28 '24 14:11 TomoyaFukui

@L1NINE I can't find get_extensions(). Could you tell me your mmdetection version?

Jan 04 '25 08:01 Wild-Stephen

MMCV_CUDA_ARGS="-arch=sm_89" pip install -v -e . works for me. Perhaps we don't need to edit setup.py
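
The reason the environment variable can replace the setup.py edit is visible in the snippet quoted earlier in this thread: the 'nvcc' entry already prepends cuda_args when it is set. The sketch below mirrors that expression in isolation. It assumes, consistent with this report and the FAQ quoted later in the thread, that setup.py fills cuda_args from MMCV_CUDA_ARGS; nvcc_args_from_env is a hypothetical name, not part of mmcv.

```python
import os


def nvcc_args_from_env(environ=None):
    """Mirror the quoted setup.py expression for the 'nvcc' compile arguments."""
    environ = os.environ if environ is None else environ
    # Assumption: setup.py reads cuda_args from the MMCV_CUDA_ARGS variable.
    cuda_args = environ.get("MMCV_CUDA_ARGS")
    return [cuda_args, "-std=c++14"] if cuda_args else ["-std=c++14"]


print(nvcc_args_from_env({"MMCV_CUDA_ARGS": "-arch=sm_90"}))
print(nvcc_args_from_env({}))
```

Note that sm_89 corresponds to Ada-generation GPUs, so on an H100 the value would presumably be -arch=sm_90, as in the earlier posts.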

Feb 21 '25 01:02 zakki

@helq2612 Which versions of mmcv, mmdetection, and mmdetection3d did you use?

Sep 08 '25 18:09 alexngUNC

I have found the solution for my case (not one of the above), so I hope this might be useful.

In my case, the traceback shows that the error comes from the mmcv library.

python demo/image_demo.py demo/demo.jpg rtmdet-s
Traceback (most recent call last):
  File "demo/image_demo.py", line 192, in <module>
    main()
... ...
  File "/home/myname/miniconda3/envs/mmdetection/lib/python3.8/site-packages/mmcv/ops/nms.py", line 27, in forward
    inds = ext_module.nms(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

So I suspected that the precompiled mmcv package does not support sm_90 (H100 GPU) and that I needed to install it from source, with an extra setting telling the build to target sm_90. What worked for me:

MMCV_CUDA_ARGS='-gencode=arch=compute_90,code=sm_90' pip install mmcv==2.1.* --no-binary mmcv

Voila! Hope this helps

My solution was inspired by an MMDetection FAQ

Temporary work-around: do MMCV_WITH_OPS=1 MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80' pip install -e ... This work-around modifies the compile flag by adding MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80', which tells nvcc to optimize for sm_80
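
The same pattern can be generated for whichever GPU is installed instead of hard-coding the architecture. The following is a rough sketch under stated assumptions: gencode_flag and mmcv_build_command are hypothetical helpers, the package spec is the one from the command above, and the capability tuple would come from something like torch.cuda.get_device_capability().

```python
import os
import shlex


def gencode_flag(capability):
    """Build the nvcc -gencode flag for a (major, minor) compute capability."""
    major, minor = capability
    cc = f"{major}{minor}"
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


def mmcv_build_command(capability, spec="mmcv==2.1.*"):
    """Return (env, argv) for a from-source mmcv install targeting one GPU."""
    env = dict(os.environ, MMCV_WITH_OPS="1", MMCV_CUDA_ARGS=gencode_flag(capability))
    argv = ["pip", "install", spec, "--no-binary", "mmcv"]
    return env, argv


# (9, 0) is the H100's compute capability.
env, argv = mmcv_build_command((9, 0))
print(f"MMCV_CUDA_ARGS={shlex.quote(env['MMCV_CUDA_ARGS'])}", shlex.join(argv))
```

The returned pair can be handed to subprocess.run(argv, env=env, check=True) to perform the build; the sketch only prints the command so that it can be inspected first.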

Oct 27 '25 09:10 nguyenthekhoig7