
Reintroduce support for Compute Capability 5.0

giuliopaci opened this issue 1 year ago • 5 comments

Until recently I was using CTranslate2 with CUDA 11 on a GTX 960M.

Recently I tried to update to CUDA 12, which still supports the GTX 960M, together with CTranslate2.

I expected the update to work, since the documentation still reports compatibility with Compute Capability 3.5 (https://opennmt.net/CTranslate2/hardware_support.html) and I had no major issues updating PyTorch. Unfortunately, this was not the case: I started receiving the following error:

RuntimeError: parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device

Apparently this is a common issue; I found related reports from other users of GTX 9xx GPUs:

https://github.com/SYSTRAN/faster-whisper/issues/806
https://forums.developer.nvidia.com/t/runtimeerror-parallel-for-failed-cudaerrornokernelimagefordevice-no-kernel-image-is-available-for-execution-on-the-device/291404
https://github.com/m-bain/whisperX/issues/794
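The error itself simply means the shipped wheel contains no binary (SASS) for the card's SM and no PTX the driver can JIT-compile for it. A toy pure-Python model of that dispatch rule may make it clearer (the architecture lists below are illustrative, not the actual contents of any CTranslate2 wheel):

```python
# Toy model of CUDA fatbin dispatch: a kernel image is usable on a device if
# the binary embeds SASS for that exact SM, or embeds PTX targeting an SM at
# or below the device's (PTX is forward-compatible: the driver can JIT it).
def has_kernel_image(device_sm, sass_archs, ptx_archs):
    return device_sm in sass_archs or any(p <= device_sm for p in ptx_archs)

# Hypothetical wheel built only for sm_60 and above, with compute_60 PTX:
# a Compute Capability 5.0 card finds no usable image.
print(has_kernel_image(50, sass_archs={60, 70, 80}, ptx_archs={60}))  # False

# A build that also embeds sm_50 SASS (or compute_50 PTX) would work:
print(has_kernel_image(50, sass_archs={50, 60}, ptx_archs={50}))      # True
```

This matches the symptom: the program fails only at kernel launch time, because the device is detected fine but no compatible kernel image exists for it.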

I tried to compile the code for Compute Capability 5.0, but this was not enough: some code had been introduced that requires Compute Capability 5.3, which my GPU does not support. I disabled that code and recompiled. After that I was able to run through the quickstart using cuda instead of cpu, and I was also able to run faster-whisper.
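For reference, a source build along these lines might look roughly as follows. This is a sketch, not an official recipe: `CUDA_ARCH_LIST` is the CMake cache variable CTranslate2's build uses to select target architectures, but exact options can differ between versions.

```shell
# Hypothetical out-of-tree build targeting Compute Capability 5.0 only.
git clone --recursive https://github.com/OpenNMT/CTranslate2.git
cd CTranslate2 && mkdir build && cd build
cmake .. -DWITH_CUDA=ON -DWITH_CUDNN=ON -DCUDA_ARCH_LIST="5.0"
make -j"$(nproc)"
```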

Is it possible to reintroduce support for Compute Capability 5.0 in the distributed wheels?

If so, I would be happy to provide a pull request.

giuliopaci avatar Aug 27 '24 16:08 giuliopaci

Any news about this issue? Is it possible to reintroduce support for Compute Capability 5.0?

giuliopaci avatar Dec 10 '24 11:12 giuliopaci

The first step would be to identify the commits that updated the library in such an incompatible way. I doubt the switch from cudnn8 to cudnn9 was responsible for this alone, if at all.

While trying to compile (on SM 5.0), I experienced:

ptxas /tmp/tmpxft_0002225f_00000000-6_dequantize_gpu.ptx, line 208; error : Feature 'f16 arithemetic and compare instructions' requires .target sm_53 or higher
ptxas fatal : Ptx assembly aborted due to errors
CMake Error at ctranslate2_generated_dequantize_gpu.cu.o.Release.cmake:280 (message):
Error generating file CTranslate2/python/CMakeFiles/ctranslate2.dir/src/ops/awq/./ctranslate2_generated_dequantize_gpu.cu.o

Just a hypothesis, but some heavy FP16-related code was introduced in https://github.com/OpenNMT/CTranslate2/pull/1651 to please A100 users. That commit was merged as part of 4.2.0.

(This sounds like the kind of huge PR no one could ever revert or amend... except the original submitter: @minhthuc2502?)

drzraf avatar Jul 06 '25 04:07 drzraf

I went further, bisecting commits, and found the culprit: #1727. The AWQ feature is missing guards against older cards, specifically in dequantize_gpu.cu as far as I can tell. Since it was initially intended as a new feature, I'm optimistic it could be amended with the necessary CMakeLists.txt stanzas to avoid breaking older SM versions. I also wonder about this part:

https://github.com/OpenNMT/CTranslate2/commit/39f48f2e843df52245e6c857326e1115bca12b03#diff-bf8c67e8b82ba69a7fd6f2a23ab329cec486f60e29b47218b6a0db20552dcc5dR6-R11

https://github.com/OpenNMT/CTranslate2/blob/39f48f2e843df52245e6c857326e1115bca12b03/src/ops/awq/dequantize.cuh#L7-L11

As proof, I was able to build and run v4.3.1 on an SM 5.0 card (GM107, 940MX) against CUDA 12 and cuDNN 9:

$ ldd 4.3.1/*
4.3.1/_ext.cpython-312-x86_64-linux-gnu.so:
	linux-vdso.so.1 (0x00007fff07ceb000)
	libctranslate2.so.4 => not found
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007a1032a00000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007a1032c9b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007a1032600000)
	/lib64/ld-linux-x86-64.so.2 (0x00007a1032e0b000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007a1032917000)
4.3.1/libctranslate2.so.4.3.1:
	linux-vdso.so.1 (0x00007ffc31ffb000)
	libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007c332e516000)
	libcudnn.so.9 => /lib/x86_64-linux-gnu/libcudnn.so.9 (0x00007c332ce00000)
	libcublas.so.12 => /lib/x86_64-linux-gnu/libcublas.so.12 (0x00007c3326400000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007c3326000000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007c332d117000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007c332e4e6000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007c3325c00000)
	/lib64/ld-linux-x86-64.so.2 (0x00007c332e5a6000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007c332e4e1000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007c332e4dc000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007c332e4d7000)
	libcublasLt.so.12 => /lib/x86_64-linux-gnu/libcublasLt.so.12 (0x00007c3304c00000)

(Actually, I failed to build 39f48f2, but I built and ran 451c27b6f, which translated a 2-minute clip in 15 seconds with CT2_CUDA_ALLOW_FP16=1 on a GM107 with 2 GB of VRAM, using the base model.)

I wish mere mortals (those of us without an A100 in the basement) could get some consideration from upstream, too ;)

drzraf avatar Jul 07 '25 02:07 drzraf

Hi,

Can we have an update on this from maintainers please? I'd like to use CUDA on an old 980 Ti.

Thanks!

Godnoken avatar Sep 22 '25 12:09 Godnoken