llama.cpp won't run when built for CUDA 13
zluda_trace logs (tarball/zip file)
No response
Description
I read at https://vosen.github.io/ZLUDA/blog/zluda-update-q3-2025/ that llama.cpp should now work with ZLUDA.
I tried building llama.cpp with the CUDA backend on Fedora 42 and running it with ZLUDA. This revealed a number of issues:
- Missing libraries:
$ ldd ./bin/llama-server
linux-vdso.so.1 (0x00007f5f2ef42000)
libmtmd.so => /home/wim/src/llama.cpp/build-cuda/bin/libmtmd.so (0x00007f5f2ee87000)
libcurl.so.4 => /lib64/libcurl.so.4 (0x00007f5f2ed95000)
libllama.so => /home/wim/src/llama.cpp/build-cuda/bin/libllama.so (0x00007f5f2ea00000)
libggml.so => /home/wim/src/llama.cpp/build-cuda/bin/libggml.so (0x00007f5f2ed8a000)
libggml-cpu.so => /home/wim/src/llama.cpp/build-cuda/bin/libggml-cpu.so (0x00007f5f2e882000)
libggml-cuda.so => /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so (0x00007f5f2c000000)
libcuda.so.1 => /home/wim/Downloads/zluda/libcuda.so.1 (0x00007f5f2b800000)
libggml-rpc.so => /home/wim/src/llama.cpp/build-cuda/bin/libggml-rpc.so (0x00007f5f2ed72000)
libggml-base.so => /home/wim/src/llama.cpp/build-cuda/bin/libggml-base.so (0x00007f5f2eccf000)
...
libcudart.so.13 => not found
libcublas.so.13 => not found
libcublasLt.so.13 => not found
libamdhip64.so.6 => /lib64/libamdhip64.so.6 (0x00007f5f29400000)
...
- When the missing libraries are supplied from a regular CUDA installation, the process launches but fails to initialize a CUDA device:
$ ./bin/llama-bench -m ~/.cache/llama.cpp/google_gemma-3-27b-it-qat-q4_0-gguf_gemma-3-27b-it-q4_0.gguf -ngl 999
ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
Steps to reproduce
My build commands for llama.cpp (running in a container with CUDA installed):
cmake -B build-cuda -DGGML_CUDA=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES="75;86;89"
cmake --build build-cuda/ --config Release -j10
I'm using ZLUDA v5 from the GitHub release archive.
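To run the binaries against ZLUDA, I simply put the extracted release directory on the loader path before launching (that's where the libcuda.so.1 in the ldd output above comes from); roughly:
$ export LD_LIBRARY_PATH=$HOME/Downloads/zluda:$LD_LIBRARY_PATH
$ ./bin/llama-bench -m <model.gguf> -ngl 999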
ZLUDA version
5
Operating System
Fedora 42
GPU
AMD Ryzen AI Max 395+ (Radeon RX 8060S)
Try the most recent build (Version 6-preview.4); this should resolve the CUDA 13 failures.
As for libcudart.so.13, you should just use the NVIDIA binary; it's provided by cuda-cudart-13-0 on Ubuntu, though I don't know the equivalent for Fedora.
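If you already have a CUDA 13 toolkit installed, pointing the dynamic loader at its lib directory should also work. A rough sketch, assuming the usual /usr/local/cuda layout and keeping the ZLUDA directory first so the ZLUDA-provided libraries still take precedence:
$ export LD_LIBRARY_PATH=/path/to/zluda:/usr/local/cuda/targets/x86_64-linux/lib:$LD_LIBRARY_PATH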
I can't run your model because I don't have a setup with that much GPU memory at hand, so I'm curious about your results.
I just retested and it no longer gives the same error as before, but it now crashes during execution. However, the same thing happens when I retest with the ROCm build of llama.cpp, so it might be a case of a system upgrade having broken the ROCm setup.
I should have another system later today with a 7900 XTX where I can set up a fully supported distro and hopefully test again. I will update once I've got more to report.
Ran again on a clean Ubuntu 24.04 setup with ROCm 7.0.1 installed and ZLUDA built from source; I get the same error as on my other system (Fedora with ROCm 6.4), so it's probably not caused by the setup:
wim@ramjet:~/src/llama.cpp$ zluda-run ./build-cuda/bin/llama-bench -m ~/.cache/llama.cpp/google_gemma-3-27b-it-qat-q4_0-gguf_gemma-3-27b-it-q4_0.gguf -ngl 999
./build-cuda/bin/llama-bench: /opt/zluda/libcublas.so.13: no version information available (required by /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Radeon RX 7900 XTX [ZLUDA], compute capability 8.8, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
/home/wim/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:88: CUDA error
[New LWP 58612]
[New LWP 58609]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000072c484110813 in __GI___wait4 (pid=58629, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x000072c484110813 in __GI___wait4 (pid=58629, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x000072c484770633 in ggml_print_backtrace () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-base.so
#2 0x000072c4847707db in ggml_abort () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-base.so
#3 0x000072c47df3fe37 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so
#4 0x000072c47e25b2ae in void launch_mul_mat_q<(ggml_type)2, 128>(ggml_backend_cuda_context&, mmq_args const&, CUstream_st*) () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so
#5 0x000072c47df6897f in ggml_cuda_mul_mat_q(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so
#6 0x000072c47df519eb in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so
#7 0x000072c48478b777 in ggml_backend_sched_graph_compute_async () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-base.so
#8 0x000072c4848a19e1 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/wim/src/llama.cpp/build-cuda/bin/libllama.so
#9 0x000072c4848a341c in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/wim/src/llama.cpp/build-cuda/bin/libllama.so
#10 0x000072c4848a970f in llama_context::decode(llama_batch const&) () from /home/wim/src/llama.cpp/build-cuda/bin/libllama.so
#11 0x000072c4848aa62f in llama_decode () from /home/wim/src/llama.cpp/build-cuda/bin/libllama.so
#12 0x00005dc202d041a1 in test_prompt(llama_context*, int, int, int) ()
#13 0x00005dc202d0011b in main ()
[Inferior 1 (process 58606) detached]
/usr/local/bin/zluda-run: line 12: 58606 Aborted (core dumped) "${@}"
If there are any more tests I can run or data I can collect to help debug this, please let me know.
Running with export AMD_LOG_LEVEL=4 HIP_LAUNCH_BLOCKING=1 AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3 gave me the following:
...
:3:hip_device.cpp :656 : 33585513589 us: [pid:68741 tid: 0x7ed36a757000] hipGetDevicePropertiesR0600 ( 0x7ffc06033a60, 0 )
:3:hip_device.cpp :658 : 33585513607 us: [pid:68741 tid: 0x7ed36a757000] hipGetDevicePropertiesR0600: Returned hipSuccess :
:3:hip_module.cpp :62 : 33585522699 us: [pid:68741 tid: 0x7ed36a757000] hipModuleLoadData ( 0x7ffc06033968, 0x5bf21340ef50 )
:3:devprogram.cpp :2621: 33585523581 us: [pid:68741 tid: 0x7ed36a757000] Using Code Object V5.
:3:hip_module.cpp :63 : 33585527536 us: [pid:68741 tid: 0x7ed36a757000] hipModuleLoadData: Returned hipSuccess :
:3:hip_module.cpp :78 : 33585531341 us: [pid:68741 tid: 0x7ed36a757000] hipModuleGetFunction ( 0x7ffc06034748, 0x5bf211879220, _Z9mul_mat_qIL9ggml_type2ELi128ELb0EEvPKcPKiS4_S4_PfS5_iiiiiiiiiiiiiiiii )
:1:hip_code_object.cpp :1174: 33585531349 us: [pid:68741 tid: 0x7ed36a757000] Cannot find the function: _Z9mul_mat_qIL9ggml_type2ELi128ELb0EEvPKcPKiS4_S4_PfS5_iiiiiiiiiiiiiiiii
:1:hip_module.cpp :88 : 33585531353 us: [pid:68741 tid: 0x7ed36a757000] Cannot find the function: _Z9mul_mat_qIL9ggml_type2ELi128ELb0EEvPKcPKiS4_S4_PfS5_iiiiiiiiiiiiiiiii for module: 0x11879220
:3:hip_module.cpp :89 : 33585531357 us: [pid:68741 tid: 0x7ed36a757000] hipModuleGetFunction: Returned hipErrorNotFound :
/home/wim/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:88: CUDA error
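One way to double-check that the kernel really is present in the PTX embedded by llama.cpp (i.e. that it gets lost on the ZLUDA/HIP module side rather than never being built) would be to grep the embedded PTX, assuming the CUDA toolkit's cuobjdump is available:
$ cuobjdump -ptx ./build-cuda/bin/libggml-cuda.so | grep -c '_Z9mul_mat_qIL9ggml_type2ELi128'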
I've captured the trace logs for this here: zluda-llama-cpp-gemma3.tar.gz
As I mentioned on Discord, I've built ZLUDA and llama.cpp against ROCm 7.0.1; if this is complicating things, let me know and I can retest using ROCm 6.x.
Thanks, the problem is 100% on the ZLUDA side. We do not implement the mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 instruction.
The good news is that support for the mma. family of instructions is what we are working on right now; the bad news is that it's fairly time-consuming, so I ask for your patience.
As of #571 this should work correctly, with a caveat: I recommend building your llama.cpp with GGML_CUDA_FORCE_CUBLAS=1. GGML_CUDA_FORCE_CUBLAS=0 works just fine, but the cuBLAS path is much faster.
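For example, extending your cmake invocation from above (assuming the flag maps to the GGML_CUDA_FORCE_CUBLAS CMake option, which sets the corresponding compile definition):
$ cmake -B build-cuda -DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES="75;86;89"
$ cmake --build build-cuda/ --config Release -j10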
I have not run a full performance benchmark yet, but from what I tried it should be marginally faster than the ROCm backend (bf16) or about the same (integer quantizations).
I tried it on Linux with CUDA 13. ZLUDA is not compatible with ROCm 7 yet.
Hi,
First of all, many thanks for the hard work, and apologies for the delay in validating this fix.
Unfortunately, I am now experiencing segfaults on my laptop (Fedora 43, ROCm 6.4.3, CUDA 13.1, ZLUDA 629158c, llama.cpp 10b4f82d).
I collected the zluda trace logs again in case they help:
When running with gdb attached, I get the following backtrace:
Thread 1 "llama-bench" received signal SIGSEGV, Segmentation fault.
0x00000000000000d0 in ?? ()
Missing rpms, try: dnf --enablerepo='*debug*' install cuda-cudart-13-1-debuginfo-13.1.80-1.x86_64 libdrm-amdgpu-debuginfo-2.4.125.70101-2255337.el9.x86_64
(gdb) bt
#0 0x00000000000000d0 in ?? ()
#1 0x00007fffeb23f0e8 in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
#2 0x00007fffeb2456b8 in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
#3 0x00007fffeb58f804 in __pthread_once_slow (once_control=0x7fffeb4b93d0, init_routine=0x7fffeb245670) at pthread_once.c:116
#4 0x00007fffeb58f879 in ___pthread_once (once_control=<optimized out>, init_routine=<optimized out>) at pthread_once.c:143
#5 0x00007fffeb28e369 in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
#6 0x00007fffeb242b1f in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
#7 0x00007fffeb24fbfa in cudaGetDeviceCount () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
#8 0x00007ffff0721ef0 in ggml_cuda_init() () from /home/wim/src/llama.cpp/build-zluda/bin/libggml-cuda.so.0
#9 0x00007ffff0722fad in ggml_cuda_info() () from /home/wim/src/llama.cpp/build-zluda/bin/libggml-cuda.so.0
#10 0x00007ffff07248b5 in ggml_backend_cuda_reg () from /home/wim/src/llama.cpp/build-zluda/bin/libggml-cuda.so.0
#11 0x00007ffff7fb12fe in get_reg() () from /home/wim/src/llama.cpp/build-zluda/bin/libggml.so.0
#12 0x00007ffff7fb2d15 in ggml_backend_load_best(char const*, bool, char const*) [clone .constprop.0] [clone .isra.0] ()
from /home/wim/src/llama.cpp/build-zluda/bin/libggml.so.0
#13 0x00007ffff7fb4a50 in ggml_backend_load_all_from_path () from /home/wim/src/llama.cpp/build-zluda/bin/libggml.so.0
#14 0x0000000000404f7a in main ()
I downgraded my laptop back to ROCm 6.4 for this, only to discover that the next PR that was merged enabled ROCm 7.x 🙃
Back on ROCm 7.1 with the latest version of ZLUDA built from source, I get no crashes, but llama-bench gets stuck at 100% load on a single CPU core with no GPU load.
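I can grab a backtrace of the stuck process if that would help; I'd do it with something like this (assuming gdb is installed and pidof matches a single llama-bench process):
$ gdb -p $(pidof llama-bench) -batch -ex 'thread apply all bt'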