
llama.cpp won't run when built for CUDA 13

Open de-wim opened this issue 3 months ago • 7 comments

zluda_trace logs (tarball/zip file)

No response

Description

I read at https://vosen.github.io/ZLUDA/blog/zluda-update-q3-2025/ that llama.cpp should now work with ZLUDA.

I tried building llama.cpp with the CUDA backend on Fedora 42 and then running it against ZLUDA. This revealed a number of issues:

  1. Missing libraries:
$ ldd ./bin/llama-server 
        linux-vdso.so.1 (0x00007f5f2ef42000)
        libmtmd.so => /home/wim/src/llama.cpp/build-cuda/bin/libmtmd.so (0x00007f5f2ee87000)
        libcurl.so.4 => /lib64/libcurl.so.4 (0x00007f5f2ed95000)
        libllama.so => /home/wim/src/llama.cpp/build-cuda/bin/libllama.so (0x00007f5f2ea00000)
        libggml.so => /home/wim/src/llama.cpp/build-cuda/bin/libggml.so (0x00007f5f2ed8a000)
        libggml-cpu.so => /home/wim/src/llama.cpp/build-cuda/bin/libggml-cpu.so (0x00007f5f2e882000)
        libggml-cuda.so => /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so (0x00007f5f2c000000)
        libcuda.so.1 => /home/wim/Downloads/zluda/libcuda.so.1 (0x00007f5f2b800000)
        libggml-rpc.so => /home/wim/src/llama.cpp/build-cuda/bin/libggml-rpc.so (0x00007f5f2ed72000)
        libggml-base.so => /home/wim/src/llama.cpp/build-cuda/bin/libggml-base.so (0x00007f5f2eccf000)
...
        libcudart.so.13 => not found
        libcublas.so.13 => not found
        libcublasLt.so.13 => not found
        libamdhip64.so.6 => /lib64/libamdhip64.so.6 (0x00007f5f29400000)
...

  2. When supplying the missing libraries from a regular CUDA installation, the process launches but fails to initialize a CUDA device:
$ ./bin/llama-bench  -m ~/.cache/llama.cpp/google_gemma-3-27b-it-qat-q4_0-gguf_gemma-3-27b-it-q4_0.gguf -ngl 999
ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |

Steps to reproduce

My build commands for llama.cpp (running in a container with CUDA installed):

cmake -B build-cuda -DGGML_CUDA=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES="75;86;89"
cmake --build build-cuda/ --config Release -j10
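
For reference, a minimal sketch of how the resulting binaries can be pointed at the ZLUDA libraries (the directory below is taken from the ldd output above and is an assumption about the actual setup):

# sketch only: adjust the ZLUDA path to wherever the release archive was unpacked
export LD_LIBRARY_PATH=$HOME/Downloads/zluda:$LD_LIBRARY_PATH
ldd ./build-cuda/bin/llama-server | grep -E 'libcuda|libcudart|libcublas'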

I'm using ZLUDA v5 from the GitHub release archive.

ZLUDA version

5

Operating System

Fedora 42

GPU

AMD Ryzen AI Max+ 395 (Radeon 8060S)

de-wim avatar Oct 03 '25 11:10 de-wim

Try the most recent build (version 6-preview.4); this should resolve the CUDA 13 failures. As for libcudart.so.13, you should just use the NVIDIA binary; it's provided by cuda-cudart-13-0 on Ubuntu (I don't know about Fedora). I can't run your model because I don't have a setup with that much GPU memory at hand, so I'm curious about your results.
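
For reference, a minimal sketch of wiring that up on Ubuntu (the package name is the one mentioned above; the library path is a typical default, and Fedora package names may differ):

# sketch, assuming NVIDIA's CUDA 13 apt repository is already configured
sudo apt install cuda-cudart-13-0
# put the NVIDIA runtime libraries and the ZLUDA libraries on the loader path before running llama.cpp
export LD_LIBRARY_PATH=/usr/local/cuda/targets/x86_64-linux/lib:/path/to/zluda:$LD_LIBRARY_PATH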

vosen avatar Oct 06 '25 19:10 vosen

I just retested and it didn't give the same error as before, but it now crashes during execution. However, I retested with the ROCm build of llama.cpp and the same thing happens there, so it might be a case of a system upgrade having broken the ROCm setup.

I should have another system later today with a 7900 XTX where I can set up a fully supported distro and hopefully test again. I will update once I've got more to report.

de-wim avatar Oct 09 '25 11:10 de-wim

Ran again on a clean Ubuntu 24.04 setup with ROCm 7.0.1 installed and ZLUDA built from source. I get the same error as on my other system (Fedora with ROCm 6.4), so it's probably not caused by the setup:

wim@ramjet:~/src/llama.cpp$ zluda-run ./build-cuda/bin/llama-bench -m ~/.cache/llama.cpp/google_gemma-3-27b-it-qat-q4_0-gguf_gemma-3-27b-it-q4_0.gguf -ngl 999
./build-cuda/bin/llama-bench: /opt/zluda/libcublas.so.13: no version information available (required by /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Radeon RX 7900 XTX [ZLUDA], compute capability 8.8, VMM: no
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
/home/wim/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:88: CUDA error
[New LWP 58612]
[New LWP 58609]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000072c484110813 in __GI___wait4 (pid=58629, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x000072c484110813 in __GI___wait4 (pid=58629, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000072c484770633 in ggml_print_backtrace () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-base.so
#2  0x000072c4847707db in ggml_abort () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-base.so
#3  0x000072c47df3fe37 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so
#4  0x000072c47e25b2ae in void launch_mul_mat_q<(ggml_type)2, 128>(ggml_backend_cuda_context&, mmq_args const&, CUstream_st*) () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so
#5  0x000072c47df6897f in ggml_cuda_mul_mat_q(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so
#6  0x000072c47df519eb in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-cuda.so
#7  0x000072c48478b777 in ggml_backend_sched_graph_compute_async () from /home/wim/src/llama.cpp/build-cuda/bin/libggml-base.so
#8  0x000072c4848a19e1 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/wim/src/llama.cpp/build-cuda/bin/libllama.so
#9  0x000072c4848a341c in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/wim/src/llama.cpp/build-cuda/bin/libllama.so
#10 0x000072c4848a970f in llama_context::decode(llama_batch const&) () from /home/wim/src/llama.cpp/build-cuda/bin/libllama.so
#11 0x000072c4848aa62f in llama_decode () from /home/wim/src/llama.cpp/build-cuda/bin/libllama.so
#12 0x00005dc202d041a1 in test_prompt(llama_context*, int, int, int) ()
#13 0x00005dc202d0011b in main ()
[Inferior 1 (process 58606) detached]
/usr/local/bin/zluda-run: line 12: 58606 Aborted                 (core dumped) "${@}"

If there are any more tests I can run or data I can collect to help debug this, please let me know.

de-wim avatar Oct 10 '25 17:10 de-wim

Running with export AMD_LOG_LEVEL=4 HIP_LAUNCH_BLOCKING=1 AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3 gave me the following:

...
:3:hip_device.cpp           :656 : 33585513589 us: [pid:68741 tid: 0x7ed36a757000]  hipGetDevicePropertiesR0600 ( 0x7ffc06033a60, 0 ) 
:3:hip_device.cpp           :658 : 33585513607 us: [pid:68741 tid: 0x7ed36a757000] hipGetDevicePropertiesR0600: Returned hipSuccess : 
:3:hip_module.cpp           :62  : 33585522699 us: [pid:68741 tid: 0x7ed36a757000]  hipModuleLoadData ( 0x7ffc06033968, 0x5bf21340ef50 ) 
:3:devprogram.cpp           :2621: 33585523581 us: [pid:68741 tid: 0x7ed36a757000] Using Code Object V5.
:3:hip_module.cpp           :63  : 33585527536 us: [pid:68741 tid: 0x7ed36a757000] hipModuleLoadData: Returned hipSuccess : 
:3:hip_module.cpp           :78  : 33585531341 us: [pid:68741 tid: 0x7ed36a757000]  hipModuleGetFunction ( 0x7ffc06034748, 0x5bf211879220, _Z9mul_mat_qIL9ggml_type2ELi128ELb0EEvPKcPKiS4_S4_PfS5_iiiiiiiiiiiiiiiii ) 
:1:hip_code_object.cpp      :1174: 33585531349 us: [pid:68741 tid: 0x7ed36a757000] Cannot find the function: _Z9mul_mat_qIL9ggml_type2ELi128ELb0EEvPKcPKiS4_S4_PfS5_iiiiiiiiiiiiiiiii 
:1:hip_module.cpp           :88  : 33585531353 us: [pid:68741 tid: 0x7ed36a757000] Cannot find the function: _Z9mul_mat_qIL9ggml_type2ELi128ELb0EEvPKcPKiS4_S4_PfS5_iiiiiiiiiiiiiiiii for module: 0x11879220
:3:hip_module.cpp           :89  : 33585531357 us: [pid:68741 tid: 0x7ed36a757000] hipModuleGetFunction: Returned hipErrorNotFound : 
/home/wim/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:88: CUDA error
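
For completeness, a sketch of the presumed full invocation (the model path and zluda-run wrapper are assumed to match the earlier run; the tee just captures the log to a file):

export AMD_LOG_LEVEL=4 HIP_LAUNCH_BLOCKING=1 AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3
zluda-run ./build-cuda/bin/llama-bench -m ~/.cache/llama.cpp/google_gemma-3-27b-it-qat-q4_0-gguf_gemma-3-27b-it-q4_0.gguf -ngl 999 2>&1 | tee amd_log.txt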

de-wim avatar Oct 10 '25 17:10 de-wim

I've captured the trace logs for this here: zluda-llama-cpp-gemma3.tar.gz

As I mentioned on Discord, I've built ZLUDA and llama.cpp against ROCm 7.0.1; if this is complicating things, let me know and I can retest using ROCm 6.x.

de-wim avatar Oct 11 '25 05:10 de-wim

Thanks, the problem is 100% on the ZLUDA side: we do not implement the mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 instruction. The good news is that support for the mma. family of instructions is what we are working on right now; the bad news is that it's fairly time-consuming, so I ask for your patience.

vosen avatar Oct 11 '25 06:10 vosen

As of #571 this should work correctly, with a caveat: I recommend building your llama.cpp with GGML_CUDA_FORCE_CUBLAS=1. GGML_CUDA_FORCE_CUBLAS=0 works just fine, but the cuBLAS path is much faster. I have not run a full performance benchmark yet, but from what I tried it should be marginally faster than the ROCm backend (bf16) or the same (integer quantizations). I tried it on Linux with CUDA 13. ZLUDA is not compatible with ROCm 7 yet.
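
A sketch of a configure line with the recommended GGML_CUDA_FORCE_CUBLAS option, assuming the same build setup used earlier in this thread:

cmake -B build-cuda -DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES="75;86;89"
cmake --build build-cuda/ --config Release -j10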

vosen avatar Dec 12 '25 22:12 vosen

Hi,

First of all, many thanks for the hard work, and apologies for the delay in validating this fix.

Unfortunately I am now experiencing segfaults on my laptop (Fedora 43, ROCm 6.4.3, CUDA 13.1, ZLUDA 629158c, llama.cpp 10b4f82d).

I collected the zluda trace logs again in case they help:

zluda-logs-llama-cpp.tar.gz

When running with gdb attached, I get the following backtrace:

Thread 1 "llama-bench" received signal SIGSEGV, Segmentation fault.
0x00000000000000d0 in ?? ()
Missing rpms, try: dnf --enablerepo='*debug*' install cuda-cudart-13-1-debuginfo-13.1.80-1.x86_64 libdrm-amdgpu-debuginfo-2.4.125.70101-2255337.el9.x86_64
(gdb) bt
#0  0x00000000000000d0 in ?? ()
#1  0x00007fffeb23f0e8 in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
#2  0x00007fffeb2456b8 in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
#3  0x00007fffeb58f804 in __pthread_once_slow (once_control=0x7fffeb4b93d0, init_routine=0x7fffeb245670) at pthread_once.c:116
#4  0x00007fffeb58f879 in ___pthread_once (once_control=<optimized out>, init_routine=<optimized out>) at pthread_once.c:143
#5  0x00007fffeb28e369 in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
#6  0x00007fffeb242b1f in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
#7  0x00007fffeb24fbfa in cudaGetDeviceCount () from /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13
#8  0x00007ffff0721ef0 in ggml_cuda_init() () from /home/wim/src/llama.cpp/build-zluda/bin/libggml-cuda.so.0
#9  0x00007ffff0722fad in ggml_cuda_info() () from /home/wim/src/llama.cpp/build-zluda/bin/libggml-cuda.so.0
#10 0x00007ffff07248b5 in ggml_backend_cuda_reg () from /home/wim/src/llama.cpp/build-zluda/bin/libggml-cuda.so.0
#11 0x00007ffff7fb12fe in get_reg() () from /home/wim/src/llama.cpp/build-zluda/bin/libggml.so.0
#12 0x00007ffff7fb2d15 in ggml_backend_load_best(char const*, bool, char const*) [clone .constprop.0] [clone .isra.0] ()
   from /home/wim/src/llama.cpp/build-zluda/bin/libggml.so.0
#13 0x00007ffff7fb4a50 in ggml_backend_load_all_from_path () from /home/wim/src/llama.cpp/build-zluda/bin/libggml.so.0
#14 0x0000000000404f7a in main ()
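
The gdb invocation itself is not shown above; a sketch of a typical way to reproduce it (the arguments are assumed to match the earlier llama-bench runs):

gdb --args ./build-zluda/bin/llama-bench -m ~/.cache/llama.cpp/google_gemma-3-27b-it-qat-q4_0-gguf_gemma-3-27b-it-q4_0.gguf -ngl 999
(gdb) run
(gdb) bt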

I downgraded my laptop back to ROCm 6.4 for this, only to discover that the next PR to be merged enabled ROCm 7.x 🙃

de-wim avatar Dec 20 '25 21:12 de-wim

Back on ROCm 7.1 with the latest version of ZLUDA built from source, I get no crashes but llama-bench gets stuck with 100% load on a single CPU core and no GPU load.
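
A standard way to see where it is spinning would be to attach gdb to the stuck process and dump all thread backtraces (sketch only; assumes llama-bench is the only matching process):

gdb -p "$(pidof llama-bench)" -batch -ex 'thread apply all bt'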

de-wim avatar Dec 20 '25 22:12 de-wim