llama.cpp
CUDA error: invalid device function when compiling and running for AMD gfx1032
Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

I have an AMD 6700S GPU with 8 GB of VRAM. I got oobabooga to work on this computer, but I can't get llama.cpp to work. I compiled with

make clean && make -j16 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gxf1032

and everything went fine. However, when I try to run, I first do

export HSA_OVERRIDE_GFX_VERSION=10.3.0

and then

HIP_VISIBLE_DEVICES=0 ./main -ngl 50 -m /home/lenovoubuntu/Downloads/text-generation-webui-main/models/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf -p "Write a function in TypeScript that sums numbers"

(I set HIP_VISIBLE_DEVICES because my machine has an iGPU as well.)
It returns:

.................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 4232.06 MiB (model: 4095.06 MiB, context: 137.00 MiB)
CUDA error: invalid device function
  current device: 0, in function ggml_cuda_op_flatten at ggml-cuda.cu:7971
  hipGetLastError()
GGML_ASSERT: ggml-cuda.cu:226: !"CUDA error"
Could not attach to process.  If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
So I ran it with sudo, as the message suggested, using this command:

sudo LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=0 ./main -ngl 50 -m /home/lenovoubuntu/Downloads/text-generation-webui-main/models/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf -p "Write a function in TypeScript that sums numbers"

I used all of those environment variables since oobabooga required them, and I was hoping they would fix things here too.
However, that just returns this after seemingly loading the model.
CUDA error: invalid device function
  current device: 0, in function ggml_cuda_op_flatten at ggml-cuda.cu:7971
  hipGetLastError()
GGML_ASSERT: ggml-cuda.cu:226: !"CUDA error"
[New LWP 23593]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f34398ea42f in __GI___wait4 (pid=23599, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f34398ea42f in __GI___wait4 (pid=23599, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000055fb56cca7fb in ggml_print_backtrace ()
#2  0x000055fb56d90f95 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#3  0x000055fb56d9da1e in ggml_cuda_op_flatten(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, float const*, float const*, float*, ihipStream_t*)) ()
#4  0x000055fb56d92df3 in ggml_cuda_compute_forward ()
#5  0x000055fb56cf8898 in ggml_graph_compute_thread ()
#6  0x000055fb56cfca98 in ggml_graph_compute ()
#7  0x000055fb56dbc41e in ggml_backend_cpu_graph_compute ()
#8  0x000055fb56dbcf0b in ggml_backend_graph_compute ()
#9  0x000055fb56d2b046 in llama_decode_internal(llama_context&, llama_batch) ()
#10 0x000055fb56d2bb63 in llama_decode ()
#11 0x000055fb56d66316 in llama_init_from_gpt_params(gpt_params&) ()
#12 0x000055fb56cbc31a in main ()
[Inferior 1 (process 23582) detached]
Aborted
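The "ptrace: Operation not permitted" part of the output only concerns llama.cpp trying to attach gdb and print a backtrace after the crash; it is unrelated to the crash itself. A minimal sketch of how the backtrace could be captured without running everything as root, and how to double-check which gfx target the ROCm runtime actually reports (assuming a standard ROCm install that ships rocminfo):

# temporarily relax the Yama ptrace restriction so gdb can attach (resets on reboot)
sudo sysctl -w kernel.yama.ptrace_scope=0

# list the gfx targets the ROCm runtime sees; the dGPU should show up as e.g. gfx1032
rocminfo | grep -i gfx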
I get a similar error on an AMD 780M (iGPU) while trying to run any model:

CUDA error: invalid device function
  current device: 0, in function ggml_cuda_op_flatten at ggml-cuda.cu:7971
llama.cpp was compiled with LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1100 and run with HSA_OVERRIDE_GFX_VERSION=gfx1100.
ROCm version 5.7.1.
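Note that elsewhere in this thread HSA_OVERRIDE_GFX_VERSION is given in its numeric form (10.3.0, 9.0.0) rather than as a gfx name, so for a gfx1100 target the run command would presumably look more like the following (an untested sketch; the model path and prompt are placeholders):

HSA_OVERRIDE_GFX_VERSION=11.0.0 ./main -ngl 50 -m <model.gguf> -p "<prompt>"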
I also had a similar error when running on my gfx90c device (which needs to be overridden to gfx900).
What solved the problem for me was also setting the environment variable HSA_OVERRIDE_GFX_VERSION when running make (together with AMDGPU_TARGETS, although I'm not exactly sure if this value actually changes anything). So for me, the make command would look like this:

HSA_OVERRIDE_GFX_VERSION=9.0.0 make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx900
I honestly didn't think that this would work at all, but it certainly did! For me, though, since my iGPU lacks INT8 operators, performance was worse than just using the CPU, but it did run on the iGPU (checked with nvtop).
Hope that this works for you too!
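Carried over to the gfx1032 card from the original report, the same pattern would presumably be the following (an untested sketch; the gfx1032 target and the 10.3.0 override are taken from the report above, and the model path and prompt are placeholders):

HSA_OVERRIDE_GFX_VERSION=10.3.0 make -j16 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gfx1032
HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=0 ./main -ngl 50 -m <model.gguf> -p "<prompt>"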
My guess on why this hasn't been reported much
I would say that quite a few people have already run export HSA_OVERRIDE_GFX_VERSION=xxx before, which makes the environment variable available to all programs started from that shell (including make), so a subsequent explicit declaration is unnecessary.
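To illustrate the difference with a generic shell example (not from the thread): an exported variable is inherited by every command started afterwards in that shell, while a prefix assignment applies to that one command only.

# exported once, visible to every later command in this shell, including make:
export HSA_OVERRIDE_GFX_VERSION=9.0.0
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx900

# set only for a single command; a later make run without the prefix will not see it:
HSA_OVERRIDE_GFX_VERSION=9.0.0 make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx900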
What solved the problem for me was also setting the environment variable HSA_OVERRIDE_GFX_VERSION when running make (together with the AMDGPU_TARGETS, although I'm not exactly sure if this value actually changes anything).
Thank you! This hint finally allowed me to run all 33 layers of Mixtral Q5_K_M on the iGPU. Since it's an APU with shared RAM, it can't compete with dGPUs, but the generation speedup is over 60% nonetheless.
CPU (7840u):
llama_print_timings: load time = 2052.23 ms
llama_print_timings: sample time = 111.57 ms / 727 runs ( 0.15 ms per token, 6515.97 tokens per second)
llama_print_timings: prompt eval time = 34619.23 ms / 538 tokens ( 64.35 ms per token, 15.54 tokens per second)
llama_print_timings: eval time = 248061.72 ms / 726 runs ( 341.68 ms per token, 2.93 tokens per second)
llama_print_timings: total time = 283023.52 ms
GPU (780m):
llama_print_timings: load time = 39038.83 ms
llama_print_timings: sample time = 132.02 ms / 867 runs ( 0.15 ms per token, 6567.34 tokens per second)
llama_print_timings: prompt eval time = 44011.30 ms / 538 tokens ( 81.81 ms per token, 12.22 tokens per second)
llama_print_timings: eval time = 181460.51 ms / 866 runs ( 209.54 ms per token, 4.77 tokens per second)
llama_print_timings: total time = 225876.68 ms
Strangely, prompt processing is slower on GPU.
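For reference, working from the timings above: generation goes from 341.68 ms per token on the CPU to 209.54 ms per token on the iGPU (341.68 / 209.54 ≈ 1.63, i.e. a roughly 60-65% faster eval rate), while prompt processing drops from 15.54 to 12.22 tokens per second (about 20% slower on the iGPU).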
This issue was closed because it has been inactive for 14 days since being marked as stale.