Radeon RX 6700 XT not utilized
I updated ramalama from 0.5.4 to 0.7.2 (it just worked as I needed, so I forgot to update it, great job btw!) because I want to experiment with the rag option, and I see my GPU is not fully utilized by the ramalama serve command. I dug into it a bit and this issue starts with 0.6.0. It happens even with --ngl 999 set; in 0.5.4 I was using the --gpu switch to force GPU use.
I have a Radeon RX 6700 XT with the HSA_OVERRIDE_GFX_VERSION=10.3.0 env var set to override it as supported, Fedora 41, podman 5.4.1.
Ramalama 0.7.2
Bench command
I see the GPU is detected and utilized by the bench command.
Output from ramalama --debug bench qwen-coder:14b with a screenshot from radeontop on ramalama==0.7.2:
$ ramalama --debug bench qwen-coder:14b
run_cmd: podman inspect quay.io/ramalama/ramalama:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
run_cmd: podman inspect quay.io/ramalama/ramalama:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
exec_cmd: podman run --rm -i --label ai.ramalama --name ramalama_ugaXCEl6nT --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=ollama://qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.command=bench --pull=newer -t --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --network none --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:0.7 llama-bench --threads 8 -m /mnt/models/model.file
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | pp512 | 252.24 ± 0.51 |
| qwen2 14B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | tg128 | 35.81 ± 0.05 |
build: ef19c717 (4958)
GPU:
CPU:
Serve command
When I run ramalama --debug serve qwen-coder:14b --name qwen --detach -p 12401 --ngl 999, the output is (with utilization while working with the model):
$ ramalama --debug serve qwen-coder:14b --name qwen --detach -p 12401 --ngl 999
run_cmd: podman inspect quay.io/ramalama/ramalama:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
run_cmd: podman inspect quay.io/ramalama/ramalama:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
run_cmd: podman inspect quay.io/ramalama/ramalama:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
exec_cmd: podman run --rm -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=ollama://qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --pull=newer -t -d -p 12401:12401 --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:0.7 llama-server --port 12401 --model /mnt/models/model.file --alias qwen2.5-coder:14b --ctx-size 2048 --temp 0.8 -v --threads 8 --host 0.0.0.0
3836dd741633060d5ec96f0ba314d4a00b08ce89210147b3ed3f10a4da82d558
GPU (doing something at the beginning - this is the peak, but once it starts generating text it's idle and the CPU does all the work):
CPU (during generation):
Ramalama 0.5.5 (latest working version)
For comparison, utilization and outputs from 0.5.5:
Bench
$ ramalama --debug bench qwen-coder:14b
run_cmd: podman inspect quay.io/ramalama/ramalama:0.5
Working directory: None
Ignore stderr: False
Ignore all: True
exec_cmd: podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_MYV4dXiypb --pull=newer -t --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:latest llama-bench -m /mnt/models/model.file
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | pp512 | 251.02 ± 0.28 |
| qwen2 14B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | tg128 | 30.72 ± 0.06 |
build: ef19c717 (4958)
Serve
$ ramalama --gpu --debug serve qwen-coder:14b --name qwen --detach -p 12401
run_cmd: podman inspect quay.io/ramalama/ramalama:0.5
Working directory: None
Ignore stderr: False
Ignore all: True
exec_cmd: podman run --rm -i --label RAMALAMA --security-opt=label=disable --name qwen --pull=newer -t -d -p 12401:12401 --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:latest llama-server --port 12401 -m /mnt/models/model.file -c 2048 --temp 0.8 -ngl 999 --host 0.0.0.0
4ed40093b328a33eb05529c77737006f4eabf4ae9dddf2a8b6b27c27f4fbf61f
GPU (much faster generation, GPU is definitely used):
CPU (during generation):
One major change between ramalama 0.5.4 and 0.7.2 is that we switched the runtime image from UBI with AMD's ROCm packages to Fedora 42 with Fedora's ROCm packages, because the Fedora ROCm maintainer enabled a lot more AMD gfx series and that provides the widest coverage of consumer hardware.
However, I don't think that's your issue, because the command output indicates it is using the quay.io/ramalama/ramalama:latest container image, which doesn't contain any of the AMD ROCm accelerator machinery. In light of that, what I find weird is that it's touching your GPU at all for the bench run. Something is wonky there.
Can you please provide the output of the following command? rocminfo | grep gfx
Also, can you try this command for your serving attempt and see if that changes the behavior at all? (This forces the use of the AMD ROCm enabled image, since it seems like ramalama isn't detecting your card correctly and selecting that image.)
ramalama --image=quay.io/ramalama/rocm --gpu --debug serve qwen-coder:14b --name qwen --detach -p 12401
rocminfo | grep gfx:
$ rocminfo | grep gfx
Name: gfx1030
Name: amdgcn-amd-amdhsa--gfx1030
ramalama --image=quay.io/ramalama/rocm --debug serve qwen-coder:14b --name qwen --detach -p 12401 on ramalama 0.7.2 (without --gpu, as it's not supported):
$ ramalama --image=quay.io/ramalama/rocm --debug serve qwen-coder:14b --name qwen --detach -p 12401
exec_cmd: podman run --rm -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=ollama://qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --pull=newer -t -d -p 12401:12401 --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/rocm llama-server --port 12401 --model /mnt/models/model.file --alias qwen2.5-coder:14b --ctx-size 2048 --temp 0.8 -v --threads 8 --host 0.0.0.0
Trying to pull quay.io/ramalama/rocm:latest...
Getting image source signatures
Copying blob b931a26c663b done |
Copying blob fc454ec91464 done |
Copying blob f5b2012436d8 done |
Copying config 4d428a2f62 done |
Writing manifest to image destination
604036a057accfe803ef4dbe429ce30ac2cc834a7e012c90abac6a7f5157a7f6
and the GPU is not being utilized.
ramalama --image=quay.io/ramalama/rocm --debug serve qwen-coder:14b --name qwen --detach -p 12401 on ramalama 0.7.4 (without --gpu, as it's not supported):
$ ramalama --image=quay.io/ramalama/rocm --debug serve qwen-coder:14b --name qwen --detach -p 12401
exec_cmd: podman run --rm -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=ollama://qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --pull=newer -t -d -p 12401:12401 --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/rocm:0.7 llama-server --port 12401 --model /mnt/models/model.file --alias qwen2.5-coder:14b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v --threads 8 --host 0.0.0.0
Trying to pull quay.io/ramalama/rocm:0.7...
Getting image source signatures
Copying blob b931a26c663b skipped: already exists
Copying blob fc454ec91464 skipped: already exists
Copying blob f5b2012436d8 skipped: already exists
Copying config 4d428a2f62 done |
Writing manifest to image destination
d4f43c371c4c3b4b30300dde6a8c8d3f8f6430be4f80fb3060b37a62890b7ac8
Still the same, the GPU is not utilized.
If there is anything else I could try/provide etc., feel free to ping me. I like the project, so it's a pleasure for me to help you with making it better.
RX 6900 XT works for me (main branch).
I do not see ngl passed in the podman invocation from your output:
exec_cmd: podman run --rm -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=ollama://qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --pull=newer -t -d -p 12401:12401 --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:0.7 llama-server --port 12401 --model /mnt/models/model.file --alias qwen2.5-coder:14b --ctx-size 2048 --temp 0.8 -v --threads 8 --host 0.0.0.0
mine:
ramalama --debug serve qwen2.5-coder:14b --name qwen --detach -p 12401 --ngl 999
exec_cmd: podman run --rm --label ai.ramalama.model=qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd -e HIP_VISIBLE_DEVICES=0 -p 12401:12401 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -d -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --label ai.ramalama.model=qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --mount=type=bind,src=/home/turul/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/turul/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/snapshots/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/rocm:0.7 llama-server --port 12401 --model /mnt/models/model.file --alias qwen2.5-coder:14b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v -ngl 999 --threads 12 --host 0.0.0.0
d6343e513bf23751718a49e7703a912dda9c7b4d8dbb2a9456c78bb05a81e54f
Hi @hosekadam. Please make sure you have proper permissions for /dev/dri and /dev/kfd. I was testing on RHEL 9.5 with the ROCm packages from AMD and they were bogus; after fixing that, it worked for me.
@marceloleitner Thanks for your comment! I just checked and:
$ ls -ld /dev/dri
drwxr-xr-x. 3 root root 100 May 19 11:28 /dev/dri
$ ls -ld /dev/kfd
crw-rw-rw-. 1 root render 235, 0 May 19 11:28 /dev/kfd
Are those the expected permissions, or should I change them to something different?
@afazekas Thanks for your comment too! Is there anything you can recommend I try? I'm sorry, but I don't know what to do to get the ngl value passed. I'm adding it only with the problematic version; for 0.5.5 I'm not passing it, as the GPU works.
I also quickly checked with the latest version, ramalama-0.8.3, and the problem still persists - the GPU is not being utilized.
Since ramalama is trying to use the ramalama image instead of the rocm image, it likely misses something about the GPU. I do not remember anything special about F41 and ROCm; it was OK for me.
Even though the Vulkan one (ramalama image) should work with the GPU too.
My typical issue with ROCm was that the in-CPU (AMD) GPU confused many tools; the simplest solution for that is to simply disable the in-CPU VGA in the BIOS.
Probably checking/debug printing around the container selection logic might give a hint.
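Roughly this kind of selection logic is what I mean (just a sketch to illustrate, not ramalama's actual code; the function name and the env var list are my assumptions):
import os

# Image names as they appear earlier in this thread.
ROCM_IMAGE = "quay.io/ramalama/rocm"
DEFAULT_IMAGE = "quay.io/ramalama/ramalama"

# Hypothetical: env vars that would count as "an accelerator was detected".
ACCEL_ENV_VARS = ("HIP_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")

def select_image(env=None):
    """Pick the ROCm image only when an accelerator env var is present (assumption)."""
    env = os.environ if env is None else env
    if any(var in env for var in ACCEL_ENV_VARS):
        return ROCM_IMAGE
    # If only HSA_OVERRIDE_GFX_VERSION is set, this branch is taken, which would
    # explain why the plain (Vulkan) image gets chosen on this system.
    return DEFAULT_IMAGE

if __name__ == "__main__":
    print(select_image())  # debug-print which image your environment resolves to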
I played with the latest ramalama a bit more and here are some results (in case it helps with debugging):
- I have an AMD 5700X, which doesn't have any integrated GPU, so I don't expect the problem to be there
- about the working Radeon RX 6900 XT: that would be because it's officially supported. The RX 6700 XT is not, so I need to use HSA_OVERRIDE_GFX_VERSION=10.3.0 as described under the officially supported GPUs
- ramalama from the Fedora repo instead of pip
- different models, including smaller versions
- running without the detach parameter and forcing the rocm image provides a bit more info:
command:
$ ramalama --debug --image=quay.io/ramalama/rocm serve qwen-coder:7b --name qwen --ngl 999
output (shortened):
exec_cmd: podman run --rm --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/blobs/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/snapshots/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/rocm:0.8 /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8080 --model /mnt/models/model.file --alias qwen2.5-coder:7b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v --threads 8 --host 0.0.0.0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6700 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
build: 5429 (e298d2fb) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
...
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon RX 6700 XT) - 12222 MiB free
...
load_tensors: tensor 'token_embd.weight' (q4_K) (and 338 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 4460.45 MiB
Based on the output, I expect the GPU is correctly detected, but at some point in the loading process it is not used. Does anybody know what's happening there and possibly how I can fix it?
Might be better off asking at llama.cpp to see if they have an idea what is going on. @ericcurtin @ggerganov Thoughts?
@hosekadam
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
Think you are missing -ngl 999 in your command.
@ggerganov I have --ngl 999 in the command I'm running; it's the last parameter. I tried it again just now to make sure I hadn't forgotten it, but I hadn't - exact same result as before. However, I don't see it in the exec_cmd: line in the ramalama output. Should it be passed there?
I'll ask at llama.cpp and share the solution here later (I hope there will be one).
exec_cmd: podman run ... /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8080 --model /mnt/models/model.file --alias qwen2.5-coder:7b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v --threads 8 --host 0.0.0.0
I am not sure the --ngl line is being passed?
@ericcurtin ^^
I wonder what these commands do on your system:
echo /sys/bus/pci/devices//mem_info_vram_total
cat /sys/bus/pci/devices//mem_info_vram_total
A number above 1073741824 should enable rocm usage by default.
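For reference, the same check scripted (a sketch only, not ramalama code; it just reads the sysfs file mentioned above and compares against the 1 GiB threshold):
import glob

MIN_VRAM = 1073741824  # 1 GiB; above this, ROCm usage should be enabled by default

def max_vram_total():
    """Return the largest mem_info_vram_total value reported by any PCI device."""
    sizes = []
    for path in glob.glob("/sys/bus/pci/devices/*/mem_info_vram_total"):
        try:
            with open(path) as f:
                sizes.append(int(f.read().strip()))
        except (OSError, ValueError):
            continue
    return max(sizes, default=0)

if __name__ == "__main__":
    vram = max_vram_total()
    print(vram, "-> enough for ROCm" if vram > MIN_VRAM else "-> too small or not found")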
--ngl might not be passed when no ROCm (accel method) is detected.
You might try --runtime-args="--ngl 99" or --runtime-args="-ngl 99" to bypass this. We might not need to filter the ngl argument in the no_accel case, since llama.cpp can ignore it when no GPU is detected.
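If that is what is happening, the filtering would look roughly like this (a hypothetical sketch; the helper name and the env var list are made up, not ramalama's real code):
def llama_server_args(ngl, env, base_args):
    """Append -ngl only when an accelerator env var is present (assumed behavior)."""
    args = list(base_args)
    accel = any(var in env for var in ("HIP_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"))
    if accel and ngl is not None:
        args += ["-ngl", str(ngl)]
    # With only HSA_OVERRIDE_GFX_VERSION set, accel stays False and -ngl is dropped,
    # which matches the exec_cmd output above; --runtime-args would bypass this filter.
    return args

print(llama_server_args(999, {"HSA_OVERRIDE_GFX_VERSION": "10.3.0"}, ["llama-server"]))
# -> ['llama-server']  (no -ngl appended)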
@afazekas I tried running $ cat /sys/bus/pci/devices/0000:0a:00.0/mem_info_vram_total, where I identified 0000:0a:00.0 from the lspci output, in which the GPU shows up as 0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] [1002:73df] (rev c5) (describing this in case anybody else hits the same issue). The output is 12868124672:
$ cat /sys/bus/pci/devices/0000:0a:00.0/mem_info_vram_total
12868124672
Running $ ramalama --debug --image=quay.io/ramalama/rocm serve qwen-coder:7b --name qwen --runtime-args="-ngl 999" results in using the GPU! 🎉
$ ramalama --debug --image=quay.io/ramalama/rocm serve qwen-coder:7b --name qwen --runtime-args="-ngl 999"
...
exec_cmd: podman run --rm --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8083 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -p 8083:8083 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8083 --label ai.ramalama.command=serve --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/blobs/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/snapshots/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/rocm:0.8 /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8083 --model /mnt/models/model.file --alias qwen2.5-coder:7b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -ngl 999 -v --threads 8 --host 0.0.0.0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6700 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
build: 5429 (e298d2fb) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
...
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device ROCm0, is_swa = 0
load_tensors: layer 1 assigned to device ROCm0, is_swa = 0
load_tensors: layer 2 assigned to device ROCm0, is_swa = 0
load_tensors: layer 3 assigned to device ROCm0, is_swa = 0
load_tensors: layer 4 assigned to device ROCm0, is_swa = 0
load_tensors: layer 5 assigned to device ROCm0, is_swa = 0
load_tensors: layer 6 assigned to device ROCm0, is_swa = 0
load_tensors: layer 7 assigned to device ROCm0, is_swa = 0
load_tensors: layer 8 assigned to device ROCm0, is_swa = 0
load_tensors: layer 9 assigned to device ROCm0, is_swa = 0
load_tensors: layer 10 assigned to device ROCm0, is_swa = 0
load_tensors: layer 11 assigned to device ROCm0, is_swa = 0
load_tensors: layer 12 assigned to device ROCm0, is_swa = 0
load_tensors: layer 13 assigned to device ROCm0, is_swa = 0
load_tensors: layer 14 assigned to device ROCm0, is_swa = 0
load_tensors: layer 15 assigned to device ROCm0, is_swa = 0
load_tensors: layer 16 assigned to device ROCm0, is_swa = 0
load_tensors: layer 17 assigned to device ROCm0, is_swa = 0
load_tensors: layer 18 assigned to device ROCm0, is_swa = 0
load_tensors: layer 19 assigned to device ROCm0, is_swa = 0
load_tensors: layer 20 assigned to device ROCm0, is_swa = 0
load_tensors: layer 21 assigned to device ROCm0, is_swa = 0
load_tensors: layer 22 assigned to device ROCm0, is_swa = 0
load_tensors: layer 23 assigned to device ROCm0, is_swa = 0
load_tensors: layer 24 assigned to device ROCm0, is_swa = 0
load_tensors: layer 25 assigned to device ROCm0, is_swa = 0
load_tensors: layer 26 assigned to device ROCm0, is_swa = 0
load_tensors: layer 27 assigned to device ROCm0, is_swa = 0
load_tensors: layer 28 assigned to device ROCm0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 292.36 MiB
load_tensors: ROCm0 model buffer size = 4168.09 MiB
It still complains tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead, but I believe that's not a problem, as the GPU is utilized.
❗ Thanks for the help with this issue! The cause was the --ngl parameter not being passed into the podman command executed by ramalama; the workaround is to specify it manually by adding --runtime-args="-ngl 999" (noting this for others with the same issue quickly scrolling over this discussion :))
Is this going to be fixed in some later versions, or is it expected behavior with my setup for some reason?
Looks like when HSA_OVERRIDE_GFX_VERSION is specified, the GPU detection is not run, so HIP_VISIBLE_DEVICES=0 is not set.
You can set HIP_VISIBLE_DEVICES=0 whenever you set HSA_OVERRIDE_GFX_VERSION. Can you check whether that solves everything, including the automatic image selection?
The code also considers 'HSA_VISIBLE_DEVICES' a GPU env var, which is likely not needed/used anywhere and probably should be removed.
The ngl append logic either should not check anything (passing --ngl even without accel; AFAIK it can work, but it should be tested to be sure),
or alternatively HSA_OVERRIDE_GFX_VERSION and CUDA_LAUNCH_BLOCKING should be handled specially, also running the detection part when the corresponding *_VISIBLE_DEVICES is not present.
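Something like this is the idea (a sketch of the proposal only, not the actual patch in the PR; the function name and the detection callback are made up):
import os

def accel_env(detect_gpu_index):
    """Sketch: still run GPU detection when HSA_OVERRIDE_GFX_VERSION is pre-set
    but the corresponding HIP_VISIBLE_DEVICES is missing."""
    env = {}
    if "HSA_OVERRIDE_GFX_VERSION" in os.environ:
        env["HSA_OVERRIDE_GFX_VERSION"] = os.environ["HSA_OVERRIDE_GFX_VERSION"]
        if "HIP_VISIBLE_DEVICES" not in os.environ:
            # Previously detection was effectively skipped here; the idea is to
            # still run it so the ROCm image and -ngl get selected.
            env["HIP_VISIBLE_DEVICES"] = str(detect_gpu_index())
    return env

print(accel_env(lambda: 0))  # with HSA_OVERRIDE_GFX_VERSION set, adds HIP_VISIBLE_DEVICES=0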
@hosekadam @afazekas @rhatdan @marceloleitner @maxamillion
I think this patch could fix it:
https://github.com/containers/ramalama/pull/1475
please test and review.
@hosekadam does:
ramalama serve qwen-coder:7b
work with this patch? If HSA_OVERRIDE_GFX_VERSION=10.3.0 is required, it is possible to add a patch to detect your gpu gfx number and set the env var for your gpu.
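For example, something along these lines could derive the override from rocminfo output (a rough sketch only, not the actual ramalama patch; the digit mapping is simplistic and a gfx1031 card would still need to map onto a supported target such as 10.3.0):
import re
import subprocess

def detect_hsa_override_gfx_version():
    """Sketch: turn the gfx target reported by rocminfo (e.g. gfx1030)
    into an HSA_OVERRIDE_GFX_VERSION string (e.g. 10.3.0)."""
    try:
        out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        return None
    match = re.search(r"\bgfx(\d+)(\d)(\d)\b", out)
    if not match:
        return None
    major, minor, step = match.groups()
    return f"{major}.{minor}.{step}"

if __name__ == "__main__":
    print(detect_hsa_override_gfx_version())  # expected "10.3.0" on a gfx1030 card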
@afazekas With the HIP_VISIBLE_DEVICES=0 env var set, everything works:
$ ramalama --debug serve qwen-coder:7b --name qwen --ngl 999
...
exec_cmd: podman run --rm --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd -e HIP_VISIBLE_DEVICES=0 -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/blobs/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/snapshots/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/rocm:0.8 /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8080 --model /mnt/models/model.file --alias qwen2.5-coder:7b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v -ngl 999 --threads 8 --host 0.0.0.0
...
load_tensors: offloaded 29/29 layers to GPU
@ericcurtin I just tested with HSA_OVERRIDE_GFX_VERSION=10.3.0 set (HIP_VISIBLE_DEVICES is NOT set) and it seems it doesn't work :(
$ ramalama --debug serve qwen-coder:7b --ngl 999 # same result also without the --ngl parameter
...
2025-06-05 15:17:23 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -i --label ai.ramalama --name ramalama_bAtPM2lmxD --env=HOME=/tmp --init --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/blobs/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/snapshots/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/ramalama:0.9 /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8080 --model /mnt/models/model.file --jinja --alias qwen2.5-coder:7b --ctx-size 2048 --temp 0.8 --cache-reuse 256 -v --threads 8 --host 0.0.0.0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 5499 (4265a87b) with cc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
...
load_tensors: layer 28 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 338 others) cannot be used with preferred buffer type Vulkan_Host, using CPU instead
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 4460.45 MiB
I installed the patch from the Fedora 42 copr build, but I'm on Fedora 41. I hope that's not a problem - I'm aware some dependencies could differ between the versions, which could potentially break detection, so I'm mentioning it here. Installed version:
$ rpm -q python3-ramalama
python3-ramalama-0.9.0-1000.1.20250605130250357872.pr1475.17.gff446f9.fc42.noarch
Just to contribute to this issue: I'm using Fedora 42 with ramalama 0.9.1 from PyPI. Also, I'm using a Radeon RX 6750 XT, which is reported as gfx1031 by rocminfo.
If I just set the HIP_VISIBLE_DEVICES=0, it detects the GPU, but the moment I send anything to the prompt I get an error and the model exits, with the following message:
/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: ROCm error
ggml_cuda_compute_forward: RMS_NORM failed
ROCm error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at /llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2362
err
/lib64/libggml-base.so(+0x2da5) [0x7f88816d0da5]
/lib64/libggml-base.so(ggml_print_backtrace+0x1ec) [0x7f88816d116c]
/lib64/libggml-base.so(ggml_abort+0xd6) [0x7f88816d1296]
/lib64/libggml-hip.so(+0xd59c2) [0x7f888188e9c2]
/lib64/libggml-hip.so(+0xdbe69) [0x7f8881894e69]
/lib64/libggml-base.so(ggml_backend_sched_graph_compute_async+0x400) [0x7f88816e6570]
/lib64/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0x90) [0x7f889874eda0]
/lib64/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP20llama_memory_state_iR11ggml_status+0xe9) [0x7f889874f329]
/lib64/libllama.so(_ZN13llama_context6decodeER11llama_batch+0x84d) [0x7f889875460d]
/lib64/libllama.so(llama_decode+0xe) [0x7f88987556ce]
llama-server() [0x49aedc]
llama-server() [0x46c659]
llama-server() [0x43104c]
/lib64/libc.so.6(+0x35f5) [0x7f888114e5f5]
/lib64/libc.so.6(__libc_start_main+0x88) [0x7f888114e6a8]
llama-server() [0x432c35]
Yet, if I use HIP_VISIBLE_DEVICES=0 and also set HSA_OVERRIDE_GFX_VERSION=10.3.0, the GPU is then used as expected.
A friendly reminder that this issue had no activity for 30 days.
If that is the case, should we report an issue upstream in llama.cpp, or does RamaLama need to change something for this to work?
A friendly reminder that this issue had no activity for 30 days.
Since we never heard back on whether this is something RamaLama can fix or something that needs to be addressed in llama.cpp, closing. Reopen if the information changes.