
Radeon RX 6700 XT not utilized

hosekadam opened this issue 8 months ago • 6 comments

I updated ramalama from 0.5.4 to 0.7.2 (it just worked as I needed it and I forgot to update it, great job btw!) because I want to experiment with the rag option, and I see my GPU is not being utilized by the ramalama serve command. I dug into it a bit, and this issue has been happening since 0.6.0. It happens even with --ngl 999 set; in 0.5.4 I was using the --gpu switch to force the GPU.

I have a Radeon RX 6700 XT, the HSA_OVERRIDE_GFX_VERSION=10.3.0 env var (to override the GPU version so it is treated as supported) is set, Fedora 41, podman 5.4.1.

Ramalama 0.7.2

Bench command

I see the GPU is detected and is utilized by the bench command.

Output from ramalama --debug bench qwen-coder:14b, with a screenshot from radeontop, on ramalama==0.7.2:

$ ramalama --debug bench qwen-coder:14b
run_cmd:  podman inspect quay.io/ramalama/ramalama:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
run_cmd:  podman inspect quay.io/ramalama/ramalama:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
exec_cmd:  podman run --rm -i --label ai.ramalama --name ramalama_ugaXCEl6nT --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=ollama://qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.command=bench --pull=newer -t --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --network none --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:0.7 llama-bench --threads 8 -m /mnt/models/model.file
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |         pp512 |        252.24 ± 0.51 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |         tg128 |         35.81 ± 0.05 |

build: ef19c717 (4958)

GPU: [radeontop screenshot] CPU: [screenshot]

Serve command

When I run ramalama --debug serve qwen-coder:14b --name qwen --detach -p 12401 --ngl 999, the output is (with utilization while working with the model):

$ ramalama --debug serve qwen-coder:14b --name qwen --detach -p 12401 --ngl 999
run_cmd:  podman inspect quay.io/ramalama/ramalama:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
run_cmd:  podman inspect quay.io/ramalama/ramalama:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
run_cmd:  podman inspect quay.io/ramalama/ramalama:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
exec_cmd:  podman run --rm -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=ollama://qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --pull=newer -t -d -p 12401:12401 --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:0.7 llama-server --port 12401 --model /mnt/models/model.file --alias qwen2.5-coder:14b --ctx-size 2048 --temp 0.8 -v --threads 8 --host 0.0.0.0
3836dd741633060d5ec96f0ba314d4a00b08ce89210147b3ed3f10a4da82d558

GPU (it does something at the beginning; this is the peak, but when it starts generating text it goes idle and the CPU does all the work): [screenshot] CPU (during generation): [screenshot]

Ramalama 0.5.5 (latest working version)

For comparison, utilization and outputs from 0.5.5:

Bench

$ ramalama --debug bench qwen-coder:14b
run_cmd:  podman inspect quay.io/ramalama/ramalama:0.5
Working directory: None
Ignore stderr: False
Ignore all: True
exec_cmd:  podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_MYV4dXiypb --pull=newer -t --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:latest llama-bench -m /mnt/models/model.file
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |         pp512 |        251.02 ± 0.28 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |         tg128 |         30.72 ± 0.06 |

build: ef19c717 (4958)

[radeontop screenshot]

Serve

$ ramalama --gpu --debug serve qwen-coder:14b --name qwen --detach -p 12401
run_cmd:  podman inspect quay.io/ramalama/ramalama:0.5
Working directory: None
Ignore stderr: False
Ignore all: True
exec_cmd:  podman run --rm -i --label RAMALAMA --security-opt=label=disable --name qwen --pull=newer -t -d -p 12401:12401 --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:latest llama-server --port 12401 -m /mnt/models/model.file -c 2048 --temp 0.8 -ngl 999 --host 0.0.0.0
4ed40093b328a33eb05529c77737006f4eabf4ae9dddf2a8b6b27c27f4fbf61f

GPU (much faster generation, the GPU is definitely used): [screenshot] CPU (during generation): [screenshot]

hosekadam • Apr 07 '25

One major change that happened between ramalama 0.5.4 and 0.7.2 is that we switched the runtime image from UBI with AMD's ROCm packages to Fedora 42 with Fedora's ROCm packages, because the Fedora ROCm maintainer enabled a lot more AMD gfx series and that provided the most coverage of consumer hardware.

However, I don't think that's your issue, because the command output indicates it is using the quay.io/ramalama/ramalama:latest container image, which doesn't contain any of the AMD ROCm accelerator-enabled machinery. In light of that, what I think is weird is that it's touching your GPU at all for the bench run. Something is wonky there.

Can you please provide the output of the following command? rocminfo | grep gfx

Also, can you try this command for your serving attempt and see if that changes the behavior at all? (This will force the use of the AMD ROCm enabled image, since it seems like ramalama isn't detecting your card correctly and selecting that image.)

ramalama --image=quay.io/ramalama/rocm --gpu --debug serve qwen-coder:14b --name qwen --detach -p 12401

maxamillion • Apr 15 '25

  • rocminfo | grep gfx:
$ rocminfo | grep gfx
  Name:                    gfx1030                            
      Name:                    amdgcn-amd-amdhsa--gfx1030
  • ramalama --image=quay.io/ramalama/rocm --debug serve qwen-coder:14b --name qwen --detach -p 12401 on ramalama 0.7.2 (without --gpu as it's not supported):
$ ramalama --image=quay.io/ramalama/rocm --debug serve qwen-coder:14b --name qwen --detach -p 12401
exec_cmd:  podman run --rm -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=ollama://qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --pull=newer -t -d -p 12401:12401 --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/rocm llama-server --port 12401 --model /mnt/models/model.file --alias qwen2.5-coder:14b --ctx-size 2048 --temp 0.8 -v --threads 8 --host 0.0.0.0
Trying to pull quay.io/ramalama/rocm:latest...
Getting image source signatures
Copying blob b931a26c663b done   | 
Copying blob fc454ec91464 done   | 
Copying blob f5b2012436d8 done   | 
Copying config 4d428a2f62 done   | 
Writing manifest to image destination
604036a057accfe803ef4dbe429ce30ac2cc834a7e012c90abac6a7f5157a7f6

and the GPU is not being utilized

  • ramalama --image=quay.io/ramalama/rocm --debug serve qwen-coder:14b --name qwen --detach -p 12401 on ramalama 0.7.4 (without --gpu as it's not supported):
$ ramalama --image=quay.io/ramalama/rocm --debug serve qwen-coder:14b --name qwen --detach -p 12401
exec_cmd:  podman run --rm -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=ollama://qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --pull=newer -t -d -p 12401:12401 --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/rocm:0.7 llama-server --port 12401 --model /mnt/models/model.file --alias qwen2.5-coder:14b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v --threads 8 --host 0.0.0.0
Trying to pull quay.io/ramalama/rocm:0.7...
Getting image source signatures
Copying blob b931a26c663b skipped: already exists  
Copying blob fc454ec91464 skipped: already exists  
Copying blob f5b2012436d8 skipped: already exists  
Copying config 4d428a2f62 done   | 
Writing manifest to image destination
d4f43c371c4c3b4b30300dde6a8c8d3f8f6430be4f80fb3060b37a62890b7ac8

Still the same, the GPU is not utilized.

If there is anything else I could try or provide, feel free to ping me. I like the project, so it's a pleasure for me to help you make it better.

hosekadam • Apr 16 '25

Radeon RX 6900 XT works for me (main branch).

I do not see ngl passed in the podman invocation from your output:

exec_cmd: podman run --rm -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=ollama://qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --pull=newer -t -d -p 12401:12401 --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --mount=type=bind,src=/home/hosek/.local/share/ramalama/models/ollama/qwen2.5-coder:14b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:0.7 llama-server --port 12401 --model /mnt/models/model.file --alias qwen2.5-coder:14b --ctx-size 2048 --temp 0.8 -v --threads 8 --host 0.0.0.0

Mine: ramalama --debug serve qwen2.5-coder:14b --name qwen --detach -p 12401 --ngl 999

exec_cmd: podman run --rm --label ai.ramalama.model=qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd -e HIP_VISIBLE_DEVICES=0 -p 12401:12401 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -d -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --label ai.ramalama.model=qwen2.5-coder:14b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=12401 --label ai.ramalama.command=serve --mount=type=bind,src=/home/turul/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/blobs/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/turul/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/snapshots/sha256-ac9bc7a69dab38da1c790838955f1293420b55ab555ef6b4615efa1c1507b1ed/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/rocm:0.7 llama-server --port 12401 --model /mnt/models/model.file --alias qwen2.5-coder:14b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v -ngl 999 --threads 12 --host 0.0.0.0
d6343e513bf23751718a49e7703a912dda9c7b4d8dbb2a9456c78bb05a81e54f

afazekas • Apr 24 '25

Hi @hosekadam. Please make sure you have proper permissions for /dev/dri and /dev/kfd. I was testing on RHEL 9.5 with the ROCm packages from AMD and they were bogus. After fixing them, it worked for me.

marceloleitner • May 09 '25

@marceloleitner Thanks for your comment! I just checked and:

$ ls -ld /dev/dri
drwxr-xr-x. 3 root root 100 May 19 11:28 /dev/dri
$ ls -ld /dev/kfd
crw-rw-rw-. 1 root render 235, 0 May 19 11:28 /dev/kfd

Are those expected permissions, or should I change to something different?

@afazekas Thanks for your comment too! Is there anything you can recommend I try? I'm sorry, but I don't know what to do to get the ngl value passed. I'm adding it only with the problematic version; for 0.5.5 I'm not passing it, as the GPU works there.

I also quickly checked with the latest version, ramalama-0.8.3, and the problem still persists: the GPU is not being utilized.

hosekadam • May 19 '25

Since ramalama is trying to use the ramalama image instead of the rocm image, it likely misses something about the GPU. I do not remember anything special about F41 and ROCm; it was OK for me.

Even though Vulkan (the ramalama image) should work with the GPU too.

My typical issue with ROCm was that the integrated GPU in the (AMD) CPU confused many tools; the simplest solution for that is to simply disable the in-CPU VGA in the BIOS.

Probably checking or adding debug prints around the container selection logic might give a hint.
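For example, a minimal sketch of such a debug helper (hypothetical code, not ramalama's actual implementation; it just dumps the inputs that image selection would plausibly look at, i.e. the env vars and sysfs paths discussed in this thread):

import glob
import os

def debug_gpu_selection_inputs():
    # Env vars commonly consulted by GPU/accelerator selection logic.
    for var in ("HIP_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES",
                "HSA_OVERRIDE_GFX_VERSION"):
        print(f"{var}={os.environ.get(var)}")
    # amdgpu exposes VRAM size in sysfs; a discrete card reports well over 1 GiB.
    for path in glob.glob("/sys/bus/pci/devices/*/mem_info_vram_total"):
        with open(path) as f:
            print(path, f.read().strip())
    # Presence of the DRM/ROCm device nodes that get passed into the container.
    for dev in ("/dev/dri", "/dev/kfd"):
        print(dev, "present" if os.path.exists(dev) else "missing")

debug_gpu_selection_inputs()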

afazekas • May 20 '25

I played with the latest ramalama a bit more and here are some results (in case it helps with debugging):

  • I have an AMD 5700X, which doesn't have any integrated GPU, so I don't expect the problem is there
  • Regarding the working Radeon RX 6900 XT: that would be because it's officially supported. The RX 6700 XT is not; I need to use HSA_OVERRIDE_GFX_VERSION=10.3.0, as described under the officially supported GPUs
  • I tried ramalama from the Fedora repo instead of pip
  • I tried different models, including smaller versions
  • Running without the detach parameter and forcing the rocm image provides a bit more info. Command: $ ramalama --debug --image=quay.io/ramalama/rocm serve qwen-coder:7b --name qwen --ngl 999, output (shortened):
exec_cmd:  podman run --rm --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/blobs/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/snapshots/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/rocm:0.8 /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8080 --model /mnt/models/model.file --alias qwen2.5-coder:7b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v --threads 8 --host 0.0.0.0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6700 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
build: 5429 (e298d2fb) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
...
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon RX 6700 XT) - 12222 MiB free
...
load_tensors: tensor 'token_embd.weight' (q4_K) (and 338 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  4460.45 MiB

Based on the output, I expect the GPU is correctly detected, but somewhere in the loading process it is not used. Does anybody know what's happening there and how I could possibly fix it?

hosekadam • Jun 04 '25

Might be better off asking at llama.cpp to see if they have an idea what is going on. @ericcurtin @ggerganov Thoughts?

rhatdan • Jun 05 '25

@hosekadam

load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU

Think you are missing -ngl 999 in your command.

ggerganov • Jun 05 '25

@ggerganov I have --ngl 999 in the command I'm running; it's the last parameter. I tried it again now to make sure I didn't forget it, but I didn't: the exact same result as before. But I don't see it in the exec_cmd: line of the ramalama output. Should it be passed there?

I'll ask at llama.cpp and share the solution here later (I hope there will be one).

hosekadam • Jun 05 '25

exec_cmd: podman run ... /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8080 --model /mnt/models/model.file --alias qwen2.5-coder:7b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v --threads 8 --host 0.0.0.0

I am not sure the --ngl option is being passed?

@ericcurtin ^^

rhatdan • Jun 05 '25

I wonder what these commands do on your system:

echo /sys/bus/pci/devices/*/mem_info_vram_total
cat /sys/bus/pci/devices/*/mem_info_vram_total

A number above 1073741824 (1 GiB) should enable ROCm usage by default.

--ngl might not be passed when no ROCm (accel method) is detected.

You might try --runtime-args="--ngl 99" or --runtime-args="-ngl 99" to bypass this. We might not need to filter the ngl argument in the no_accel case, since llama.cpp can ignore it when no GPU is detected.
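As a rough illustration of that logic (an assumption on my part, not ramalama's exact code): acceleration counts as available only when some GPU reports more than 1 GiB of VRAM in sysfs, and only then would -ngl be appended to the llama-server arguments:

import glob

GIB = 1024 ** 3  # 1073741824 bytes

def amd_vram_bytes() -> int:
    # Largest VRAM size reported by any PCI device via the amdgpu driver.
    best = 0
    for path in glob.glob("/sys/bus/pci/devices/*/mem_info_vram_total"):
        try:
            with open(path) as f:
                best = max(best, int(f.read().strip()))
        except (OSError, ValueError):
            continue
    return best

def build_server_args(ngl: int) -> list[str]:
    args = ["llama-server", "--port", "8080"]
    if amd_vram_bytes() > GIB:    # accel detected, so -ngl is appended
        args += ["-ngl", str(ngl)]
    return args                   # otherwise -ngl is silently dropped

print(build_server_args(999))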

afazekas • Jun 05 '25

@afazekas I tried running the command $ cat /sys/bus/pci/devices/0000:0a:00.0/mem_info_vram_total, where I identified 0000:0a:00.0 from the output of lspci, in which the GPU is listed as 0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] [1002:73df] (rev c5) (describing this in case anybody else has the same issue), and the output is 12868124672:

$ cat /sys/bus/pci/devices/0000:0a:00.0/mem_info_vram_total
12868124672

Running $ ramalama --debug --image=quay.io/ramalama/rocm serve qwen-coder:7b --name qwen --runtime-args="-ngl 999" results in using the GPU! 🎉

$ ramalama --debug --image=quay.io/ramalama/rocm serve qwen-coder:7b --name qwen --runtime-args="-ngl 999"
...
exec_cmd:  podman run --rm --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8083 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -p 8083:8083 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8083 --label ai.ramalama.command=serve --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/blobs/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/snapshots/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/rocm:0.8 /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8083 --model /mnt/models/model.file --alias qwen2.5-coder:7b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -ngl 999 -v --threads 8 --host 0.0.0.0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6700 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
build: 5429 (e298d2fb) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
...
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device ROCm0, is_swa = 0
load_tensors: layer   1 assigned to device ROCm0, is_swa = 0
load_tensors: layer   2 assigned to device ROCm0, is_swa = 0
load_tensors: layer   3 assigned to device ROCm0, is_swa = 0
load_tensors: layer   4 assigned to device ROCm0, is_swa = 0
load_tensors: layer   5 assigned to device ROCm0, is_swa = 0
load_tensors: layer   6 assigned to device ROCm0, is_swa = 0
load_tensors: layer   7 assigned to device ROCm0, is_swa = 0
load_tensors: layer   8 assigned to device ROCm0, is_swa = 0
load_tensors: layer   9 assigned to device ROCm0, is_swa = 0
load_tensors: layer  10 assigned to device ROCm0, is_swa = 0
load_tensors: layer  11 assigned to device ROCm0, is_swa = 0
load_tensors: layer  12 assigned to device ROCm0, is_swa = 0
load_tensors: layer  13 assigned to device ROCm0, is_swa = 0
load_tensors: layer  14 assigned to device ROCm0, is_swa = 0
load_tensors: layer  15 assigned to device ROCm0, is_swa = 0
load_tensors: layer  16 assigned to device ROCm0, is_swa = 0
load_tensors: layer  17 assigned to device ROCm0, is_swa = 0
load_tensors: layer  18 assigned to device ROCm0, is_swa = 0
load_tensors: layer  19 assigned to device ROCm0, is_swa = 0
load_tensors: layer  20 assigned to device ROCm0, is_swa = 0
load_tensors: layer  21 assigned to device ROCm0, is_swa = 0
load_tensors: layer  22 assigned to device ROCm0, is_swa = 0
load_tensors: layer  23 assigned to device ROCm0, is_swa = 0
load_tensors: layer  24 assigned to device ROCm0, is_swa = 0
load_tensors: layer  25 assigned to device ROCm0, is_swa = 0
load_tensors: layer  26 assigned to device ROCm0, is_swa = 0
load_tensors: layer  27 assigned to device ROCm0, is_swa = 0
load_tensors: layer  28 assigned to device ROCm0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   292.36 MiB
load_tensors:        ROCm0 model buffer size =  4168.09 MiB

It still complains about "tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead", but I believe it's not a problem, as the GPU is utilized.

❗ Thanks for the help with this issue! The cause was the --ngl parameter not being passed into the podman command executed by ramalama, and the need to specify it manually by adding --runtime-args="-ngl 999" (noting this for others with the same issue who are quickly scrolling through this discussion :))

Is this going to be fixed in a later version, or is it expected behavior with my setup for some reason?

hosekadam • Jun 05 '25

Looks like when HSA_OVERRIDE_GFX_VERSION is specified, the GPU detection does not run, so HIP_VISIBLE_DEVICES=0 is not set.

afazekas • Jun 05 '25

You can set HIP_VISIBLE_DEVICES=0 when you set HSA_OVERRIDE_GFX_VERSION. Can you check if that solves everything, even the automatic image selection?

The code also considers 'HSA_VISIBLE_DEVICES' a GPU env var, which is likely not needed or used anywhere and probably should be removed.

The ngl append logic could also skip the check entirely (passing --ngl even without accel; AFAIK that can work, but it should be tested to be sure).

Alternatively, HSA_OVERRIDE_GFX_VERSION and CUDA_LAUNCH_BLOCKING could be handled specially, with the detection part also running when the corresponding *_VISIBLE_DEVICES is not present.
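A minimal sketch of that alternative, with hypothetical function names (not ramalama's real code): treat the override variables as pass-through tuning only, and still run detection unless a *_VISIBLE_DEVICES variable is already set:

import os

# Device-selection variables: if one of these is already set, skip detection.
VISIBLE_DEVICE_VARS = ("HIP_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")
# Override/tuning variables: pass them through, but do not let them
# suppress detection on their own. ('HSA_VISIBLE_DEVICES' is left out
# on purpose, per the point above.)
TUNING_VARS = ("HSA_OVERRIDE_GFX_VERSION", "CUDA_LAUNCH_BLOCKING")

def gpu_env(detect_amd) -> dict:
    env = {v: os.environ[v]
           for v in TUNING_VARS + VISIBLE_DEVICE_VARS if v in os.environ}
    if not any(v in env for v in VISIBLE_DEVICE_VARS) and detect_amd():
        # detect_amd() would check e.g. sysfs VRAM size as in the earlier sketch
        env["HIP_VISIBLE_DEVICES"] = "0"
    return env

# Example: HSA_OVERRIDE_GFX_VERSION alone no longer disables detection.
print(gpu_env(lambda: True))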

afazekas • Jun 05 '25

@hosekadam @afazekas @rhatdan @marceloleitner @maxamillion

I think this patch could fix it:

https://github.com/containers/ramalama/pull/1475

please test and review.

ericcurtin • Jun 05 '25

@hosekadam does:

ramalama serve qwen-coder:7b

work with this patch? If HSA_OVERRIDE_GFX_VERSION=10.3.0 is required, it is possible to add a patch to detect your GPU's gfx number and set the env var for your GPU.
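A rough sketch of what such a patch could do (my assumption, not the actual change in the PR): read the gfx target from rocminfo and, for RDNA2 cards other than gfx1030, set HSA_OVERRIDE_GFX_VERSION=10.3.0:

import re
import subprocess

def detected_gfx():
    """Return the gfx target reported by rocminfo, e.g. '1031', or None."""
    try:
        out = subprocess.run(["rocminfo"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    m = re.search(r"\bgfx(\d+)\b", out)
    return m.group(1) if m else None

def hsa_override_for(gfx):
    """gfx1031/1032/... cards can usually run the gfx1030 (10.3.0) kernels."""
    if gfx and gfx.startswith("103") and gfx != "1030":
        return "10.3.0"
    return None

gfx = detected_gfx()
print(f"gfx={gfx} -> HSA_OVERRIDE_GFX_VERSION={hsa_override_for(gfx)}")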

ericcurtin • Jun 05 '25

@afazekas With the HIP_VISIBLE_DEVICES=0 env var set, everything works:

$ ramalama --debug serve qwen-coder:7b --name qwen --ngl 999
...
exec_cmd:  podman run --rm --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd -e HIP_VISIBLE_DEVICES=0 -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -i --label ai.ramalama --name qwen --env=HOME=/tmp --init --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/blobs/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/snapshots/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/rocm:0.8 /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8080 --model /mnt/models/model.file --alias qwen2.5-coder:7b --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v -ngl 999 --threads 8 --host 0.0.0.0
...
load_tensors: offloaded 29/29 layers to GPU

@ericcurtin I just tested with HSA_OVERRIDE_GFX_VERSION=10.3.0 set (HIP_VISIBLE_DEVICES is NOT set) and it seems it doesn't work :(

$ ramalama --debug serve qwen-coder:7b --ngl 999 # same result also without the --ngl parameter
...
2025-06-05 15:17:23 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=ollama://qwen2.5-coder:7b --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -i --label ai.ramalama --name ramalama_bAtPM2lmxD --env=HOME=/tmp --init --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/blobs/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/hosek/.local/share/ramalama/store/ollama/qwen2.5-coder/qwen2.5-coder/snapshots/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/ramalama:0.9 /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8080 --model /mnt/models/model.file --jinja --alias qwen2.5-coder:7b --ctx-size 2048 --temp 0.8 --cache-reuse 256 -v --threads 8 --host 0.0.0.0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 5499 (4265a87b) with cc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
...
load_tensors: layer  28 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 338 others) cannot be used with preferred buffer type Vulkan_Host, using CPU instead
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  4460.45 MiB

I installed the patch from a Fedora 42 Copr build, but I'm on Fedora 41. I hope that's not a problem; I have in mind that some dependencies could change between the versions, which could potentially break detection, so I'm mentioning it here just in case. Installed version:

$ rpm -q  python3-ramalama
python3-ramalama-0.9.0-1000.1.20250605130250357872.pr1475.17.gff446f9.fc42.noarch

hosekadam • Jun 05 '25

Just to contribute to this issue: I'm using Fedora 42 and ramalama version 0.9.1 from PyPI. Also, I'm using a Radeon RX 6750 XT, which is reported as gfx1031 by rocminfo.

If I just set HIP_VISIBLE_DEVICES=0, it detects the GPU, but the moment I send anything to the prompt I get an error and the model exits, with the following message:

/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: ROCm errorggml_cuda_compute_forward: RMS_NORM failed
ROCm error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at /llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2362
  err

/lib64/libggml-base.so(+0x2da5) [0x7f88816d0da5]
/lib64/libggml-base.so(ggml_print_backtrace+0x1ec) [0x7f88816d116c]
/lib64/libggml-base.so(ggml_abort+0xd6) [0x7f88816d1296]
/lib64/libggml-hip.so(+0xd59c2) [0x7f888188e9c2]
/lib64/libggml-hip.so(+0xdbe69) [0x7f8881894e69]
/lib64/libggml-base.so(ggml_backend_sched_graph_compute_async+0x400) [0x7f88816e6570]
/lib64/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0x90) [0x7f889874eda0]
/lib64/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP20llama_memory_state_iR11ggml_status+0xe9) [0x7f889874f329]
/lib64/libllama.so(_ZN13llama_context6decodeER11llama_batch+0x84d) [0x7f889875460d]
/lib64/libllama.so(llama_decode+0xe) [0x7f88987556ce]
llama-server() [0x49aedc]
llama-server() [0x46c659]
llama-server() [0x43104c]
/lib64/libc.so.6(+0x35f5) [0x7f888114e5f5]
/lib64/libc.so.6(__libc_start_main+0x88) [0x7f888114e6a8]
llama-server() [0x432c35]

Yet, if I use HIP_VISIBLE_DEVICES=0 and also set HSA_OVERRIDE_GFX_VERSION=10.3.0, the GPU is then used as expected.

decko • Jun 17 '25

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] • Jul 25 '25

If that is the case, should we report an issue upstream in llama.cpp, or does RamaLama need to change something for this to work?

rhatdan avatar Jul 25 '25 10:07 rhatdan

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] • Aug 26 '25

Since we never heard back about whether this is something RamaLama can fix or something that needs to be addressed in llama.cpp, closing. Reopen if the information changes.

rhatdan • Aug 26 '25