Using ramalama server or run with --rag induces a core dump
Issue Description
I'm experimenting with ramalama and RAG (per https://developers.redhat.com/articles/2025/04/03/simplify-ai-data-integration-ramalama-and-rag) and running into an issue. When I try to serve or run ramalama with RAG (i.e. $ ramalama run --rag quay.io/myrepository/ragdata MODEL), it drops into a prompt ('>') without the prompt prefix in both cases, and entering anything at the prompt produces a stack trace ending in an APIConnectionError, followed by a core dump. If you leave the prompt up and try curl localhost:8080, you get a connection reset, which is the same message as in the aforementioned APIConnectionError. It seems like llama-server (or something else?) in the image is hanging, then crashing when it gets any kind of interaction. I even ran the command and, while the prompt was available, attempted podman attach; that caused a core dump too.
Steps to reproduce the issue
- Run the following (using real files, etc):
$ ramalama rag test.pdf quay.io/myrepository/ragdata
....
$ ramalama run --rag quay.io/myrepository/ragdata mistral:latest
- Observe the process crashing
Describe the results you received
$ ramalama --debug run --rag quay.io/csutherl/ragdata mistral:latest
run_cmd: podman image inspect quay.io/csutherl/ragdata
Working directory: None
Ignore stderr: False
Ignore all: False
Command finished with return code: 0
Checking if 8080 is available
run_cmd: podman inspect quay.io/ramalama/intel-gpu-rag:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
run_cmd: podman inspect quay.io/ramalama/intel-gpu-rag:0.7
Working directory: None
Ignore stderr: False
Ignore all: True
Command finished with return code: 0
exec_cmd: podman run --rm -i --label ai.ramalama --name ramalama_oIzcQcNjjf --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=mistral:latest --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=run --env LLAMA_PROMPT_PREFIX=🦭 > --pull=newer -t -p 8080:8080 --device /dev/dri --device /dev/accel -e INTEL_VISIBLE_DEVICES=1 --network bridge --mount=type=image,source=quay.io/csutherl/ragdata,destination=/rag,rw=true --mount=type=bind,src=/home/csutherl/.local/share/ramalama/models/ollama/mistral:latest,destination=/mnt/models/model.file,ro quay.io/ramalama/intel-gpu-rag:0.7 bash -c nohup llama-server --port 8080 --model /mnt/models/model.file --alias mistral:latest --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v -ngl 999 --threads 11 --host 0.0.0.0 &> /tmp/llama-server.log & rag_framework run /rag/vector.db
> test
Traceback (most recent call last):
File "/usr/local/lib/python3.13/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
yield
File "/usr/local/lib/python3.13/site-packages/httpx/_transports/default.py", line 250, in handle_request
resp = self._pool.handle_request(req)
File "/usr/local/lib/python3.13/site-packages/httpcore/_sync/connection_pool.py", line 256, in handle_request
raise exc from None
File "/usr/local/lib/python3.13/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
response = connection.handle_request(
pool_request.request
)
File "/usr/local/lib/python3.13/site-packages/httpcore/_sync/connection.py", line 101, in handle_request
raise exc
File "/usr/local/lib/python3.13/site-packages/httpcore/_sync/connection.py", line 78, in handle_request
stream = self._connect(request)
File "/usr/local/lib/python3.13/site-packages/httpcore/_sync/connection.py", line 124, in _connect
stream = self._network_backend.connect_tcp(**kwargs)
File "/usr/local/lib/python3.13/site-packages/httpcore/_backends/sync.py", line 207, in connect_tcp
with map_exceptions(exc_map):
~~~~~~~~~~~~~~^^^^^^^^^
File "/usr/lib64/python3.13/contextlib.py", line 162, in __exit__
self.gen.throw(value)
~~~~~~~~~~~~~~^^^^^^^
File "/usr/local/lib/python3.13/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ConnectError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.13/site-packages/openai/_base_client.py", line 969, in request
response = self._client.send(
request,
stream=stream or self._should_stream_response_body(request=request),
**kwargs,
)
File "/usr/local/lib/python3.13/site-packages/httpx/_client.py", line 914, in send
response = self._send_handling_auth(
request,
...<2 lines>...
history=[],
)
File "/usr/local/lib/python3.13/site-packages/httpx/_client.py", line 942, in _send_handling_auth
response = self._send_handling_redirects(
request,
follow_redirects=follow_redirects,
history=history,
)
File "/usr/local/lib/python3.13/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
response = self._send_single_request(request)
File "/usr/local/lib/python3.13/site-packages/httpx/_client.py", line 1014, in _send_single_request
response = transport.handle_request(request)
File "/usr/local/lib/python3.13/site-packages/httpx/_transports/default.py", line 249, in handle_request
with map_httpcore_exceptions():
~~~~~~~~~~~~~~~~~~~~~~~^^
File "/usr/lib64/python3.13/contextlib.py", line 162, in __exit__
self.gen.throw(value)
~~~~~~~~~~~~~~^^^^^^^
File "/usr/local/lib/python3.13/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/bin/rag_framework", line 217, in <module>
args.func(args.vector_path) # pass vector_path argument to the respective function
~~~~~~~~~^^^^^^^^^^^^^^^^^^
File "/usr/bin/rag_framework", line 182, in run_rag
rag.cmdloop()
~~~~~~~~~~~^^
File "/usr/lib64/python3.13/cmd.py", line 146, in cmdloop
stop = self.onecmd(line)
File "/usr/lib64/python3.13/cmd.py", line 223, in onecmd
return self.default(line)
~~~~~~~~~~~~^^^^^^
File "/usr/bin/rag_framework", line 143, in default
self.query(user_content)
~~~~~~~~~~^^^^^^^^^^^^^^
File "/usr/bin/rag_framework", line 105, in query
response = self.llm.chat.completions.create(
model="your-model-name",
messages=[{"role": "user", "content": metaprompt}],
stream=True
)
File "/usr/local/lib/python3.13/site-packages/openai/_utils/_utils.py", line 287, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.13/site-packages/openai/resources/chat/completions/completions.py", line 925, in create
return self._post(
~~~~~~~~~~^
"/chat/completions",
^^^^^^^^^^^^^^^^^^^^
...<43 lines>...
stream_cls=Stream[ChatCompletionChunk],
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/local/lib/python3.13/site-packages/openai/_base_client.py", line 1239, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/openai/_base_client.py", line 1001, in request
raise APIConnectionError(request=request) from err
openai.APIConnectionError: Connection error.
bash: line 1: 3 Aborted (core dumped) nohup llama-server --port 8080 --model /mnt/models/model.file --alias mistral:latest --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v -ngl 999 --threads 11 --host 0.0.0.0 &> /tmp/llama-server.log
Describe the results you expected
A valid response and not a crashed process.
ramalama info output
{
"Accelerator": "intel",
"Engine": {
"Info": {
"host": {
"arch": "amd64",
"buildahVersion": "1.39.4",
"cgroupControllers": [
"cpu",
"io",
"memory",
"pids"
],
"cgroupManager": "systemd",
"cgroupVersion": "v2",
"conmon": {
"package": "conmon-2.1.13-1.fc42.x86_64",
"path": "/usr/bin/conmon",
"version": "conmon version 2.1.13, commit: "
},
"cpuUtilization": {
"idlePercent": 98.36,
"systemPercent": 0.41,
"userPercent": 1.23
},
"cpus": 22,
"databaseBackend": "sqlite",
"distribution": {
"distribution": "fedora",
"variant": "workstation",
"version": "42"
},
"eventLogger": "journald",
"freeLocks": 2039,
"hostname": "myfedora",
"idMappings": {
"gidmap": [
{
"container_id": 0,
"host_id": 17833,
"size": 1
},
{
"container_id": 1,
"host_id": 165536,
"size": 165536
}
],
"uidmap": [
{
"container_id": 0,
"host_id": 17833,
"size": 1
},
{
"container_id": 1,
"host_id": 165536,
"size": 165536
}
]
},
"kernel": "6.14.2-300.fc42.x86_64",
"linkmode": "dynamic",
"logDriver": "journald",
"memFree": 9243951104,
"memTotal": 66819031040,
"networkBackend": "netavark",
"networkBackendInfo": {
"backend": "netavark",
"dns": {
"package": "aardvark-dns-1.14.0-1.fc42.x86_64",
"path": "/usr/libexec/podman/aardvark-dns",
"version": "aardvark-dns 1.14.0"
},
"package": "netavark-1.14.1-1.fc42.x86_64",
"path": "/usr/libexec/podman/netavark",
"version": "netavark 1.14.1"
},
"ociRuntime": {
"name": "crun",
"package": "crun-1.21-1.fc42.x86_64",
"path": "/usr/bin/crun",
"version": "crun version 1.21\ncommit: 10269840aa07fb7e6b7e1acff6198692d8ff5c88\nrundir: /run/user/17833/crun\nspec: 1.0.0\n+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL"
},
"os": "linux",
"pasta": {
"executable": "/usr/bin/pasta",
"package": "passt-0^20250320.g32f6212-2.fc42.x86_64",
"version": ""
},
"remoteSocket": {
"exists": true,
"path": "/run/user/17833/podman/podman.sock"
},
"rootlessNetworkCmd": "pasta",
"security": {
"apparmorEnabled": false,
"capabilities": "CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT",
"rootless": true,
"seccompEnabled": true,
"seccompProfilePath": "/usr/share/containers/seccomp.json",
"selinuxEnabled": true
},
"serviceIsRemote": false,
"slirp4netns": {
"executable": "",
"package": "",
"version": ""
},
"swapFree": 8588644352,
"swapTotal": 8589930496,
"uptime": "290h 0m 4.00s (Approximately 12.08 days)",
"variant": ""
},
"plugins": {
"authorization": null,
"log": [
"k8s-file",
"none",
"passthrough",
"journald"
],
"network": [
"bridge",
"macvlan",
"ipvlan"
],
"volume": [
"local"
]
},
"registries": {
"search": [
"registry.fedoraproject.org",
"registry.access.redhat.com",
"docker.io"
]
},
"store": {
"configFile": "/home/csutherl/.config/containers/storage.conf",
"containerStore": {
"number": 9,
"paused": 0,
"running": 0,
"stopped": 9
},
"graphDriverName": "overlay",
"graphOptions": {},
"graphRoot": "/home/csutherl/.local/share/containers/storage",
"graphRootAllocated": 1022488809472,
"graphRootUsed": 52024283136,
"graphStatus": {
"Backing Filesystem": "btrfs",
"Native Overlay Diff": "true",
"Supports d_type": "true",
"Supports shifting": "false",
"Supports volatile": "true",
"Using metacopy": "false"
},
"imageCopyTmpDir": "/var/tmp",
"imageStore": {
"number": 5
},
"runRoot": "/run/user/17833/containers",
"transientStore": false,
"volumePath": "/home/csutherl/.local/share/containers/storage/volumes"
},
"version": {
"APIVersion": "5.4.2",
"BuildOrigin": "Fedora Project",
"Built": 1743552000,
"BuiltTime": "Tue Apr 1 20:00:00 2025",
"GitCommit": "be85287fcf4590961614ee37be65eeb315e5d9ff",
"GoVersion": "go1.24.1",
"Os": "linux",
"OsArch": "linux/amd64",
"Version": "5.4.2"
}
},
"Name": "podman"
},
"Image": "quay.io/ramalama/intel-gpu:0.7",
"Runtime": "llama.cpp",
"Store": "/home/csutherl/.local/share/ramalama",
"UseContainer": true,
"Version": "0.7.4"
}
Upstream Latest Release
Yes
Additional environment details
No response
Additional information
No response
@ericcurtin @bmahabirbu PTAL
Tested on an M4 and I cannot reproduce the error with the 0.8 images. I suspect it could be the intel-gpu image.
@csutherl can you try this while specifying the CPU-only image (ramalama --image quay.io/ramalama/ramalama ...) and see if that works?
@bmahabirbu yeah, that immediately fails with a ModuleNotFoundError:
$ ramalama --debug --image quay.io/ramalama/ramalama run --rag quay.io/csutherl/ragdata mistral:latest
run_cmd: podman image inspect quay.io/csutherl/ragdata
Working directory: None
Ignore stderr: False
Ignore all: False
Command finished with return code: 0
Checking if 8080 is available
exec_cmd: podman run --rm -i --label ai.ramalama --name ramalama_Mw6bqX1BMZ --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=mistral:latest --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=run --env LLAMA_PROMPT_PREFIX=🦭 > --pull=newer -t -p 8080:8080 --device /dev/dri --device /dev/accel -e INTEL_VISIBLE_DEVICES=1 --network bridge --mount=type=image,source=quay.io/csutherl/ragdata,destination=/rag,rw=true --mount=type=bind,src=/home/csutherl/.local/share/ramalama/models/ollama/mistral:latest,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:0.7 bash -c nohup llama-server --port 8080 --model /mnt/models/model.file --alias mistral:latest --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v -ngl 999 --threads 11 --host 0.0.0.0 &> /tmp/llama-server.log & rag_framework run /rag/vector.db
Trying to pull quay.io/ramalama/ramalama:0.7...
Getting image source signatures
Copying blob 15b7d555f3bb skipped: already exists
Copying blob 3cca36172268 skipped: already exists
Copying blob 3277a691a607 skipped: already exists
Copying blob e98f4315d8c9 skipped: already exists
Copying config 8e941c6e41 done |
Writing manifest to image destination
Traceback (most recent call last):
File "/usr/bin/rag_framework", line 5, in <module>
import qdrant_client
ModuleNotFoundError: No module named 'qdrant_client'
ramalama --debug --image quay.io/ramalama/ramalama-rag run --rag quay.io/csutherl/ragdata mistral:latest
You need to use the ramalama-rag image.
@afazekas PTAL
Using the ramalama-rag image I was able to run successfully, but it's giving bad output, which is what I experienced in #1289. It also took much longer to start outputting anything, so I thought it was hung again. Output below:
$ ramalama --debug --image quay.io/ramalama/ramalama-rag run --rag quay.io/csutherl/ragdata mistral:latest
run_cmd: podman image inspect quay.io/csutherl/ragdata
Working directory: None
Ignore stderr: False
Ignore all: False
Command finished with return code: 0
Checking if 8080 is available
exec_cmd: podman run --rm -i --label ai.ramalama --name ramalama_AcSejiBQmV --env=HOME=/tmp --init --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --label ai.ramalama.model=mistral:latest --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=run --env LLAMA_PROMPT_PREFIX=🦭 > --pull=newer -t -p 8080:8080 --device /dev/dri --device /dev/accel -e INTEL_VISIBLE_DEVICES=1 --network bridge --mount=type=image,source=quay.io/csutherl/ragdata,destination=/rag,rw=true --mount=type=bind,src=/home/csutherl/.local/share/ramalama/models/ollama/mistral:latest,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama-rag:0.7 bash -c nohup llama-server --port 8080 --model /mnt/models/model.file --alias mistral:latest --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v -ngl 999 --threads 11 --host 0.0.0.0 &> /tmp/llama-server.log & rag_framework run /rag/vector.db
Trying to pull quay.io/ramalama/ramalama-rag:0.7...
Getting image source signatures
Copying blob c9c8ac1b9f5c done |
Copying blob e98f4315d8c9 skipped: already exists
Copying blob 15b7d555f3bb skipped: already exists
Copying blob 3cca36172268 skipped: already exists
Copying blob 3277a691a607 skipped: already exists
Copying blob 71d803926e77 done |
Copying config 50298aa7d3 done |
Writing manifest to image destination
> test
breathe dallaYPE août ByName ThereprintStackTraceaz tend Using dallaViewByIdiver Using breat près breat breathe nell UsingimgCtx ottobre tend dallaiver breat ottobreaz ottobre se CONDITIONtot Однаxml Usingtot nellaz sulla nell ...ViewById sullaYPE CONDITION is dalla ps Package Packageaz containViewByIdtot nell ottobreœimg près tendё Userostream Theё I se nell Using sulla Одна ps UsingYPE dalla nell consist UsingœYPE.^{[ près ottobreCtx ottobre CONDITION ostream Userimg dalla ps nell washViewByIdе août sticks sullaCtx wash nellimg CONDITION ...ViewById ... sulla containœ août nell se contain CONDITION.^{[ breatheCtxœ Tot ... aoûttot ... contain Using dallatot breatazostream It ...printStackTraceеIsNull Одна str breathe CONDITIONtot wash ...azViewByIdViewByIdViewById consistCtx wash CONDITION [...]ostream EachCtx tend Usingimg breathe Usingtot CONDITION TotprintStackTraceCtxtot nell breathe Package consist se nell ps Using nellimg sulla breathe nell dallaprintStackTraceYPE Using ottobre.^{[ œtot Using près ОднаprintStackTrace CONDITIONimg UsingprintStackTrace psazostream # dalla Одна ottobretotimgYPE sulla breat UsingprintStackTraceCtx Package Tot Totiverе^C
^C
I also noticed that the prompt doesn't contain the prefix (so I initially thought it was hung when the input I provided didn't immediately return anything).
Thanks @rhatdan. Sorry @csutherl, I didn't specify the ramalama-rag image. So if you run a normal ramalama run or serve with the intel-gpu image, does it work as expected? Specifically, when you run the serve command, is it on port 8080?
I see that you were having issues with the base ramalama image, so if you're seeing the same errors with it, that means it's definitely something to do with the intel-gpu container.
@bmahabirbu no problem :D I'm learning a lot with this experimentation. Yes, the intel-gpu image itself works just fine when running or serving. The issue only crops up when trying to use RAG.
I can reproduce this with an Intel Arc GPU. Using --image quay.io/ramalama/ramalama-rag fixes it and I get proper output, but I also lose the prompt prefix (a minor problem).
[retro@retro2 tmp]❤ ramalama --image quay.io/ramalama/ramalama-rag run llama3.1
🦭 >
[retro@retro2 tmp]❤ ramalama --image quay.io/ramalama/ramalama-rag run --rag test-data llama3.1
>
Anything I can share to help resolve this?
Hey @csutherl,
Thanks for the thorough report! The debug output really helps figure out what’s going on.
So, the main issue here is that llama-server, which serves the LLM model in the ramalama container, is crashing right after it starts. This causes the httpx.ConnectError: [Errno 111] Connection refused that rag_framework (and curl) are running into; the APIConnectionError comes from llama-server being unavailable.
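For what it's worth, the [Errno 111] in the traceback is just the generic "nothing is listening on that port" error. A minimal sketch outside the container reproduces it (assuming the kernel-assigned port is not grabbed by another process in between):

```python
import errno
import socket

# Let the kernel hand out a free port, then release it, so that
# nothing is listening there anymore.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# Connecting now fails with ECONNREFUSED (errno 111 on Linux) --
# the same error that httpx and the openai client wrap as
# APIConnectionError in the traceback above.
try:
    socket.create_connection(("127.0.0.1", port), timeout=1)
except ConnectionRefusedError as exc:
    print("connection refused:", exc.errno == errno.ECONNREFUSED)
```

Any client pointed at the dead server hits this same errno, which is why curl reports a refused/reset connection as well.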
A quick crash like this usually hints at an environment issue, particularly when GPU acceleration is in play, or at a problem loading the model.
Here’s a step-by-step plan to troubleshoot:
1. Check the Internal Log File: Your podman run command sends llama-server's output to /tmp/llama-server.log in the container. This log will show why llama-server crashed. After the ramalama run command fails and the container exits (or quickly, if you can):
   - Find the container's ID or name with podman ps -a.
   - Then get the log using podman cp <container-id-or-name>:/tmp/llama-server.log . (if it crashes too fast, this could be tough).
   - Alternatively, run the llama-server command manually in the container (see step 4).

2. Check Intel GPU/Accelerator Issues: Since you're using --device /dev/dri --device /dev/accel -e INTEL_VISIBLE_DEVICES=1 and offloading with -ngl 999, there may be a problem with your Intel GPU drivers or with how llama.cpp works with your GPU/OpenCL/Level Zero setup.
   - Test 1 (Reduce NGL): Try setting -ngl to a smaller number like 20, or 0 for CPU only, to see if it runs: $ ramalama run --rag quay.io/csutherl/ragdata mistral:latest --params=-ngl 20
   - Test 2 (Verify Host Drivers): Make sure your Fedora 42 system has the latest Intel GPU drivers and all the necessary parts for llama.cpp to use the GPU. Sometimes updating or reinstalling GPU drivers fixes this.

3. Check Model Integrity: Is the mistral:latest model file at /home/csutherl/.local/share/ramalama/models/ollama/mistral:latest completely downloaded and not corrupted? You might want to check the model's integrity (e.g. with a checksum) or run it with a standard ollama run mistral:latest, if you have Ollama installed separately, to make sure the model is fine.

4. Run llama-server Manually: Try isolating the llama-server startup. Get into the container's shell with:

   podman run --rm -it --network bridge --device /dev/dri --device /dev/accel -e INTEL_VISIBLE_DEVICES=1 --mount=type=image,source=quay.io/csutherl/ragdata,destination=/rag,rw=true --mount=type=bind,src=/home/csutherl/.local/share/ramalama/models/ollama/mistral:latest,destination=/mnt/models/model.file,ro quay.io/ramalama/intel-gpu-rag:0.7 bash

   Once inside, run this command directly:

   /usr/bin/llama-server --port 8080 --model /mnt/models/model.file --alias mistral:latest --ctx-size 2048 --temp 0.8 --jinja --cache-reuse 256 -v -ngl 999 --threads 11 --host 0.0.0.0

   Watch the output to get immediate feedback on what's causing the crash.

5. Check Podman Logs: After the ramalama run fails, check podman logs <container-id-or-name> for any relevant messages before the crash.
Looking at either llama-server's output or its log should help you figure out the reason for the core dump. I'm guessing it's related to either the GPU setup or the model file.
Let me know what you find in the llama-server.log or if you have any luck running it manually!
Best,
Suman Suhag
Thanks @sumansuhag for the detailed debugging steps; this should help narrow the issue down! I have a hunch it has to do with the Intel GPU drivers interacting with llama.cpp. It could be that the container drivers and llama.cpp versions weren't locked down to preserve the functionality.
Hey Brian,
No problem at all! I’m happy the debugging steps helped.
You’re right about the Intel GPU drivers and their interaction with llama.cpp. The version differences between the container drivers and llama.cpp could really be causing the issues you're seeing. Mismatched versions can definitely cause weird problems like core dumps, especially with hardware acceleration in containers.
Let me know what you figure out as you dig deeper. I’m looking forward to hearing how it turns out!
I am running into the exact same issue @csutherl originally outlined when trying to utilize RAG using the following command:
ramalama --image quay.io/ramalama/ramalama-rag run --rag quay.io/rh-ee-istaplet/ramallama:latest llama3.1 (and have tried other models such as granite and mistral)
This image quay.io/rh-ee-istaplet/ramallama:latest was generated by running ramalama rag ./virtual.pdf quay.io/rh-ee-istaplet/ramallama
I have tried some of the above debugging steps and have not been successful in getting Rag working.
My laptop does not have a dedicated GPU. Is a dedicated GPU required for doing Rag with Ramalama?
Hi Isaiah,
Thanks for reaching out and providing details on the issue you're encountering with Ramalama RAG!
It's helpful to know you're seeing the same behavior @csutherl outlined, and that you've tried different models.
Regarding your question about a dedicated GPU for RAG with Ramalama:
- For pure CPU-based RAG operations and smaller models, a dedicated GPU is often not strictly required. Many RAG setups, especially for basic retrieval and even inference with smaller LLMs, can run on a CPU.
- However, for performance with larger LLMs or complex RAG pipelines, a dedicated GPU is highly recommended, and sometimes effectively necessary. Without a GPU, the inference process for the language model part of RAG can be very slow, potentially leading to timeouts or unacceptably long response times. The "llama3.1", "granite", and "mistral" models, while varying in size, can still be quite demanding for CPU-only inference.
Given you're having trouble getting RAG "working" (which could mean it's too slow, or genuinely failing), the lack of a dedicated GPU is a strong candidate for being a bottleneck, especially if the models are defaulting to CPU execution due to no GPU being found.
To help diagnose further, consider:
- Any specific error messages or timeouts you're seeing in the console beyond just "not working."
- How long the command is running before it seems to fail or stop.
- Ramalama's documentation for any specific hardware requirements or recommendations for RAG deployments.
While it might be possible to run it on CPU, the performance impact can be severe. If you're encountering timeouts, the CPU might simply not be fast enough to process the model's inference within typical timeout windows.
EDIT: what previously was written here is probably not relevant for this issue (based on the timestamp when the issue was reported and when 0.8.5 was released). I reported a separate bug https://github.com/containers/ramalama/issues/1521
@simi @thom311 Still having problems?
Yes, still having problems @rhatdan. Tested with 0.11.1; with --rag it crashes the server.
bash: line 1: 3 Aborted (core dumped) nohup llama-server --port 8080 --model /mnt/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf --no-warmup --jinja --log-colors --alias TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf --ctx-size 2048 --temp 0.8 --cache-reuse 256 -ngl 999 --threads 11 --host 0.0.0.0 &> /tmp/llama-server.log
I see this as an upstream issue with llama.cpp at this point, and the only thing RamaLama might be able to do is always select the ramalama/Vulkan image rather than the intel-gpu image.
Have you opened an issue with llama.cpp on this?
I have not opened one, since I don't know what to report there. It was tested on an Intel GPU-based notebook.
I think the stack dump log and information about your hardware, to start. Then we can see what they say.
Since we have had no further feedback, I am going to close this issue. Reopen if this is still a problem with RamaLama.