[Bug]: EXTREME HIGH CPU while using OLLAMA after upgrading ragflow to 0.19
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
no
RAGFlow image version
0.19.0
Other environment information
Actual behavior
EXTREME HIGH CPU while using OLLAMA after upgrading RAGFlow to 0.19. It seems that Ollama is using the CPU instead of the GPU when called from RAGFlow. There was no such problem with RAGFlow version 0.17.0.
Expected behavior
No response
Steps to reproduce
Use Ollama in RAGFlow 0.19.0.
Additional information
No response
High GPU usage with Ollama in RAGFlow 0.19.0 is a known issue and has been discussed in several threads. The main causes are often related to how RAGFlow configures Ollama model options like context size, batch size, and memory mapping, which can significantly increase VRAM/GPU usage compared to previous versions. Features such as reranker, RAPTOR retrieval, auto-keyword, and auto-question can also add to GPU load.

Disabling these features, lowering the context size, or adjusting the batch size can help reduce GPU usage. For some users, switching from Ollama to vLLM has provided better concurrency and lower GPU overhead, especially for large models or high request volumes. If you want to further limit GPU usage, you can also set the CUDA_VISIBLE_DEVICES environment variable, or run RAGFlow on CPU if GPU memory is a constraint.

There are no direct code changes in 0.19.0 that force higher GPU usage, but the default or dynamic settings may be more aggressive than in 0.17.0. For more details and configuration examples, see the discussions in issue #7981 and issue #7535.
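If the context size is the culprit, one way to test it is to rebuild the Ollama model with a smaller context window and, optionally, pin Ollama to one GPU. This is only a sketch, not RAGFlow's own mechanism; the base model tag, the `num_ctx` value, and the GPU index are assumptions to adapt to your setup:

```sh
# Sketch: rebuild the model with a smaller context window so it fits in VRAM.
# The base model tag and the 4096 value are assumptions; adjust to your setup.
cat > Modelfile <<'EOF'
FROM deepseek-r1:14b
PARAMETER num_ctx 4096
EOF
ollama create deepseek-r1-smallctx -f Modelfile

# Optionally restrict Ollama to a single GPU (device index 0 is an assumption).
CUDA_VISIBLE_DEVICES=0 ollama serve
```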
@KevinHuSh It's quite urgent. I want to use the features of 0.19, but I cannot while this problem exists.
```sh
OLLAMA_KEEP_ALIVE=-1 ollama serve
```

OR:

```sh
ollama run llama3.1:70b --keepalive=-1m
```

OR, revert the change from https://github.com/infiniflow/ragflow/commit/5b626870d05167026049c6ccf122494971459ee5
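To check whether the keep-alive workaround took effect, `ollama ps` lists the currently loaded models; with keep_alive set to -1 the UNTIL column should read something like "Forever" (exact wording may vary by Ollama version):

```sh
# List loaded models; the UNTIL column shows when each will be unloaded.
ollama ps
```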
Thanks. I will give it a try.
@KevinHuSh I have tried all three ways you suggested, and none of them works. The problem still exists. I wonder why this problem occurs at all? Version 0.17.0 was completely OK.
@KevinHuSh Is there any other plan to fix this problem? Sorry for taking up so much of your time. It's really urgent for me.
Try the latest code. There's a PR about this issue.
@KevinHuSh I have updated to the latest code. For some agents using a single LLM from Ollama it is OK. However, for this agent, which I use most frequently, the problem still exists.
As you can see, in this agent I use three different LLMs from Ollama. When the agent starts, GPU and CPU are completely fine during the first LLM (No. 1 in picture 1). However, when the agent reaches the last LLM (No. 2 in picture 1), the GPU has a problem and the CPU suddenly goes extremely high (picture 2).
The original problem I had been facing since updating RAGFlow to 0.19 was that when the agent reached the last LLM, GPU usage dropped to 0.9 GB. Now the situation has changed, but only to the one shown in picture 2. Whether I update the code or not, when it comes to the last LLM the CPU is always extremely high. I changed the last LLM in my agent from deepseek32B-max2 (for which I extended the ctx to 16384) to the original deepseek32B, and then to gemma3; whichever LLM I use, the problem always happens.
mistral-small uses approximately 15 GB, and deepseek32B uses about 21 GB. I think the problem is that while deepseek32B is running, mistral-small is not released and keeps running.
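If both models stay resident, their combined footprint (roughly 36 GB) would exceed a typical 24 GB GPU, forcing Ollama to offload layers to the CPU, which would be consistent with the symptoms described. Assuming that is what is happening, one way to test it is to explicitly unload the previous model before the next one runs. Per Ollama's documentation, a generate request with keep_alive set to 0 releases the model immediately (the model name below is the one from this thread):

```sh
# Unload mistral-small before the next model runs: a generate request with
# keep_alive set to 0 asks Ollama to release the model immediately.
curl http://localhost:11434/api/generate \
  -d '{"model": "mistral-small", "keep_alive": 0}'
```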
@KevinHuSh any other plan to solve this problem ? Thanks.
@KevinHuSh @RNGMARTIN I think part of this problem is that Ollama, for whatever reason, doesn't release the connection. I am starting to think this is why GraphRAG fails on parsing: it will hold the connection for many hours. I have heard that vLLM and other options might solve this, but I like the workflow of Ollama!
Is RAGFlow closing the connection when it is finished? There could be race conditions that prevent this from happening in some situations.
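A quick way to probe the held-connection theory from the client side is to force the TCP connection closed after each request and see whether the behavior changes. This is only a diagnostic sketch; the model name is the one mentioned earlier in the thread:

```sh
# Diagnostic only: disable HTTP keep-alive so the client closes the TCP
# connection after the request, to test whether held-open connections
# are what keeps the model pinned.
curl --no-keepalive \
  http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:70b", "prompt": "ping", "stream": false}'
```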
This problem only happens after updating RAGFlow to version 0.19. I am now using 0.17, which has no such problem. I don't know what changed that caused this problem.
I have the same issue even though I have updated to 0.19.1. What I did is simply:
- create an assistant
- add a Knowledge Base to that assistant
- select an open-source model from Ollama, e.g. deepseek-r1:14b, qwen3:14b, etc.
- start a new chat

Initially the GPU usage spikes, which is expected; then, soon after, GPU usage drops and CPU usage spikes, with all CPU cores fully occupied for a very long time until the chat is completed.
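One way to watch that handoff while reproducing the chat is to poll `ollama ps`, whose PROCESSOR column shows the CPU/GPU split for each loaded model, alongside nvidia-smi. A diagnostic sketch; the two-second interval is arbitrary:

```sh
# Poll the CPU/GPU split and VRAM usage every 2 seconds while chatting.
watch -n 2 'ollama ps; nvidia-smi --query-gpu=memory.used,memory.total --format=csv'
```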
@RNGMARTIN wondering if you figured out a workaround?