[Bug]: Chatting with Ollama models uses much more VRAM than running them in cmd
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly ( 非英文标题的提交将会被直接关闭 ) (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
.
RAGFlow image version
nightly
Other environment information
Actual behavior
When I execute `ollama run qwen3:32b` and then `ollama ps`,
it shows 45 GB of VRAM occupied.
But when I try to bind this model to RAGFlow, it reports that 71 GB of VRAM is needed, and about 30% of the parameters end up loaded on the CPU.
I thought it was an Ollama problem, but I uninstalled and reinstalled several versions and the issue persists.
I don't know whether it is caused by RAGFlow.
Could you clarify how RAGFlow invokes the Ollama server?
Expected behavior
No response
Steps to reproduce
Binding qwen3:32b to RAGFlow fails, although the model runs smoothly in cmd.
Additional information
No response
RAGFlow invokes Ollama models through dedicated adapter classes (like OllamaChat, OllamaEmbed, and OllamaCV) that use the Ollama Python client to communicate with the Ollama server over HTTP. These adapters allow RAGFlow to set options such as context size, batch size, and memory mapping (mmap), all of which can significantly affect VRAM usage depending on how they're configured and the size of the model or prompt being processed. There is no evidence of extra VRAM tuning or alternative invocation mechanisms beyond these adapters, so the resource usage is largely determined by the parameters passed to the Ollama server and the server/model configuration itself.
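For illustration, here is a minimal sketch (not RAGFlow's actual adapter code) of what such an invocation looks like through the Ollama Python client; the host URL, model name, and option values below are assumptions:

```python
# Minimal sketch of an adapter-style call to the Ollama server.
# Host URL, model name, and option values are illustrative assumptions.
from ollama import Client

client = Client(host="http://localhost:11434")  # the base URL configured for the model

response = client.chat(
    model="qwen3:32b",
    messages=[{"role": "user", "content": "Hello"}],
    options={
        "num_ctx": 32768,   # context window: the biggest driver of KV-cache VRAM
        "num_batch": 512,   # batch size used during prompt processing
        "use_mmap": True,   # memory-map the weights instead of copying them
    },
)
print(response["message"]["content"])
```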
The higher VRAM usage you observe when binding the qwen3:32b model to RAGFlow (71GB vs. 45GB in direct Ollama CLI use) may be due to RAGFlow's default or dynamic context size, batch processing, or other options that increase memory pressure. Features like reranker, RAPTOR retrieval, auto-keyword, and auto-question can also increase resource usage—disabling them may help.
To reduce VRAM usage, consider: disabling the above features, using the slim Docker image with `RAGFLOW_SKIP_DEEPDOC=1`, or limiting GPU usage with the `CUDA_VISIBLE_DEVICES` environment variable. If you encounter GPU memory allocation errors, you can run the RAGFlow server on CPU, as GPU acceleration is not required for the server itself.
There are no open issues that directly address this specific VRAM usage difference, but performance and resource allocation challenges with Ollama models in RAGFlow have been discussed, and the above configuration changes have helped others in similar situations.
If you need more granular control over VRAM usage, you may want to experiment with the context size and batch parameters in your RAGFlow Ollama model configuration, or consider hosting the model with vLLM for better concurrency and memory management.
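As a rough way to experiment (a hypothetical probe, not a RAGFlow feature), you can send the same prompt with different `num_ctx`/`num_batch` values through the Ollama Python client and watch how the reported size in `ollama ps` changes:

```python
# Hypothetical probe: send the same prompt with different context/batch settings,
# then check the SIZE column of `ollama ps` after each call.
from ollama import Client

client = Client(host="http://localhost:11434")
for num_ctx in (8192, 16384, 32768):
    client.chat(
        model="qwen3:32b",
        messages=[{"role": "user", "content": "ping"}],
        options={"num_ctx": num_ctx, "num_batch": 256},
    )
    input(f"num_ctx={num_ctx}: check `ollama ps` in another terminal, then press Enter...")
```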
What code is executed when trying to bind Ollama models, and how can I adjust the RAGFlow Ollama model configuration? @dosu
When you bind Ollama models in RAGFlow, the framework uses dedicated adapter classes—OllamaChat, OllamaEmbed, and OllamaCV—to interface with the Ollama server via the Ollama Python client. These classes are responsible for initializing the connection (using the host URL and optional authorization headers) and for sending requests to the Ollama server for chat, embedding, or vision tasks. For example, OllamaChat dynamically calculates the context window size based on the token count of the message history and sets options like context size (num_ctx), temperature, max tokens (num_predict), top_p, presence_penalty, and frequency_penalty for each chat request. These options are passed as part of the gen_conf parameter and directly affect memory/VRAM usage [source].
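As a hedged sketch (not RAGFlow's exact code; the token heuristic and default values below are assumptions), the mapping from `gen_conf` to Ollama options described above looks roughly like this:

```python
from ollama import Client

def ollama_chat(host, model, history, gen_conf):
    """Sketch of an OllamaChat-style request; heuristics and defaults are assumptions."""
    client = Client(host=host)
    # Rough token estimate from the message history; a longer history means a
    # larger dynamically chosen context window, and therefore a larger KV cache.
    approx_tokens = sum(len(m["content"]) // 3 for m in history)
    options = {
        "num_ctx": max(8192, 1 << approx_tokens.bit_length()),   # grows with history length
        "temperature": gen_conf.get("temperature", 0.7),
        "num_predict": gen_conf.get("max_tokens", 512),           # max tokens to generate
        "top_p": gen_conf.get("top_p", 0.9),
        "presence_penalty": gen_conf.get("presence_penalty", 0.0),
        "frequency_penalty": gen_conf.get("frequency_penalty", 0.0),
    }
    resp = client.chat(model=model, messages=history, options=options)
    return resp["message"]["content"]
```

If the dynamically chosen `num_ctx` ends up much larger than what the Ollama CLI uses by default, that alone can account for a large part of the VRAM gap.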
To adjust RAGFlow Ollama model configuration for VRAM management, you can:
- Modify the context size and max tokens in your model configuration (these are the most impactful for VRAM usage). For chat models, this is typically done by setting
max_tokensin the configuration file (e.g.,conf/llm_factories.json) or by passing options in the API call [source]. - For embedding models, the OllamaEmbed class uses
use_mmap: Trueby default, which can also affect VRAM usage [source]. - You can specify different model factories or override defaults by appending the factory name to the model name using the
@symbol in your configuration [source].
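For the embedding path, a hedged sketch of an OllamaEmbed-style call (the embedding model name is a placeholder; only the `use_mmap: True` default comes from the note above):

```python
# Sketch of an OllamaEmbed-style request; the embedding model name is a placeholder.
from ollama import Client

client = Client(host="http://localhost:11434")
resp = client.embeddings(
    model="bge-m3",                      # placeholder embedding model name
    prompt="a chunk of text to embed",
    options={"use_mmap": True},          # matches the use_mmap default noted above
)
print(len(resp["embedding"]))
```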
There is no direct facility in RAGFlow to tune lower-level Ollama or GPU/VRAM parameters beyond these options; deeper resource management (like GPU selection or advanced memory mapping) must be handled via the Ollama server itself or its environment variables.
If you want to further reduce VRAM usage, consider lowering the context size, disabling features like reranker or RAPTOR retrieval, or running RAGFlow on CPU if GPU memory is a constraint. For advanced tuning, refer to Ollama’s own documentation for server-side options.
I solved my problem after setting `OLLAMA_NUM_GPU=1` and updating Ollama to the latest version.
> I solved my problem after setting `OLLAMA_NUM_GPU=1` and updating Ollama to the latest version.

I still can't get it to work. May I ask if you have configured anything else, such as environment variables?
> I solved my problem after setting `OLLAMA_NUM_GPU=1` and updating Ollama to the latest version.

Mine doesn't work. My Ollama is 0.9.0, and it cannot be used with RAGFlow 0.19.