[Bug]: Chatting with Ollama models uses much more VRAM than running them in cmd
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly ( 非英文标题的提交将会被直接关闭 ) (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
.
RAGFlow image version
nightly
Other environment information
Actual behavior
When I execute `ollama run qwen3:32b` and then `ollama ps`,
it shows 45 GB of VRAM occupied.
But when I try to bind this model to RAGFlow, it reports that 71 GB of VRAM is needed, and about 30% of the parameters end up loaded on the CPU.
I thought it was an Ollama problem, but I uninstalled and reinstalled several versions and the issue persists.
I don't know whether it is caused by RAGFlow.
Could you clarify how RAGFlow invokes the Ollama server?
Expected behavior
No response
Steps to reproduce
Binding qwen3:32b to RAGFlow fails, although the model runs smoothly in cmd.
Additional information
No response
RAGFlow invokes Ollama models through dedicated adapter classes (like OllamaChat, OllamaEmbed, and OllamaCV) that use the Ollama Python client to communicate with the Ollama server over HTTP. These adapters allow RAGFlow to set options such as context size, batch size, and memory mapping (mmap), all of which can significantly affect VRAM usage depending on how they're configured and the size of the model or prompt being processed. There is no evidence of extra VRAM tuning or alternative invocation mechanisms beyond these adapters, so the resource usage is largely determined by the parameters passed to the Ollama server and the server/model configuration itself.
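For illustration, here is a minimal sketch (not RAGFlow's actual adapter code) of what such an invocation looks like through the Ollama Python client; the host URL, model name, and option values below are assumptions:

```python
# Minimal sketch of an adapter-style call to the Ollama server.
# Host URL, model name, and option values are illustrative assumptions.
from ollama import Client

client = Client(host="http://localhost:11434")  # the base URL configured for the model

response = client.chat(
    model="qwen3:32b",
    messages=[{"role": "user", "content": "Hello"}],
    options={
        "num_ctx": 32768,   # context window: the biggest driver of KV-cache VRAM
        "num_batch": 512,   # batch size used during prompt processing
        "use_mmap": True,   # memory-map the weights instead of copying them
    },
)
print(response["message"]["content"])
```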
The higher VRAM usage you observe when binding the qwen3:32b model to RAGFlow (71GB vs. 45GB in direct Ollama CLI use) may be due to RAGFlow's default or dynamic context size, batch processing, or other options that increase memory pressure. Features like reranker, RAPTOR retrieval, auto-keyword, and auto-question can also increase resource usage—disabling them may help.
To reduce VRAM usage, consider: disabling the above features, using the slim Docker image with `RAGFLOW_SKIP_DEEPDOC=1`, or limiting GPU usage with the `CUDA_VISIBLE_DEVICES` environment variable. If you encounter GPU memory allocation errors, you can run the RAGFlow server on CPU, as GPU acceleration is not required for the server itself.
There are no open issues that directly address this specific VRAM usage difference, but performance and resource allocation challenges with Ollama models in RAGFlow have been discussed, and the above configuration changes have helped others in similar situations.
If you need more granular control over VRAM usage, you may want to experiment with the context size and batch parameters in your RAGFlow Ollama model configuration, or consider hosting the model with vLLM for better concurrency and memory management.
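As a rough way to experiment (a hypothetical probe, not a RAGFlow feature), you can send the same prompt with different `num_ctx`/`num_batch` values through the Ollama Python client and watch how the reported size in `ollama ps` changes:

```python
# Hypothetical probe: send the same prompt with different context/batch settings,
# then check the SIZE column of `ollama ps` after each call.
from ollama import Client

client = Client(host="http://localhost:11434")
for num_ctx in (8192, 16384, 32768):
    client.chat(
        model="qwen3:32b",
        messages=[{"role": "user", "content": "ping"}],
        options={"num_ctx": num_ctx, "num_batch": 256},
    )
    input(f"num_ctx={num_ctx}: check `ollama ps` in another terminal, then press Enter...")
```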
What code is executed when trying to bind Ollama models, and how can I adjust the RAGFlow Ollama model configuration? @dosu
When you bind Ollama models in RAGFlow, the framework uses dedicated adapter classes—OllamaChat, OllamaEmbed, and OllamaCV—to interface with the Ollama server via the Ollama Python client. These classes are responsible for initializing the connection (using the host URL and optional authorization headers) and for sending requests to the Ollama server for chat, embedding, or vision tasks. For example, OllamaChat dynamically calculates the context window size based on the token count of the message history and sets options like context size (num_ctx), temperature, max tokens (num_predict), top_p, presence_penalty, and frequency_penalty for each chat request. These options are passed as part of the gen_conf parameter and directly affect memory/VRAM usage [source].
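As a hedged sketch (not RAGFlow's exact code; the token heuristic and default values below are assumptions), the mapping from `gen_conf` to Ollama options described above looks roughly like this:

```python
from ollama import Client

def ollama_chat(host, model, history, gen_conf):
    """Sketch of an OllamaChat-style request; heuristics and defaults are assumptions."""
    client = Client(host=host)
    # Rough token estimate from the message history; a longer history means a
    # larger dynamically chosen context window, and therefore a larger KV cache.
    approx_tokens = sum(len(m["content"]) // 3 for m in history)
    options = {
        "num_ctx": max(8192, 1 << approx_tokens.bit_length()),   # grows with history length
        "temperature": gen_conf.get("temperature", 0.7),
        "num_predict": gen_conf.get("max_tokens", 512),           # max tokens to generate
        "top_p": gen_conf.get("top_p", 0.9),
        "presence_penalty": gen_conf.get("presence_penalty", 0.0),
        "frequency_penalty": gen_conf.get("frequency_penalty", 0.0),
    }
    resp = client.chat(model=model, messages=history, options=options)
    return resp["message"]["content"]
```

If the dynamically chosen `num_ctx` ends up much larger than what the Ollama CLI uses by default, that alone can account for a large part of the VRAM gap.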
To adjust RAGFlow Ollama model configuration for VRAM management, you can:
- Modify the context size and max tokens in your model configuration (these are the most impactful for VRAM usage). For chat models, this is typically done by setting
max_tokensin the configuration file (e.g.,conf/llm_factories.json) or by passing options in the API call [source]. - For embedding models, the OllamaEmbed class uses
use_mmap: Trueby default, which can also affect VRAM usage [source]. - You can specify different model factories or override defaults by appending the factory name to the model name using the
@symbol in your configuration [source].
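For the embedding path, a hedged sketch of an OllamaEmbed-style call (the embedding model name is a placeholder; only the `use_mmap: True` default comes from the note above):

```python
# Sketch of an OllamaEmbed-style request; the embedding model name is a placeholder.
from ollama import Client

client = Client(host="http://localhost:11434")
resp = client.embeddings(
    model="bge-m3",                      # placeholder embedding model name
    prompt="a chunk of text to embed",
    options={"use_mmap": True},          # matches the use_mmap default noted above
)
print(len(resp["embedding"]))
```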
There is no direct facility in RAGFlow to tune lower-level Ollama or GPU/VRAM parameters beyond these options; deeper resource management (like GPU selection or advanced memory mapping) must be handled via the Ollama server itself or its environment variables.
If you want to further reduce VRAM usage, consider lowering the context size, disabling features like reranker or RAPTOR retrieval, or running RAGFlow on CPU if GPU memory is a constraint. For advanced tuning, refer to Ollama’s own documentation for server-side options.
I solved my problem after setting `OLLAMA_NUM_GPU=1` and updating Ollama to the latest version.
> I solved my problem after setting `OLLAMA_NUM_GPU=1` and updating Ollama to the latest version.

I still can't get it to work. May I ask if you have configured anything else, such as environment variables?
> I solved my problem after setting `OLLAMA_NUM_GPU=1` and updating Ollama to the latest version.

Mine doesn't work. My Ollama is 0.9.0, and it cannot be used with RAGFlow 0.19.