ragflow [Question]: How can I improve the concurrency performance of the ragflow_stream_output api?

Describe your problem

For the ragflow_stream_output api, with 1, 10, and 100 concurrent requests, the first-token latency was 0.6719 s, 4.7593 s, and 41.9158 s, respectively. Because every request has to go through the retrieval step, the api scales poorly under concurrency, which makes it unsuitable for large-scale applications. How can I improve the concurrency performance of the ragflow_stream_output api?
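For readers who want to reproduce this kind of measurement, here is a minimal sketch of timing first-token latency under concurrent load. It is not the script used for the numbers above. `fake_stream` is a hypothetical stand-in for the real streaming call, and `first_token_latency`, `run_load_test`, and `measure` are illustrative names; swap in an actual streaming client to test a live server.

```python
import asyncio
import statistics
import time
from typing import AsyncIterator, Callable, List


async def fake_stream(delay: float = 0.05) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming response; replace with a real client call."""
    await asyncio.sleep(delay)  # simulates retrieval plus time to first token
    for token in ("Hello", " ", "world"):
        yield token


async def first_token_latency(make_stream: Callable[[], AsyncIterator[str]]) -> float:
    """Return the seconds elapsed between starting the request and the first chunk."""
    start = time.perf_counter()
    async for _ in make_stream():
        return time.perf_counter() - start
    raise RuntimeError("stream ended before any token arrived")


async def run_load_test(make_stream, concurrency: int) -> List[float]:
    """Start `concurrency` requests at once and collect each first-token latency."""
    tasks = [first_token_latency(make_stream) for _ in range(concurrency)]
    return await asyncio.gather(*tasks)


def measure(concurrency: int, make_stream=fake_stream) -> List[float]:
    """Synchronous entry point around the async load test."""
    return asyncio.run(run_load_test(make_stream, concurrency))


if __name__ == "__main__":
    for n in (1, 10, 100):
        latencies = measure(n)
        print(f"concurrency={n:>3}  mean first-token latency={statistics.mean(latencies):.4f}s")
```

Reporting the mean hides the tail; with real traffic it is usually worth printing the maximum or a high percentile as well.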

Nov 25 '24 16:11 Weishaoya

Change the run_simple function in api/ragflow_server.py.

Nov 26 '24 01:11 KevinHuSh

Change the run_simple function in api/ragflow_server.py. I replaced the run_simple call with Gunicorn. With workers set to 10, the concurrency performance of ragflow_stream_output improves, but each worker loads its own copy of the embedding model, so the model is loaded 10 times on gpu-0. Is there a better way to improve the concurrency performance? Thank you!
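A rough sketch of what such a Gunicorn setup might look like as a config file, for illustration only. The port, the thread count, the `gthread` worker class, and the `api.apps:app` module path in the launch command are assumptions, not details given in this thread; only `workers = 10` comes from the description above.

```python
# gunicorn.conf.py: hypothetical config file, adjust values to your deployment.
# Assumed launch command (the module path is an assumption):
#   gunicorn -c gunicorn.conf.py api.apps:app

# Address to listen on. The port is assumed; use the one your API server is configured with.
bind = "0.0.0.0:9380"

# Number of worker processes. Each worker is a separate process that sets up
# the application on its own, so an embedding model loaded inside the app
# ends up loaded once per worker.
workers = 10

# Assumption: threaded workers, so one process can hold several long-lived
# streaming responses open at the same time.
worker_class = "gthread"
threads = 8
```

Because memory use grows with `workers`, raising `threads` instead is one way to add concurrency without adding model copies, at the cost of sharing one Python process per worker.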

Nov 26 '24 16:11 Weishaoya

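Change the run_simple function in api/ragflow_server.py. I replaced the run_simple call with Gunicorn. With workers set to 10, the concurrency performance of ragflow_stream_output improves, but each worker loads its own copy of the embedding model, so the model is loaded 10 times on gpu-0. Is there a better way to improve the concurrency performance? Thank you!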

How did you test it? Could you share the code? Thanks.

Mar 10 '25 01:03 Mr-greenplus

Change the run_simple function in api/ragflow_server.py. I replaced the run_simple call with Gunicorn. With workers set to 10, the concurrency performance of ragflow_stream_output improves, but each worker loads its own copy of the embedding model, so the model is loaded 10 times on gpu-0. Is there a better way to improve the concurrency performance? Thank you!

We solved it by serving the embedding model from a separate service with vLLM. Changing the embedding model loading logic, taking it out of the app, and running it as a singleton would probably also work, but that apparently takes more time to code XD
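To make the separate-service idea concrete, here is a minimal sketch of a client that asks a remote embedding service for vectors instead of loading a model in every worker. It assumes the service exposes an OpenAI-style `/v1/embeddings` endpoint, which vLLM's OpenAI-compatible server provides. The URL, the model name, and the helper names `build_payload`, `parse_embeddings`, and `embed` are hypothetical and not taken from RAGFlow.

```python
import json
import urllib.request
from typing import List

# Assumed values: point these at wherever the embedding service actually runs.
EMBEDDING_URL = "http://localhost:8000/v1/embeddings"
EMBEDDING_MODEL = "my-embedding-model"  # hypothetical model name


def build_payload(texts: List[str], model: str = EMBEDDING_MODEL) -> bytes:
    """Serialize an OpenAI-style embeddings request body."""
    return json.dumps({"model": model, "input": texts}).encode("utf-8")


def parse_embeddings(body: bytes) -> List[List[float]]:
    """Extract the vectors from an OpenAI-style embeddings response, in input order."""
    items = json.loads(body)["data"]
    items.sort(key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts: List[str], url: str = EMBEDDING_URL, timeout: float = 30.0) -> List[List[float]]:
    """Send texts to the remote embedding service rather than an in-process model."""
    request = urllib.request.Request(
        url,
        data=build_payload(texts),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embeddings(response.read())
```

With this split, every web worker shares the one model copy held by the embedding service, so the number of workers no longer multiplies GPU memory use.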

May 21 '25 07:05 wizounovziki
