[Feature Request]: Can we add configuration items for customizing the API request rate and token quantity?
Is there an existing issue for the same feature request?
- [x] I have checked the existing issues.
Is your feature request related to a problem?
Recently, when using SiliconAPI, I found that an RPM (requests per minute) error occurred during document parsing, which caused parsing to fail. As a workaround, I modified the Dockerfile to install the ratelimit and tiktoken packages during the build, and added a modified class to the llm directory so that rate limit errors no longer occur when requesting the chat model, embedding model, rerank model, etc.
Describe the feature you'd like
Please add configuration items for customizing the API request rate (RPM) and token quantity (TPM), so that requests to the chat model, embedding model, rerank model, etc. can be throttled and retried instead of failing with rate limit errors. My current workaround (installing ratelimit and tiktoken in the Dockerfile and patching a class in the llm directory) shows the kind of behavior I'm after; a sketch of that approach is below.
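For reference, a minimal sketch of that workaround, combining the `ratelimit` and `tiktoken` packages to cap both requests per minute and tokens per minute. The `LLM_MAX_RPM` / `LLM_MAX_TPM` environment variables and the `chat_fn` callable are illustrative placeholders, not existing RAGFlow settings:

```python
# Sketch only: wrap an OpenAI-compatible chat call with RPM and TPM caps.
# LLM_MAX_RPM / LLM_MAX_TPM and chat_fn are assumptions for illustration,
# not part of RAGFlow; the real chat classes live under rag/llm/.
import os
import time
import threading

import tiktoken
from ratelimit import limits, sleep_and_retry

RPM = int(os.environ.get("LLM_MAX_RPM", "60"))       # assumed env var
TPM = int(os.environ.get("LLM_MAX_TPM", "100000"))   # assumed env var

_enc = tiktoken.get_encoding("cl100k_base")
_token_lock = threading.Lock()
_window_start = time.monotonic()
_tokens_in_window = 0


def _reserve_tokens(n: int):
    """Block until n tokens fit into the current one-minute TPM window."""
    global _window_start, _tokens_in_window
    while True:
        with _token_lock:
            now = time.monotonic()
            if now - _window_start >= 60:
                _window_start, _tokens_in_window = now, 0
            if _tokens_in_window + n <= TPM:
                _tokens_in_window += n
                return
        time.sleep(1)


@sleep_and_retry               # sleep until the window resets instead of raising
@limits(calls=RPM, period=60)  # at most RPM calls per 60 seconds
def rate_limited_chat(messages, chat_fn):
    """chat_fn is whatever function actually sends the request (placeholder)."""
    prompt_tokens = sum(len(_enc.encode(m["content"])) for m in messages)
    _reserve_tokens(prompt_tokens)
    return chat_fn(messages)
```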
Describe implementation you've considered
No response
Documentation, adoption, use case
Additional information
No response
export MAX_CONCURRENT_CHATS=10
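A rough sketch of how a `MAX_CONCURRENT_CHATS`-style setting could be enforced with a semaphore; whether RAGFlow actually reads this exact variable is not confirmed here, and `send_request` is a placeholder for the real model call:

```python
# Sketch only: cap the number of chat completions in flight at once,
# using a hypothetical MAX_CONCURRENT_CHATS environment variable.
import os
import threading

_max_chats = int(os.environ.get("MAX_CONCURRENT_CHATS", "10"))
_chat_slots = threading.BoundedSemaphore(_max_chats)


def chat_with_limit(send_request, *args, **kwargs):
    """send_request stands in for the actual provider call (placeholder)."""
    with _chat_slots:  # at most _max_chats requests run concurrently
        return send_request(*args, **kwargs)
```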
@carcoonzyk @kostya-sec LLM chat already supports rate limiting and retry: https://github.com/infiniflow/ragflow/blob/94181a990b957ed302952b4de17583d2b44f3099/rag/llm/chat_model.py#L178
You can do a similar thing for embedding models.
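Applied to embedding models, the same retry-with-backoff idea could look roughly like the following. `encode_batch` is a placeholder for the provider's embedding request, and this is not the actual code in rag/llm/chat_model.py:

```python
# Sketch only: retry an embedding request with exponential backoff on
# rate-limit style failures. encode_batch is a hypothetical callable.
import random
import time


def embed_with_retry(encode_batch, texts, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return encode_batch(texts)
        except Exception as e:
            # Retry only on rate-limit style failures; re-raise anything else.
            if "rate limit" not in str(e).lower() or attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay * (2 ** attempt) + random.random())
```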