[Feature Request]: Can we add configuration items for customizing the API request rate and token quantity?
Is there an existing issue for the same feature request?
- [x] I have checked the existing issues.
Is your feature request related to a problem?
Recently, when using SiliconAPI, I found that an RPM (requests per minute) error occurred during document parsing, which caused parsing to fail. As a workaround, I modified the Dockerfile to install the ratelimit and tiktoken packages during the build, and added a modified class to the llm directory so that rate limit errors no longer occur when requesting the chat model, embedding model, rerank model, etc.
Describe the feature you'd like
Please add configuration items for customizing the API request rate (RPM) and token quantity (TPM), so that requests to the chat model, embedding model, rerank model, etc. can be throttled and retried instead of failing with rate limit errors. My current workaround (installing ratelimit and tiktoken in the Dockerfile and patching a class in the llm directory) shows the kind of behavior I'm after; a sketch of that approach is below.
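For reference, a minimal sketch of that workaround, combining the `ratelimit` and `tiktoken` packages to cap both requests per minute and tokens per minute. The `LLM_MAX_RPM` / `LLM_MAX_TPM` environment variables and the `chat_fn` callable are illustrative placeholders, not existing RAGFlow settings:

```python
# Sketch only: wrap an OpenAI-compatible chat call with RPM and TPM caps.
# LLM_MAX_RPM / LLM_MAX_TPM and chat_fn are assumptions for illustration,
# not part of RAGFlow; the real chat classes live under rag/llm/.
import os
import time
import threading

import tiktoken
from ratelimit import limits, sleep_and_retry

RPM = int(os.environ.get("LLM_MAX_RPM", "60"))       # assumed env var
TPM = int(os.environ.get("LLM_MAX_TPM", "100000"))   # assumed env var

_enc = tiktoken.get_encoding("cl100k_base")
_token_lock = threading.Lock()
_window_start = time.monotonic()
_tokens_in_window = 0


def _reserve_tokens(n: int):
    """Block until n tokens fit into the current one-minute TPM window."""
    global _window_start, _tokens_in_window
    while True:
        with _token_lock:
            now = time.monotonic()
            if now - _window_start >= 60:
                _window_start, _tokens_in_window = now, 0
            if _tokens_in_window + n <= TPM:
                _tokens_in_window += n
                return
        time.sleep(1)


@sleep_and_retry               # sleep until the window resets instead of raising
@limits(calls=RPM, period=60)  # at most RPM calls per 60 seconds
def rate_limited_chat(messages, chat_fn):
    """chat_fn is whatever function actually sends the request (placeholder)."""
    prompt_tokens = sum(len(_enc.encode(m["content"])) for m in messages)
    _reserve_tokens(prompt_tokens)
    return chat_fn(messages)
```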
Describe implementation you've considered
No response
Documentation, adoption, use case
Additional information
No response
export MAX_CONCURRENT_CHATS=10
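A rough sketch of how a `MAX_CONCURRENT_CHATS`-style setting could be enforced with a semaphore; whether RAGFlow actually reads this exact variable is not confirmed here, and `send_request` is a placeholder for the real model call:

```python
# Sketch only: cap the number of chat completions in flight at once,
# using a hypothetical MAX_CONCURRENT_CHATS environment variable.
import os
import threading

_max_chats = int(os.environ.get("MAX_CONCURRENT_CHATS", "10"))
_chat_slots = threading.BoundedSemaphore(_max_chats)


def chat_with_limit(send_request, *args, **kwargs):
    """send_request stands in for the actual provider call (placeholder)."""
    with _chat_slots:  # at most _max_chats requests run concurrently
        return send_request(*args, **kwargs)
```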
@carcoonzyk @kostya-sec LLM chat already supports rate limiting and retry: https://github.com/infiniflow/ragflow/blob/94181a990b957ed302952b4de17583d2b44f3099/rag/llm/chat_model.py#L178
You can do a similar thing for embedding models.
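Applied to embedding models, the same retry-with-backoff idea could look roughly like the following. `encode_batch` is a placeholder for the provider's embedding request, and this is not the actual code in rag/llm/chat_model.py:

```python
# Sketch only: retry an embedding request with exponential backoff on
# rate-limit style failures. encode_batch is a hypothetical callable.
import random
import time


def embed_with_retry(encode_batch, texts, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return encode_batch(texts)
        except Exception as e:
            # Retry only on rate-limit style failures; re-raise anything else.
            if "rate limit" not in str(e).lower() or attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay * (2 ** attempt) + random.random())
```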