[Core] Add retention policy code for processing requests
This pull request is intended to resolve concurrency problems when serving various LLMs.
It introduces two configurable environment variables:
- VLLM_ENGINE_MAX_CONCURRENT_REQUESTS: maximum number of concurrent requests to process at any time, if set.
- VLLM_ENGINE_MAX_REQUEST_LIFESPAN: maximum lifespan in seconds for any single request, if set.
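For illustration only, reading these variables could look like the sketch below; this is not the PR's actual code, and it simply treats an unset or empty variable as "no limit".

```python
import os
from typing import Optional


def _optional_int_env(name: str) -> Optional[int]:
    """Return the integer value of an environment variable, or None if unset/empty."""
    value = os.environ.get(name, "").strip()
    return int(value) if value else None


# Variable names come from this PR; both limits are disabled unless explicitly set.
MAX_CONCURRENT_REQUESTS = _optional_int_env("VLLM_ENGINE_MAX_CONCURRENT_REQUESTS")
MAX_REQUEST_LIFESPAN = _optional_int_env("VLLM_ENGINE_MAX_REQUEST_LIFESPAN")
```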
I use Mixtral-Instruct, which becomes unstable with certain trivial parameter configurations and generates nonsense for a long time until it hits the token limit.
With many LangChain integrations the connection is not aborted after an interruption, and even if the user refreshes the frontend, vLLM does not stop generation. This slows down overall generation and leaves the server occupied for nothing.
With this solution, long-running requests are killed whenever a new request arrives (soft limit), and the oldest running requests are killed when the request quota is exceeded (hard limit).
FIX #4240
I could have made the change in LLMEngine instead of AsyncLLMEngine, but since most concurrency issues occur in the API servers and the API servers use AsyncLLMEngine, this pull request should be sufficient for most use cases.
Users who invoke the LLM through libraries should abort unwanted tasks themselves, which offers better and more comprehensive control.
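To illustrate the soft/hard limit behavior described above, here is a minimal, self-contained sketch of such a retention policy. It is not the PR's actual implementation; the `abort` callback and the bookkeeping structures are hypothetical stand-ins for the engine's internal request tracking.

```python
import time
from typing import Callable, Dict, Optional


class RetentionPolicy:
    """Hypothetical sketch of the soft/hard limits described in this PR."""

    def __init__(self, abort: Callable[[str], None],
                 max_concurrent: Optional[int] = None,  # VLLM_ENGINE_MAX_CONCURRENT_REQUESTS
                 max_lifespan: Optional[float] = None):  # VLLM_ENGINE_MAX_REQUEST_LIFESPAN
        self.abort = abort                       # callback that cancels a request by id
        self.max_concurrent = max_concurrent
        self.max_lifespan = max_lifespan
        self.start_times: Dict[str, float] = {}  # request_id -> arrival time

    def on_new_request(self, request_id: str) -> None:
        now = time.monotonic()

        # Soft limit: whenever a new request arrives, kill requests that have
        # exceeded their allowed lifespan.
        if self.max_lifespan is not None:
            for rid, started in list(self.start_times.items()):
                if now - started > self.max_lifespan:
                    self.abort(rid)
                    del self.start_times[rid]

        # Hard limit: if we are still over the concurrency quota, kill the
        # oldest running requests until the new one fits.
        if self.max_concurrent is not None:
            while len(self.start_times) >= self.max_concurrent:
                oldest = min(self.start_times, key=self.start_times.get)
                self.abort(oldest)
                del self.start_times[oldest]

        self.start_times[request_id] = now

    def on_request_finished(self, request_id: str) -> None:
        self.start_times.pop(request_id, None)
```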
@James4Ever0 could you try your case again now that fix #4363 has been merged?
As of the time of my testing, this modification applied to vLLM version 0.3.2 has no effect on the issue in my environment.
My modification is applied to the file vllm/entrypoints/openai/serving_completion.py.
Meanwhile, my pull request is performing well apart from a minor logic error, which I will patch soon.
I will post future test results against the latest vLLM.
Which parameter in vLLM server corresponds to TGI's --max-concurrent-requests? When can the VLLM_ENGINE_MAX_CONCURRENT_REQUESTS parameter be used?
Modify the code yourself until this pull request is merged or reimplemented; the code structure has changed a lot, so check the diff for clues.
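Once the patch (or your own port of it) is applied, setting the variables when launching the OpenAI-compatible server could look like the sketch below. The limit values and model name are placeholders, not recommendations.

```python
import os
import subprocess

# Variable names come from this PR; the values here are placeholders.
env = dict(
    os.environ,
    VLLM_ENGINE_MAX_CONCURRENT_REQUESTS="8",  # cap on in-flight requests
    VLLM_ENGINE_MAX_REQUEST_LIFESPAN="120",   # seconds before a request may be killed
)

subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "mistralai/Mixtral-8x7B-Instruct-v0.1",
    ],
    env=env,
    check=True,
)
```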