[Core] Add retention policy code for processing requests
This pull request is intended to resolve concurrency problems when serving various LLMs.
It introduces two configurable environment variables:
- VLLM_ENGINE_MAX_CONCURRENT_REQUESTS: maximum number of concurrent requests to process at any time, if set.
- VLLM_ENGINE_MAX_REQUEST_LIFESPAN: maximum lifespan in seconds for any single request, if set.
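For illustration only, reading these variables could look like the sketch below; this is not the PR's actual code, and it simply treats an unset or empty variable as "no limit".

```python
import os
from typing import Optional


def _optional_int_env(name: str) -> Optional[int]:
    """Return the integer value of an environment variable, or None if unset/empty."""
    value = os.environ.get(name, "").strip()
    return int(value) if value else None


# Variable names come from this PR; both limits are disabled unless explicitly set.
MAX_CONCURRENT_REQUESTS = _optional_int_env("VLLM_ENGINE_MAX_CONCURRENT_REQUESTS")
MAX_REQUEST_LIFESPAN = _optional_int_env("VLLM_ENGINE_MAX_REQUEST_LIFESPAN")
```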
I use Mixtral-Instruct, which becomes unstable with certain trivial parameter configurations and generates nonsense for a long time until it hits the token limit.
With many LangChain integrations the connection is not aborted after an interruption, and even if the user refreshes the frontend, vLLM does not stop generation. This slows down overall generation and leaves the server occupied for nothing.
With this solution, long-running requests are killed whenever a new request arrives (soft limit), and the oldest running requests are killed when the request quota is exceeded (hard limit).
FIX #4240
I could have made the change in LLMEngine instead of AsyncLLMEngine, but since most concurrency issues occur in the API servers and the API servers use AsyncLLMEngine, this pull request should be sufficient for most use cases.
Users who invoke the LLM through libraries should abort unwanted tasks themselves, which offers better and more comprehensive control.
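To illustrate the soft/hard limit behavior described above, here is a minimal, self-contained sketch of such a retention policy. It is not the PR's actual implementation; the `abort` callback and the bookkeeping structures are hypothetical stand-ins for the engine's internal request tracking.

```python
import time
from typing import Callable, Dict, Optional


class RetentionPolicy:
    """Hypothetical sketch of the soft/hard limits described in this PR."""

    def __init__(self, abort: Callable[[str], None],
                 max_concurrent: Optional[int] = None,  # VLLM_ENGINE_MAX_CONCURRENT_REQUESTS
                 max_lifespan: Optional[float] = None):  # VLLM_ENGINE_MAX_REQUEST_LIFESPAN
        self.abort = abort                       # callback that cancels a request by id
        self.max_concurrent = max_concurrent
        self.max_lifespan = max_lifespan
        self.start_times: Dict[str, float] = {}  # request_id -> arrival time

    def on_new_request(self, request_id: str) -> None:
        now = time.monotonic()

        # Soft limit: whenever a new request arrives, kill requests that have
        # exceeded their allowed lifespan.
        if self.max_lifespan is not None:
            for rid, started in list(self.start_times.items()):
                if now - started > self.max_lifespan:
                    self.abort(rid)
                    del self.start_times[rid]

        # Hard limit: if we are still over the concurrency quota, kill the
        # oldest running requests until the new one fits.
        if self.max_concurrent is not None:
            while len(self.start_times) >= self.max_concurrent:
                oldest = min(self.start_times, key=self.start_times.get)
                self.abort(oldest)
                del self.start_times[oldest]

        self.start_times[request_id] = now

    def on_request_finished(self, request_id: str) -> None:
        self.start_times.pop(request_id, None)
```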
@James4Ever0 could you try your case again now that fix #4363 has been merged?
As of the time of my testing, this modification applied to vLLM version 0.3.2 has no effect on the issue in my environment.
My modification is applied to the file vllm/entrypoints/openai/serving_completion.py.
Meanwhile, my pull request is performing well apart from a minor logic error, which I will patch soon.
I will post future test results against the latest vLLM.
Which parameter in vLLM server corresponds to TGI's --max-concurrent-requests? When can the VLLM_ENGINE_MAX_CONCURRENT_REQUESTS parameter be used?
Modify the code yourself until this pull request is merged or reimplemented; the code structure has changed a lot, so check the diff for clues.
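Once the patch (or your own port of it) is applied, setting the variables when launching the OpenAI-compatible server could look like the sketch below. The limit values and model name are placeholders, not recommendations.

```python
import os
import subprocess

# Variable names come from this PR; the values here are placeholders.
env = dict(
    os.environ,
    VLLM_ENGINE_MAX_CONCURRENT_REQUESTS="8",  # cap on in-flight requests
    VLLM_ENGINE_MAX_REQUEST_LIFESPAN="120",   # seconds before a request may be killed
)

subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "mistralai/Mixtral-8x7B-Instruct-v0.1",
    ],
    env=env,
    check=True,
)
```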