The async_llm_engine may have a resource leak when using streaming
Look at this: the output below repeats for half an hour and never stops, but nothing is generated. The new request stays pending.
Token indices sequence length is longer than the specified maximum sequence length for this model (2620 > 2048). Running this sequence through the model will result in indexing errors
INFO 06-30 12:07:20 scheduler.py:254] Throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 06-30 12:07:25 scheduler.py:254] Throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 06-30 12:07:30 scheduler.py:254] Throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 06-30 12:07:35 scheduler.py:254] Throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 1 reqs,
This also leaks resources on the server.
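For anyone hitting this before the fix: one possible mitigation is to abort the request explicitly when the stream consumer stops. Below is a minimal sketch, assuming the AsyncLLMEngine.generate / AsyncLLMEngine.abort API from the vLLM version this issue was filed against; the model name and surrounding usage are illustrative only, not the upstream fix.

```python
# Workaround sketch (not the upstream fix): explicitly abort the request
# when the stream consumer exits, so an abandoned or never-scheduled
# request does not stay in the pending queue forever.
# Assumes the AsyncLLMEngine.generate/.abort API of the vLLM version this
# issue was filed against; "facebook/opt-125m" is just a placeholder model.
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))


async def stream(prompt: str) -> None:
    request_id = random_uuid()
    try:
        # generate() yields RequestOutput objects as tokens are produced.
        async for output in engine.generate(prompt, SamplingParams(max_tokens=128),
                                            request_id):
            print(output.outputs[0].text)
    finally:
        # Runs when the client disconnects or the generator is closed:
        # remove the request from the scheduler instead of leaking it.
        await engine.abort(request_id)


asyncio.run(stream("Hello, my name is"))
```

The same pattern should apply inside an API server handler: when the client disconnects, the async generator is closed, the finally block runs, and the request is dropped from the pending queue instead of lingering there.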
Closing as this should be fixed as mentioned in https://github.com/vllm-project/vllm/pull/325#issuecomment-1716627115