The async_llm_engine may have a resource leak when using streaming
Look at this: the output below repeats for half an hour and never stops, but nothing is generated. The new request stays pending.
Token indices sequence length is longer than the specified maximum sequence length for this model (2620 > 2048). Running this sequence through the model will result in indexing errors
INFO 06-30 12:07:20 scheduler.py:254] Throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 06-30 12:07:25 scheduler.py:254] Throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 06-30 12:07:30 scheduler.py:254] Throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 06-30 12:07:35 scheduler.py:254] Throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 1 reqs,
This also leaks resources on the server.
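For anyone hitting this before the fix: one possible mitigation is to abort the request explicitly when the stream consumer stops. Below is a minimal sketch, assuming the AsyncLLMEngine.generate / AsyncLLMEngine.abort API from the vLLM version this issue was filed against; the model name and surrounding usage are illustrative only, not the upstream fix.

```python
# Workaround sketch (not the upstream fix): explicitly abort the request
# when the stream consumer exits, so an abandoned or never-scheduled
# request does not stay in the pending queue forever.
# Assumes the AsyncLLMEngine.generate/.abort API of the vLLM version this
# issue was filed against; "facebook/opt-125m" is just a placeholder model.
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))


async def stream(prompt: str) -> None:
    request_id = random_uuid()
    try:
        # generate() yields RequestOutput objects as tokens are produced.
        async for output in engine.generate(prompt, SamplingParams(max_tokens=128),
                                            request_id):
            print(output.outputs[0].text)
    finally:
        # Runs when the client disconnects or the generator is closed:
        # remove the request from the scheduler instead of leaking it.
        await engine.abort(request_id)


asyncio.run(stream("Hello, my name is"))
```

The same pattern should apply inside an API server handler: when the client disconnects, the async generator is closed, the finally block runs, and the request is dropped from the pending queue instead of lingering there.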
Closing as this should be fixed as mentioned in https://github.com/vllm-project/vllm/pull/325#issuecomment-1716627115