vLLM stops all processing when CPU KV cache is used, has to be shut down and restarted.
Hi
The issue: with --swap-space X specified, as soon as the CPU KV cache is used, vLLM stops all processing. CPU and GPU usage go to 0%, and the request never returns. Any future requests are also never answered. There is no error.
I am testing the latest vLLM code (commit 6fc2a38) in a Docker container. I have experienced the issue since I first started using vLLM about 4 days ago, so it's not specific to the latest commits.
I am launching vLLM with the following args:
--model lmsys/vicuna-7b-v1.3 --host 0.0.0.0 --tokenizer hf-internal-testing/llama-tokenizer --swap-space 100
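For reference, the full launch command looks like this (assuming the standard OpenAI-compatible server entrypoint, python -m vllm.entrypoints.openai.api_server; adjust if your container uses a different entrypoint):

python -m vllm.entrypoints.openai.api_server \
    --model lmsys/vicuna-7b-v1.3 \
    --host 0.0.0.0 \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --swap-space 100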
I am currently testing on a 1 x 4090 system, but I have experienced it on all GPU types I've tried, including A6000 and H100.
The following test code will quickly trigger the issue on a 1 x 4090 system:
import time

import requests

# One request asking for n=125 completions of up to 950 tokens each.
# On a single 4090 this quickly exhausts the GPU KV cache, forcing the
# scheduler to swap sequences out to the CPU KV cache.
url = 'http://localhost:8000/v1/completions'
headers = {'Content-Type': 'application/json'}
data = {
    "model": "lmsys/vicuna-7b-v1.3",
    "prompt": " Write a story about a cat named George." * 40,
    "max_tokens": 950,
    "temperature": 0.7,
    "n": 125,
}

s = time.time()
response = requests.post(url, headers=headers, json=data)
print(time.time() - s)
Here's a screenshot demonstrating the issue:
In the screenshot you can see that only 7.9% of CPU KV cache is used, but this is enough to cause all processing to stop. The server will now never answer this request, and never answer any new requests either. It is effectively dead.
If I leave out --swap-space X, the server instead aborts with RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error. That abort is what I'm trying to avoid: it would be nice to be able to use CPU RAM as an overflow buffer, in case I occasionally exceed VRAM.
Thanks in advance.
I can also confirm that this issue persists with the default setting of 4 GB swap space, in both the first release version and the most recent versions.
I had the same problem. Did you solve it?
No, I'm not sure it's something we can solve ourselves. Might need a code fix.
What I am doing now, as a workaround, is running without --swap-space, combined with a monitoring script that restarts vLLM whenever it aborts with RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.
Not ideal at all, but it works for now.
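A minimal sketch of what such a watchdog can look like (hypothetical, not my exact script; it assumes the server process exits once the RuntimeError is raised, and simply relaunches it):

import subprocess
import time

# Launch command from this issue; adjust for your own setup.
CMD = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "lmsys/vicuna-7b-v1.3",
    "--host", "0.0.0.0",
    "--tokenizer", "hf-internal-testing/llama-tokenizer",
]

while True:
    # subprocess.run blocks until the server process exits
    # (e.g. after the swap-space RuntimeError kills it).
    proc = subprocess.run(CMD)
    print(f"vLLM exited with code {proc.returncode}, restarting in 5s...")
    time.sleep(5)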
Might be related: https://github.com/vllm-project/vllm/issues/667
https://github.com/vllm-project/vllm/blob/66c54aa9c33555a6b41421d57d3ad6c1bf004ec9/vllm/engine/async_llm_engine.py#L67-L75
Commenting out this await asyncio.sleep(0) seems to temporarily work around the hang.
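For context, await asyncio.sleep(0) is the standard way for a coroutine to yield control back to the event loop without actually sleeping, so removing it changes how the engine loop gets scheduled. A minimal illustration of the yield behaviour (not vLLM code):

import asyncio

async def worker(name):
    for i in range(3):
        print(f"{name} step {i}")
        # Yield to the event loop so other coroutines can run between steps.
        await asyncio.sleep(0)

async def main():
    # With the sleep(0) yields the two workers interleave; without any
    # await points, each one would run all its steps before the other.
    await asyncio.gather(worker("A"), worker("B"))

asyncio.run(main())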
Same issue. Cache fills up and then vLLM stops working.
this issue makes vllm impossible for production use
For now, we have found a workaround: set the swap space directly to 0. This way the CPU swap space is never used and no error is raised. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.
@TheBloke are you still experiencing this issue?
Wondering how I can set the swap space directly to 0, as in the workaround above?
--swap-space 0 (docs)
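If you're using the Python API instead of the server, the equivalent should be the swap_space engine argument (a sketch, assuming it is forwarded to the engine args as described in the docs above; the value is in GiB):

from vllm import LLM

# swap_space=0 disables the CPU KV-cache swap entirely.
llm = LLM(model="lmsys/vicuna-7b-v1.3", swap_space=0)
outputs = llm.generate("Write a story about a cat named George.")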
When using vLLM 0.5.1, setting swap_space=0 causes the process to terminate as soon as vLLM tries to preempt a sequence group, even though # CPU blocks is 0:
ERROR 07-11 00:49:14 async_llm_engine.py:53]     self._preempt_by_swap(seq_group, blocks_to_swap_out)
ERROR 07-11 00:49:14 async_llm_engine.py:53]   File "/opt/miniconda3/envs/working/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1145, in _preempt_by_swap
ERROR 07-11 00:49:14 async_llm_engine.py:53]     self._swap_out(seq_group, blocks_to_swap_out)
ERROR 07-11 00:49:14 async_llm_engine.py:53]   File "/opt/miniconda3/envs/working/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1165, in _swap_out
ERROR 07-11 00:49:14 async_llm_engine.py:53] RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.
Potentially this is a bug that's been fixed in BlockSpaceManagerV2?
You can enable it using --use-v2-block-manager
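That is, add --use-v2-block-manager to the server launch command, or, via the Python API (a sketch, assuming your vLLM version exposes the matching use_v2_block_manager engine argument, as 0.5.x does):

from vllm import LLM

# Enables BlockSpaceManagerV2 via the engine argument.
llm = LLM(model="lmsys/vicuna-7b-v1.3", use_v2_block_manager=True)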