
vLLM stops all processing when the CPU KV cache is used, and has to be shut down and restarted.

Open TheBloke opened this issue 1 year ago • 12 comments

Hi

The issue: with --swap-space X specified, as soon as CPU KV cache is used, vLLM stops all processing. CPU and GPU usage go to 0%, and the request never returns. Any future requests are also never answered. There is no error.

I am testing the latest vLLM code (commit 6fc2a38) in a Docker container. I have experienced the issue since I first started using vLLM about 4 days ago, so it's not specific to the latest commits.

I am launching vLLM with the following args:

```shell
--model lmsys/vicuna-7b-v1.3 --host 0.0.0.0 --tokenizer hf-internal-testing/llama-tokenizer --swap-space 100
```

I am currently testing on a 1 x 4090 system, but I have experienced it on all GPU types I've tried, including A6000 and H100.

The following test code will quickly trigger the issue on a 1 x 4090 system:

```python
import time
import requests

# Send one completion request with a long prompt and n=125 samples,
# enough to overflow the GPU KV cache on a single 4090.
url = "http://localhost:8000/v1/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "lmsys/vicuna-7b-v1.3",
    "prompt": " Write a story about a cat named George." * 40,
    "max_tokens": 950,
    "temperature": 0.7,
    "n": 125,
}
start = time.time()
response = requests.post(url, headers=headers, json=data)
print(time.time() - start)
```

Here's a screenshot demonstrating the issue: [screenshot: vLLM log output showing 7.9% CPU KV cache usage while CPU and GPU utilization sit at 0%]

In the screenshot you can see that only 7.9% of CPU KV cache is used, but this is enough to cause all processing to stop. The server will now never answer this request, and never answer any new requests either. It is effectively dead.

If I leave out `--swap-space X`, the server aborts with `RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.`, which is what I'm trying to avoid. It would be nice to be able to use CPU RAM as an overflow buffer for the occasions when I exceed VRAM.

Thanks in advance.

TheBloke avatar Jul 21 '23 17:07 TheBloke

I can confirm that this issue also occurs with the default 4 GB swap space, in both the first release and the most recent versions.

syskn avatar Jul 22 '23 01:07 syskn

I had the same problem, did you solve it?

Lawliet-Xie avatar Jul 27 '23 09:07 Lawliet-Xie

No, I'm not sure it's something we can solve ourselves. Might need a code fix.

What I am doing now, as a workaround, is running without `--swap-space`, with a monitoring script that restarts vLLM whenever it aborts with `RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.`

Not ideal at all but it works for now.
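A minimal sketch of such a monitoring script, in case it helps anyone. This is a hypothetical reconstruction, not the exact script I run: the server command in the example and the idea of matching on the abort message are assumptions based on the behaviour described above.

```python
import subprocess
import sys

# Hypothetical watchdog sketch (not part of vLLM): relaunch the server
# whenever it dies with the CPU-swap-space abort.

ABORT_MARKER = "Aborted due to the lack of CPU swap space"


def should_restart(log_line: str) -> bool:
    """Return True if this log line is the known swap-space abort."""
    return ABORT_MARKER in log_line


def run_watchdog(cmd):
    """Relaunch `cmd` forever, restarting it whenever the abort appears."""
    while True:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # mirror the server's log
            if should_restart(line):
                proc.kill()  # dead server: kill and loop to restart
                break
        proc.wait()


# Example invocation (not executed here; command is an assumption):
# run_watchdog(["python", "-m", "vllm.entrypoints.openai.api_server",
#               "--model", "lmsys/vicuna-7b-v1.3"])
```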

TheBloke avatar Jul 27 '23 10:07 TheBloke

Might be related: https://github.com/vllm-project/vllm/issues/667

syskn avatar Aug 06 '23 08:08 syskn

https://github.com/vllm-project/vllm/blob/66c54aa9c33555a6b41421d57d3ad6c1bf004ec9/vllm/engine/async_llm_engine.py#L67-L75

Commenting out this `await asyncio.sleep(0)` seems to temporarily work around the hang.
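For context, `await asyncio.sleep(0)` is how a coroutine hands control back to the event loop between iterations. A toy illustration of what that yield changes in scheduling (this is generic asyncio behaviour, not vLLM code; the coroutine names are made up):

```python
import asyncio

# Toy model: an "engine" loop that may or may not yield between steps,
# and a single queued "request" task waiting for a turn on the loop.

async def engine_loop(n, cooperative, log):
    for _ in range(n):
        log.append("engine")
        if cooperative:
            await asyncio.sleep(0)  # yield so other tasks get scheduled

async def request(log):
    log.append("request")

async def main(cooperative):
    log = []
    t1 = asyncio.create_task(engine_loop(3, cooperative, log))
    t2 = asyncio.create_task(request(log))
    await asyncio.gather(t1, t2)
    return log

# With the yield, the request interleaves with engine steps;
# without it, the engine runs all its steps before the request is served.
coop = asyncio.run(main(True))
greedy = asyncio.run(main(False))
print(coop, greedy)
```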

desperadoola avatar Aug 10 '23 07:08 desperadoola

Same issue. Cache fills up and then vLLM stops working.

SatoshiReport avatar Sep 29 '23 23:09 SatoshiReport

This issue makes vLLM impossible for production use.

tydia avatar Sep 30 '23 06:09 tydia

> This issue makes vLLM impossible for production use.

For now we have found a workaround: set the swap space directly to 0. vLLM then never touches the CPU swap space and no longer raises the error. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.

chi2liu avatar Jan 04 '24 07:01 chi2liu

@TheBloke are you still experiencing this issue?

hmellor avatar Apr 03 '24 15:04 hmellor

> This issue makes vLLM impossible for production use.
>
> For now we have found a workaround: set the swap space directly to 0. vLLM then never touches the CPU swap space and no longer raises the error. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.

Wondering how can I set the swap space directly to 0?

shyringo avatar Apr 23 '24 09:04 shyringo

Use `--swap-space 0` (see the docs).
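For example, a sketch assuming the OpenAI-compatible server entrypoint (model name from the original report):

```shell
# Disable the CPU swap space entirely so the scheduler never tries to
# swap blocks out to CPU RAM.
python -m vllm.entrypoints.openai.api_server \
    --model lmsys/vicuna-7b-v1.3 \
    --swap-space 0
```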

hmellor avatar Apr 25 '24 17:04 hmellor

> This issue makes vLLM impossible for production use.
>
> For now we have found a workaround: set the swap space directly to 0. vLLM then never touches the CPU swap space and no longer raises the error. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.

When using vLLM 0.5.1, setting `swap_space=0` causes the process to terminate as soon as vLLM tries to preempt a sequence group, even though `# CPU blocks` is 0:

```
ERROR 07-11 00:49:14 async_llm_engine.py:53]     self._preempt_by_swap(seq_group, blocks_to_swap_out)
ERROR 07-11 00:49:14 async_llm_engine.py:53]   File "/opt/miniconda3/envs/working/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1145, in _preempt_by_swap
ERROR 07-11 00:49:14 async_llm_engine.py:53]     self._swap_out(seq_group, blocks_to_swap_out)
ERROR 07-11 00:49:14 async_llm_engine.py:53]   File "/opt/miniconda3/envs/working/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1165, in _swap_out
ERROR 07-11 00:49:14 async_llm_engine.py:53] RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.
```

haoxiongliu avatar Jul 10 '24 17:07 haoxiongliu

Potentially this is a bug that's been fixed in BlockSpaceManagerV2?

You can enable it using `--use-v2-block-manager`.
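A sketch of how that might look, assuming the OpenAI-compatible server entrypoint and the model from the original report (swap-space value chosen arbitrarily):

```shell
# Opt in to the v2 block manager (BlockSpaceManagerV2) while keeping a
# CPU swap space, to test whether the swap deadlock is fixed.
python -m vllm.entrypoints.openai.api_server \
    --model lmsys/vicuna-7b-v1.3 \
    --swap-space 4 \
    --use-v2-block-manager
```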

hmellor avatar Aug 02 '24 17:08 hmellor