vLLM stops all processing when CPU KV cache is used, has to be shut down and restarted.
Hi
The issue: with --swap-space X specified, as soon as the CPU KV cache is used, vLLM stops all processing. CPU and GPU usage go to 0%, and the request never returns. Any future requests are also never answered. There is no error.
I am testing the latest vLLM code (commit 6fc2a38) in a Docker container. I have experienced the issue since I first started using vLLM about 4 days ago, so it's not specific to the latest commits.
I am launching vLLM with the following args:
--model lmsys/vicuna-7b-v1.3 --host 0.0.0.0 --tokenizer hf-internal-testing/llama-tokenizer --swap-space 100
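For reference, the full launch command looks like this (assuming the standard OpenAI-compatible server entrypoint, python -m vllm.entrypoints.openai.api_server; adjust if your container uses a different entrypoint):

python -m vllm.entrypoints.openai.api_server \
    --model lmsys/vicuna-7b-v1.3 \
    --host 0.0.0.0 \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --swap-space 100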
I am currently testing on a 1 x 4090 system, but I have experienced it on all GPU types I've tried, including A6000 and H100.
The following test code will quickly trigger the issue on a 1 x 4090 system:
import time

import requests

# One request asking for n=125 completions of up to 950 tokens each.
# On a single 4090 this quickly exhausts the GPU KV cache, forcing the
# scheduler to swap sequences out to the CPU KV cache.
url = 'http://localhost:8000/v1/completions'
headers = {'Content-Type': 'application/json'}
data = {
    "model": "lmsys/vicuna-7b-v1.3",
    "prompt": " Write a story about a cat named George." * 40,
    "max_tokens": 950,
    "temperature": 0.7,
    "n": 125,
}

s = time.time()
response = requests.post(url, headers=headers, json=data)
print(time.time() - s)
Here's a screenshot demonstrating the issue:
In the screenshot you can see that only 7.9% of CPU KV cache is used, but this is enough to cause all processing to stop. The server will now never answer this request, and never answer any new requests either. It is effectively dead.
If I leave out --swap-space X, the server instead aborts with RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error. That abort is what I'm trying to avoid: it would be nice to be able to use CPU RAM as an overflow buffer, in case I occasionally exceed VRAM.
Thanks in advance.
I can also confirm that this issue persists with the default setting of 4 GB swap space, in both the first release version and the most recent versions.
I had the same problem. Did you solve it?
No, I'm not sure it's something we can solve ourselves. Might need a code fix.
What I am doing now, as a workaround, is running without --swap-space, combined with a monitoring script that restarts vLLM whenever it aborts with RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.
Not ideal at all, but it works for now.
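A minimal sketch of what such a watchdog can look like (hypothetical, not my exact script; it assumes the server process exits once the RuntimeError is raised, and simply relaunches it):

import subprocess
import time

# Launch command from this issue; adjust for your own setup.
CMD = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "lmsys/vicuna-7b-v1.3",
    "--host", "0.0.0.0",
    "--tokenizer", "hf-internal-testing/llama-tokenizer",
]

while True:
    # subprocess.run blocks until the server process exits
    # (e.g. after the swap-space RuntimeError kills it).
    proc = subprocess.run(CMD)
    print(f"vLLM exited with code {proc.returncode}, restarting in 5s...")
    time.sleep(5)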
Might be related: https://github.com/vllm-project/vllm/issues/667
https://github.com/vllm-project/vllm/blob/66c54aa9c33555a6b41421d57d3ad6c1bf004ec9/vllm/engine/async_llm_engine.py#L67-L75
Commenting out this await asyncio.sleep(0) seems to temporarily work around the hang.
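For context, await asyncio.sleep(0) is the standard way for a coroutine to yield control back to the event loop without actually sleeping, so removing it changes how the engine loop gets scheduled. A minimal illustration of the yield behaviour (not vLLM code):

import asyncio

async def worker(name):
    for i in range(3):
        print(f"{name} step {i}")
        # Yield to the event loop so other coroutines can run between steps.
        await asyncio.sleep(0)

async def main():
    # With the sleep(0) yields the two workers interleave; without any
    # await points, each one would run all its steps before the other.
    await asyncio.gather(worker("A"), worker("B"))

asyncio.run(main())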
Same issue. Cache fills up and then vLLM stops working.
this issue makes vllm impossible for production use
For now, we have found a workaround: set the swap space directly to 0. This way the CPU swap space is never used and no error is raised. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.
@TheBloke are you still experiencing this issue?
Wondering how I can set the swap space directly to 0, as in the workaround above?
--swap-space 0 (docs)
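If you're using the Python API instead of the server, the equivalent should be the swap_space engine argument (a sketch, assuming it is forwarded to the engine args as described in the docs above; the value is in GiB):

from vllm import LLM

# swap_space=0 disables the CPU KV-cache swap entirely.
llm = LLM(model="lmsys/vicuna-7b-v1.3", swap_space=0)
outputs = llm.generate("Write a story about a cat named George.")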
When using vLLM 0.5.1, setting swap_space=0 causes the process to terminate as soon as vLLM tries to preempt a sequence group, even though # CPU blocks is 0:
ERROR 07-11 00:49:14 async_llm_engine.py:53]     self._preempt_by_swap(seq_group, blocks_to_swap_out)
ERROR 07-11 00:49:14 async_llm_engine.py:53]   File "/opt/miniconda3/envs/working/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1145, in _preempt_by_swap
ERROR 07-11 00:49:14 async_llm_engine.py:53]     self._swap_out(seq_group, blocks_to_swap_out)
ERROR 07-11 00:49:14 async_llm_engine.py:53]   File "/opt/miniconda3/envs/working/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1165, in _swap_out
ERROR 07-11 00:49:14 async_llm_engine.py:53] RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.
Potentially this is a bug that's been fixed in BlockSpaceManagerV2?
You can enable it using --use-v2-block-manager
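That is, add --use-v2-block-manager to the server launch command, or, via the Python API (a sketch, assuming your vLLM version exposes the matching use_v2_block_manager engine argument, as 0.5.x does):

from vllm import LLM

# Enables BlockSpaceManagerV2 via the engine argument.
llm = LLM(model="lmsys/vicuna-7b-v1.3", use_v2_block_manager=True)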