Stuck after GPU KV cache usage reaches 100.0%
After GPU KV cache usage reaches 100.0%, the server gets stuck and GPU utilization drops to 0, so it can no longer serve requests. Is there any way to fix this?
I ran into this problem too.
Please provide the OS, CUDA version, CPU, CPU RAM, GPU(s), GPU VRAM sizes, command line you started the vLLM with, model used and the full vLLM log output for diagnosis.
PS: Writing in English allows more contributors to help you.
Thanks. During inference, GPU KV cache usage climbs to 100.0%, then GPU utilization drops to 0 and the server can no longer serve requests. Is there any solution?
Here is the information you asked for; I hope it is useful.
OS: CentOS 7.7, CUDA version: 11.4, GPU: Tesla V100 32 GB. I start the API server for Baichuan2-13B-Chat with: CUDA_VISIBLE_DEVICES=0 python api_server.py --model /data0/models/Baichuan2-13B-Chat --host '0.0.0.0' --port 5000 --trust-remote-code --dtype half --max-num-batched-tokens 4000 --served-model-name Baichuan2-13B-Chat
Does the problem happen on the first request or only after doing inference tasks for a while?
Is this the vLLM API server or the OpenAI compatible one?
(Both of them are named api_server.py, but in different folders. Something to fix...)
Why do you need CUDA_VISIBLE_DEVICES=0?
Try changing these command line parameters:
- --block-size
- --swap-space
For example: --swap-space=8
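As a sketch of what I mean, the flags can simply be appended to your start command from above; the values below are illustrative starting points, not tuned recommendations:

```bash
# Same server command as before, plus an explicit block size and a larger CPU
# swap space. 32 and 8 GiB are arbitrary starting points, not verified fixes.
CUDA_VISIBLE_DEVICES=0 python api_server.py \
  --model /data0/models/Baichuan2-13B-Chat \
  --host '0.0.0.0' --port 5000 \
  --trust-remote-code --dtype half \
  --max-num-batched-tokens 4000 \
  --served-model-name Baichuan2-13B-Chat \
  --block-size 32 --swap-space 8
```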
Same thing for me: after I use vLLM for a while, the KV cache goes to 100% and the system freezes up. When does the cache get cleared? I am using the OpenAI connection.
Ubuntu 20.04.6 LTS, CUDA build cuda_11.8.r11.8/compiler.31833905_0, NVIDIA A10. Command: python -m vllm.entrypoints.openai.api_server --model TheBloke/Wizard-Vicuna-13B-Uncensored-AWQ --quantization awq --host 0.0.0.0 --gpu-memory-utilization 0.40
What about using early releases of vllm? v0.1.4 or earlier? @wgx7054
So is there any solution to this issue?
Whether it is the first request or one after running for a while, as soon as the GPU KV cache reaches 100% the service becomes unusable; this usually happens when the generated text is long.
The api_server I start merges the vLLM and OpenAI-compatible servers, but it doesn't matter which interface is used.
I specify the first GPU only to avoid using too many resources.
The --block-size parameter seems to default to 16 and --swap-space to 4; I did not specify either, so the defaults were used, yet the CPU KV cache usage stays at 0%.
That doesn't seem to work. Have you tried an older version yourself? When the prompt or the generated text is long, can the service continue once GPU cache usage rises to 100%?
The old version also has the same problem.
How do I use the CPU KV cache? Simply tweaking --swap-space won't solve the problem.
I guess the current solution is to carefully tune some parameters, like max_tokens or max_model_len. The max_model_len should probably be set smaller than the one specified in the hf-config.
The number of "vllm available gpu blocks" is related to max_num_batched_tokens and max_num_seqs, but these two variables only limit the prompt, i.e. the prefill phase, not the generation phase. If a prompt fits in GPU memory during prefill but runs out of memory during generation, vLLM will not handle it automatically (truncate, abort, or whatever) and just gets stuck.
And you should really use more GPU memory for longer text-generation tasks. Swap is a sequence-level operation: if there is only one sequence running and it still cannot finish due to GPU memory limits, the CPU cache won't help.
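As a rough sketch of that kind of tuning (the numbers are assumptions to show where the knobs go, not recommended values), using the OpenAI-compatible entrypoint and the Baichuan2 setup from earlier in the thread:

```bash
# Cap the context length below the hf-config value and bound the batch,
# so a long generation is less likely to outgrow the KV cache mid-request.
python -m vllm.entrypoints.openai.api_server \
  --model /data0/models/Baichuan2-13B-Chat \
  --trust-remote-code --dtype half \
  --max-model-len 2048 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 4000 \
  --gpu-memory-utilization 0.90
```

On the client side, capping max_tokens in each request is the other half of the same idea.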
I also ran into the same problem. How can it be solved? @viktor-ferenczi @WoosukKwon
Try to set --max-num-seqs=1 and see whether it fixes the problem. If yes, then double this value for better total throughput until it fails again. Then go back to the highest working value. It worked for me at least with vLLM 0.2.4 and 0.2.5.
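For example (a sketch only; the model path is just the one from earlier in this thread):

```bash
# Start with a single concurrent sequence; if the freeze goes away,
# retry with 2, 4, 8, ... and keep the highest value that stays stable.
python -m vllm.entrypoints.openai.api_server \
  --model /data0/models/Baichuan2-13B-Chat \
  --trust-remote-code --dtype half \
  --max-num-seqs 1
```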
For now we have found a workaround: set the swap space directly to 0. That way the CPU swap space is never used and no error is raised. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server no longer hangs and dies.
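Concretely, that just means appending the flag to the start command, e.g. (sketch based on the command from earlier in the thread; only the last flag is the workaround):

```bash
CUDA_VISIBLE_DEVICES=0 python api_server.py \
  --model /data0/models/Baichuan2-13B-Chat \
  --host '0.0.0.0' --port 5000 --trust-remote-code --dtype half \
  --max-num-batched-tokens 4000 --served-model-name Baichuan2-13B-Chat \
  --swap-space 0
```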
I tried the workaround of --swap-space=0, but it did not help at all: the same periodic freeze happens when the GPU KV cache is full. The only workaround is to detect the frozen vLLM server and restart it (keepalive script).
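The keepalive script is nothing more than a loop like the following sketch; the probed endpoint, timeout, and restart command are assumptions that need adapting to the actual deployment:

```bash
#!/usr/bin/env bash
# Keepalive sketch: probe the OpenAI-compatible server and restart it when it
# stops answering. The model flags are copied from a command earlier in the thread.
URL="http://localhost:8000/v1/models"
while true; do
  if ! curl -sf --max-time 30 "$URL" > /dev/null; then
    echo "$(date) vLLM unresponsive, restarting" >> keepalive.log
    pkill -f vllm.entrypoints.openai.api_server
    sleep 10
    nohup python -m vllm.entrypoints.openai.api_server \
      --model TheBloke/Wizard-Vicuna-13B-Uncensored-AWQ --quantization awq \
      --host 0.0.0.0 --gpu-memory-utilization 0.40 >> vllm.log 2>&1 &
  fi
  sleep 60
done
```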
I grabbed a flame graph, and the hotspot was gptq.py#apply_weights#ops.gptq_gemm. It may be beyond the GPU's CUDA compute capacity.
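(If anyone wants to capture a similar profile, one possible way is py-spy against the running server process; the PID is deployment-specific:)

```bash
# Sample the running vLLM process for 60 seconds and write a flame graph SVG.
py-spy record --pid <vllm_pid> --duration 60 -o vllm_flame.svg
```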
I met the same issue with offline batched inference: it gets stuck at the LLM() line and never continues. GPU memory is occupied, but GPU utilization is 0%.