Stuck after GPU KV cache usage reaches 100.0%
After GPU KV cache usage reaches 100.0%, the server gets stuck and GPU utilization drops to 0, so it can no longer serve requests. Is there any way to fix this?
I ran into this problem too.
Please provide the OS, CUDA version, CPU, CPU RAM, GPU(s), GPU VRAM sizes, command line you started the vLLM with, model used and the full vLLM log output for diagnosis.
PS: Writing in English allows more contributors to help you.
Thanks. During inference, GPU KV cache usage climbs to 100.0%, then GPU utilization drops to 0 and the server can no longer serve requests. Is there any solution?
Here is the information you asked for; I hope it is useful.
OS: CentOS 7.7, CUDA version: 11.4, GPU: Tesla V100 32 GB. I start the API server for Baichuan2-13B-Chat with: CUDA_VISIBLE_DEVICES=0 python api_server.py --model /data0/models/Baichuan2-13B-Chat --host '0.0.0.0' --port 5000 --trust-remote-code --dtype half --max-num-batched-tokens 4000 --served-model-name Baichuan2-13B-Chat
Does the problem happen on the first request or only after doing inference tasks for a while?
Is this the vLLM API server or the OpenAI compatible one?
(Both of them are named api_server.py, but in different folders. Something to fix...)
Why do you need CUDA_VISIBLE_DEVICES=0?
Try changing these command line parameters:
- --block-size
- --swap-space
For example: --swap-space=8
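As a sketch of what I mean, the flags can simply be appended to your start command from above; the values below are illustrative starting points, not tuned recommendations:

```bash
# Same server command as before, plus an explicit block size and a larger CPU
# swap space. 32 and 8 GiB are arbitrary starting points, not verified fixes.
CUDA_VISIBLE_DEVICES=0 python api_server.py \
  --model /data0/models/Baichuan2-13B-Chat \
  --host '0.0.0.0' --port 5000 \
  --trust-remote-code --dtype half \
  --max-num-batched-tokens 4000 \
  --served-model-name Baichuan2-13B-Chat \
  --block-size 32 --swap-space 8
```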
Same thing for me: after I use vLLM for a while, the KV cache goes to 100% and the system freezes up. When does the cache get cleared? I am using the OpenAI connection.
Ubuntu 20.04.6 LTS, CUDA build cuda_11.8.r11.8/compiler.31833905_0, NVIDIA A10. Command: python -m vllm.entrypoints.openai.api_server --model TheBloke/Wizard-Vicuna-13B-Uncensored-AWQ --quantization awq --host 0.0.0.0 --gpu-memory-utilization 0.40
What about using early releases of vllm? v0.1.4 or earlier? @wgx7054
So is there any solution to this issue?
Whether it is the first request or one after running for a while, as soon as the GPU KV cache reaches 100% the service becomes unusable; this usually happens when the generated text is long.
The api_server I start merges the vLLM and OpenAI-compatible servers, but it doesn't matter which interface is used.
I specify the first GPU only to avoid using too many resources.
The --block-size parameter seems to default to 16 and --swap-space to 4; I did not specify either, so the defaults were used, yet the CPU KV cache usage stays at 0%.
That doesn't seem to work. Have you tried an older version yourself? When the prompt or the generated text is long, can the service continue once GPU cache usage rises to 100%?
The old version also has the same problem.
How do I use the CPU KV cache? Simply tweaking --swap-space won't solve the problem.
I guess the current solution is to carefully tune some parameters, like max_tokens or max_model_len. The max_model_len should probably be set smaller than the one specified in the hf-config.
The number of "vllm available gpu blocks" is related to max_num_batched_tokens and max_num_seqs, but these two variables only limit the prompt, i.e. the prefill phase, not the generation phase. If a prompt fits in GPU memory during prefill but runs out of memory during generation, vLLM will not handle it automatically (truncate, abort, or whatever) and just gets stuck.
And you should really use more GPU memory for longer text-generation tasks. Swap is a sequence-level operation: if there is only one sequence running and it still cannot finish due to GPU memory limits, the CPU cache won't help.
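As a rough sketch of that kind of tuning (the numbers are assumptions to show where the knobs go, not recommended values), using the OpenAI-compatible entrypoint and the Baichuan2 setup from earlier in the thread:

```bash
# Cap the context length below the hf-config value and bound the batch,
# so a long generation is less likely to outgrow the KV cache mid-request.
python -m vllm.entrypoints.openai.api_server \
  --model /data0/models/Baichuan2-13B-Chat \
  --trust-remote-code --dtype half \
  --max-model-len 2048 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 4000 \
  --gpu-memory-utilization 0.90
```

On the client side, capping max_tokens in each request is the other half of the same idea.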
I also ran into the same problem. How can it be solved? @viktor-ferenczi @WoosukKwon
Try to set --max-num-seqs=1 and see whether it fixes the problem. If yes, then double this value for better total throughput until it fails again. Then go back to the highest working value. It worked for me at least with vLLM 0.2.4 and 0.2.5.
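For example (a sketch only; the model path is just the one from earlier in this thread):

```bash
# Start with a single concurrent sequence; if the freeze goes away,
# retry with 2, 4, 8, ... and keep the highest value that stays stable.
python -m vllm.entrypoints.openai.api_server \
  --model /data0/models/Baichuan2-13B-Chat \
  --trust-remote-code --dtype half \
  --max-num-seqs 1
```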
For now we have found a workaround: set the swap space directly to 0. That way the CPU swap space is never used and no error is raised. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server no longer hangs and dies.
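Concretely, that just means appending the flag to the start command, e.g. (sketch based on the command from earlier in the thread; only the last flag is the workaround):

```bash
CUDA_VISIBLE_DEVICES=0 python api_server.py \
  --model /data0/models/Baichuan2-13B-Chat \
  --host '0.0.0.0' --port 5000 --trust-remote-code --dtype half \
  --max-num-batched-tokens 4000 --served-model-name Baichuan2-13B-Chat \
  --swap-space 0
```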
I tried the workaround of --swap-space=0, but it did not help at all: the same periodic freeze happens when the GPU KV cache is full. The only workaround is to detect the frozen vLLM server and restart it (keepalive script).
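The keepalive script is nothing more than a loop like the following sketch; the probed endpoint, timeout, and restart command are assumptions that need adapting to the actual deployment:

```bash
#!/usr/bin/env bash
# Keepalive sketch: probe the OpenAI-compatible server and restart it when it
# stops answering. The model flags are copied from a command earlier in the thread.
URL="http://localhost:8000/v1/models"
while true; do
  if ! curl -sf --max-time 30 "$URL" > /dev/null; then
    echo "$(date) vLLM unresponsive, restarting" >> keepalive.log
    pkill -f vllm.entrypoints.openai.api_server
    sleep 10
    nohup python -m vllm.entrypoints.openai.api_server \
      --model TheBloke/Wizard-Vicuna-13B-Uncensored-AWQ --quantization awq \
      --host 0.0.0.0 --gpu-memory-utilization 0.40 >> vllm.log 2>&1 &
  fi
  sleep 60
done
```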
I grabbed a flame graph, and the hotspot was gptq.py#apply_weights#ops.gptq_gemm. It may be beyond the GPU's CUDA compute capacity.
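(If anyone wants to capture a similar profile, one possible way is py-spy against the running server process; the PID is deployment-specific:)

```bash
# Sample the running vLLM process for 60 seconds and write a flame graph SVG.
py-spy record --pid <vllm_pid> --duration 60 -o vllm_flame.svg
```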
I met the same issue with offline batched inference: it gets stuck at the LLM() line and never continues. GPU memory is occupied, but GPU utilization is 0%.