What happens when GPU KV cache usage reaches 100%?
As the title says, I have hit cases where the output is very long and GPU KV cache usage keeps increasing until it reaches 100%. At that point the model stops generating anything and stops accepting new requests. The only fix I have found is to restart the server.
I'm also experiencing this. At least in my case, if the number of prompt tokens exceeds num_gpu_blocks * block_size, the entire server stops working. Setting a smaller max_model_len seems to alleviate the problem: https://github.com/vllm-project/vllm/issues/1206#issuecomment-1752339047
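For reference, a minimal sketch of that workaround using the offline `LLM` entry point (the model name and the 2048 value are placeholders; pick a `max_model_len` that fits within the `num_gpu_blocks * block_size` capacity vLLM prints at startup):

```python
from vllm import LLM, SamplingParams

# Cap the context window so no single request can outgrow the KV cache
# (num_gpu_blocks * block_size, reported in the startup logs).
# Model name and 2048 are illustrative placeholders.
llm = LLM(model="facebook/opt-125m", max_model_len=2048)

params = SamplingParams(max_tokens=128)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```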
For now, we have found a workaround: set the swap space directly to 0. That way vLLM never uses the CPU swap space and no error is raised. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server no longer hangs and dies.
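In case it helps anyone, a minimal sketch of that setting (model name is a placeholder; `swap_space` is the per-GPU CPU swap space in GiB, default 4):

```python
from vllm import LLM

# swap_space=0 allocates no CPU blocks, so preempted sequences should
# be recomputed rather than swapped out, avoiding the swap-space errors.
# Model name is an illustrative placeholder.
llm = LLM(model="facebook/opt-125m", swap_space=0)
```

The OpenAI-compatible server accepts the same setting via the `--swap-space 0` flag.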