What happens when GPU KV cache usage reaches 100%?
As the title says, I have hit cases where the output is very long and GPU KV cache usage keeps increasing until it reaches 100%. At that point the model stops generating anything and stops accepting new requests. The only fix I have found is to restart the server.
I'm also experiencing this. At least in my case, if the number of prompt tokens exceeds num_gpu_blocks * block_size, the entire server stops working. Setting a smaller max_model_len seems to alleviate the problem: https://github.com/vllm-project/vllm/issues/1206#issuecomment-1752339047
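For reference, a minimal sketch of that workaround using the offline `LLM` entry point (the model name and the 2048 value are placeholders; pick a `max_model_len` that fits within the `num_gpu_blocks * block_size` capacity vLLM prints at startup):

```python
from vllm import LLM, SamplingParams

# Cap the context window so no single request can outgrow the KV cache
# (num_gpu_blocks * block_size, reported in the startup logs).
# Model name and 2048 are illustrative placeholders.
llm = LLM(model="facebook/opt-125m", max_model_len=2048)

params = SamplingParams(max_tokens=128)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```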
For now, we have found a workaround: set the swap space directly to 0. That way vLLM never uses the CPU swap space and no error is raised. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server no longer hangs and dies.
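In case it helps anyone, a minimal sketch of that setting (model name is a placeholder; `swap_space` is the per-GPU CPU swap space in GiB, default 4):

```python
from vllm import LLM

# swap_space=0 allocates no CPU blocks, so preempted sequences should
# be recomputed rather than swapped out, avoiding the swap-space errors.
# Model name is an illustrative placeholder.
llm = LLM(model="facebook/opt-125m", swap_space=0)
```

The OpenAI-compatible server accepts the same setting via the `--swap-space 0` flag.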