I want to disable the KV cache. If I set gpu_memory_utilization to 0, does that mean the KV cache is disabled?
Hi @amulil, gpu_memory_utilization is the fraction of GPU memory that vLLM is allowed to use. vLLM uses it to store the model weights, allocate some workspace, and allocate the KV cache. If it is set too low, vLLM won't work at all and you will see errors. To avoid those errors and to get the best performance, set it as high as possible.
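For reference, a minimal sketch of how that knob is passed in (the model name here is just an example):

```python
# gpu_memory_utilization is a fraction of total GPU memory handed to vLLM
# for weights, workspace, and the KV cache.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.90)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```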
Does this mean the KV cache can't be turned off right now? My concern is that with the KV cache enabled, the model uses historical data to generate its answers every time, and I don't want that, so I'd like to turn the KV cache off. Is there any configuration option in vLLM that disables the KV cache? @WoosukKwon
In Hugging Face Transformers, the use_cache flag controls whether the KV cache is used when loading and running the model. That configuration option is currently unavailable in vLLM.
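For context, here is a sketch of the Hugging Face flag being discussed (gpt2 is just a placeholder model); note this is a Transformers option, not a vLLM one:

```python
# use_cache=False disables KV caching during Hugging Face generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("The KV cache stores", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16, use_cache=False)
print(tok.decode(out[0], skip_special_tokens=True))
```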
@amulil vLLM does not support use_cache=False. I believe there is no reason to disable the KV cache, because it is a pure optimization that significantly reduces the FLOPs of generation.
And please note that enabling the KV cache never affects your model outputs.
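As a rough, back-of-envelope illustration of the savings (assumed numbers, counting token forward passes rather than exact FLOPs):

```python
# Without a KV cache, every decoding step re-processes the whole sequence so
# far; with a cache, each token is processed only once.
def token_forward_passes(n_new_tokens: int, prompt_len: int, use_cache: bool) -> int:
    if use_cache:
        # Prompt is processed once, then each new token is processed once.
        return prompt_len + n_new_tokens
    # No cache: step t re-runs the model over prompt_len + t tokens.
    return sum(prompt_len + t for t in range(1, n_new_tokens + 1))

print(token_forward_passes(128, 512, use_cache=True))   # 640
print(token_forward_passes(128, 512, use_cache=False))  # 73792
```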
I tested the Hugging Face model with use_cache. With use_cache=True, the output is the same if I type the same thing multiple times in a multi-turn conversation, but with use_cache=False the output is different even if I type the same thing multiple times.
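One way to separate the cache from sampling randomness is to compare greedy decoding with and without the cache; a sketch, again assuming a placeholder gpt2 model:

```python
# With greedy decoding (do_sample=False), use_cache=True and use_cache=False
# should produce the same text up to numerical noise.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

inputs = tok("The KV cache stores", return_tensors="pt")
with torch.no_grad():
    cached = model.generate(**inputs, max_new_tokens=32, do_sample=False, use_cache=True)
    uncached = model.generate(**inputs, max_new_tokens=32, do_sample=False, use_cache=False)

print(tok.decode(cached[0]) == tok.decode(uncached[0]))  # expected: True
```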
@amulil KV caching is used for a different purpose in vLLM compared with Hugging Face caching. And please note that enabling the KV cache never affects your model outputs.
@WoosukKwon I'm getting different output almost 10-15% of the time when the KV cache is enabled with my fine-tuned LLM. The output is correct whenever I run inference without any optimization technique. Is there a way to disable KV caching in vLLM?
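Before attributing the variation to the cache, it may help to rule out sampling randomness; a sketch with greedy decoding (temperature=0), using a placeholder model path:

```python
# temperature=0 makes vLLM decode greedily, so repeated identical prompts
# should produce identical outputs.
from vllm import LLM, SamplingParams

llm = LLM(model="your-finetuned-model")  # placeholder path
params = SamplingParams(temperature=0, max_tokens=64)
outs = [llm.generate(["Same prompt every time"], params)[0].outputs[0].text for _ in range(3)]
print(len(set(outs)) == 1)  # expected: True with greedy decoding
```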
@WoosukKwon, can we clear the KV cache on the GPU after prompting? I see that GPU memory usage keeps increasing with every prompt. Is there any other solution for this?
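I'm not aware of a dedicated "clear the KV cache" call; a common workaround sketch is to tear down the engine and release cached allocator memory between workloads:

```python
# Workaround sketch, not an official vLLM API for clearing the cache:
# drop the engine object, then release PyTorch's cached GPU memory.
import gc
import torch
from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # example model
# ... run some prompts ...

del llm                   # frees the engine and its KV cache blocks
gc.collect()
torch.cuda.empty_cache()  # returns cached memory to the driver
```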
Folks, clearing (or disabling) the KV cache can be an optimization as well, if it saves reloading the engine for every config change when tuning latency and throughput, right? Please consider the feature request, thank you.