
I want to disable the KV cache. If I set gpu_memory_utilization to 0, does that mean the KV cache is disabled?


amulil avatar Aug 30 '23 11:08 amulil

Hi @amulil, gpu_memory_utilization is the fraction of GPU memory that vLLM is allowed to use. vLLM uses it to store the model weights, allocate workspace, and allocate the KV cache. If it is set too low, vLLM won't work at all and you will see errors. To avoid those errors and get the best performance, set it as high as possible.
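For example (a minimal sketch; the model name is just a placeholder):

```python
from vllm import LLM

# Allow vLLM to use up to 90% of GPU memory for weights, workspace,
# and the KV cache (0.9 is the default; values near 0 will fail).
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)
```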

WoosukKwon avatar Aug 31 '23 03:08 WoosukKwon

Does this mean I can't turn off the KV cache right now? With the KV cache turned on, the model uses historical data to generate answers every time. I don't want that, so I want to turn the KV cache off. Is there any configuration option in vLLM that disables the KV cache? @WoosukKwon

amulil avatar Aug 31 '23 03:08 amulil

In Hugging Face Transformers, the use_cache flag controls whether the KV cache is enabled during generation. There is currently no equivalent configuration option in vLLM.
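In Transformers it can be passed straight to generate, e.g. (a minimal sketch with gpt2 as a stand-in model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Hello, my name is", return_tensors="pt")

# use_cache=False forces keys/values to be recomputed at every decoding step.
out = model.generate(**inputs, max_new_tokens=20, use_cache=False)
print(tok.decode(out[0]))
```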

amulil avatar Aug 31 '23 03:08 amulil

@amulil vLLM does not support use_cache=False. I believe there is no reason to disable the KV cache, because it is a pure optimization that significantly reduces the FLOPs of generation.
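To see the scale of the saving, count the K/V projections over a decode of T steps (a back-of-the-envelope sketch):

```python
# Without the cache, step t must recompute K/V for all t prefix tokens;
# with the cache, each step computes K/V for the new token only.
T = 1024
without_cache = sum(range(1, T + 1))  # 1 + 2 + ... + T = T*(T+1)/2
with_cache = T
print(without_cache, with_cache)  # 524800 vs 1024 K/V projections
```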

WoosukKwon avatar Sep 01 '23 05:09 WoosukKwon

And please note that enabling KV cache never affects your model outputs.
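A toy single-head attention check illustrates why (a numpy sketch, not vLLM internals): decoding with a K/V cache is numerically identical to recomputing keys and values from scratch at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # Standard scaled dot-product attention for a single query vector.
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

tokens = rng.standard_normal((5, d))  # 5 toy token embeddings

# Without cache: recompute K and V for the whole prefix at each step.
outs_nocache = [attend(tokens[t] @ Wq, tokens[:t + 1] @ Wk, tokens[:t + 1] @ Wv)
                for t in range(len(tokens))]

# With cache: append one new K/V row per step and reuse the rest.
K_cache, V_cache, outs_cache = [], [], []
for t in range(len(tokens)):
    K_cache.append(tokens[t] @ Wk)
    V_cache.append(tokens[t] @ Wv)
    outs_cache.append(attend(tokens[t] @ Wq, np.array(K_cache), np.array(V_cache)))

assert all(np.allclose(a, b) for a, b in zip(outs_nocache, outs_cache))
```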

WoosukKwon avatar Sep 01 '23 05:09 WoosukKwon

> And please note that enabling KV cache never affects your model outputs.

I tested a Hugging Face model with use_cache. With use_cache=True, the output is the same when I type the same thing multiple times in a multi-turn conversation, but with use_cache=False the output is different even when I type the same thing multiple times.
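One way to isolate the cache effect is to compare greedy decoding, which is deterministic, with the cache on and off (a sketch with gpt2 as a stand-in model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The capital of France is", return_tensors="pt")

# do_sample=False removes sampling randomness, so any difference
# would have to come from the cache itself.
out_cached = model.generate(**inputs, do_sample=False, max_new_tokens=20, use_cache=True)
out_nocache = model.generate(**inputs, do_sample=False, max_new_tokens=20, use_cache=False)
print(tok.decode(out_cached[0]) == tok.decode(out_nocache[0]))  # expected: True
```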

amulil avatar Sep 01 '23 05:09 amulil

@amulil KV caching serves a different purpose in vLLM than Hugging Face's caching does.

Rahmat711 avatar Dec 20 '23 16:12 Rahmat711

> And please note that enabling KV cache never affects your model outputs.

@WoosukKwon I'm getting different output almost 10-15% of the time when the KV cache is enabled with my fine-tuned LLM. The output is correct whenever inference runs without any optimization technique. Is there a way to disable KV caching in vLLM?
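A minimal repro skeleton for what I'm seeing (a sketch; the model path and prompt are placeholders, and temperature=0 makes decoding greedy so any remaining variation is easier to attribute):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/finetuned-model")             # placeholder path
params = SamplingParams(temperature=0, max_tokens=64)  # greedy decoding

# Send the same prompt several times and diff the outputs.
outputs = llm.generate(["<same prompt>"] * 4, params)
for o in outputs:
    print(o.outputs[0].text)
```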

rajeevbaalwan avatar Mar 04 '24 16:03 rajeevbaalwan

@WoosukKwon, can we clear the KV cache on the GPU after prompting? I see that GPU usage keeps increasing with every prompt. Is there any other solution for this?

Sankethhhh avatar Mar 06 '24 04:03 Sankethhhh

Folks, clearing (or disabling) the KV cache can be an optimization as well if it spares you from reloading the engine for every config change when tuning latency and throughput, right? Please recognize this feature request, thank you.
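Today the workaround is to tear the engine down and rebuild it for each config, roughly like this (a sketch; as far as I know there is no official "clear the cache" API):

```python
import gc
import torch
from vllm import LLM

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.8)
# ... run experiments with this config ...

# Tear down and free GPU memory before trying the next config.
del llm
gc.collect()
torch.cuda.empty_cache()

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)
```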

nightflight-dk avatar May 15 '24 02:05 nightflight-dk