
How to remove the KV cache?

Open TuuSiwei opened this issue 1 year ago • 8 comments

Feature request

When I use the generate() function of a language model for inference, the kv-cache is also stored in the GPU memory. Is there any way to clear this kv-cache before continuing to call generate()?

Motivation

I have a lot of text to process, so I use a for loop to call generate(). To avoid OOM, I need to clear the kv-cache before the end of each loop iteration.

Your contribution

none

TuuSiwei avatar Jun 30 '24 12:06 TuuSiwei

Hi @tsw123678, you should be able to pass use_cache=False when calling generate. Note that this will result in significant slowdowns when generating.
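
For example, a minimal sketch (assuming model and tokenizer are already loaded on the GPU):

prompt = "Some example input"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# With use_cache=False no KV cache is kept between decoding steps, so memory
# stays flat, but every new token re-attends over the full prefix, which is slow.
out = model.generate(**inputs, use_cache=False, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))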

amyeroberts avatar Jul 01 '24 09:07 amyeroberts

> Hi @tsw123678, you should be able to pass use_cache=False when calling generate. Note that this will result in significant slowdowns when generating.

Thank you for your tips. Since I need to handle a large number of conversations (over 10,000), I can't afford the slowdown from disabling the KV cache. I just want to know if there is a way to clear the current KV cache after the generate call.

TuuSiwei avatar Jul 01 '24 10:07 TuuSiwei

Hi @tsw123678, to be able to help, you would need to clarify the construction of the loop, i.e. what's being looped over.

Am I right in understanding that the GPU memory is not freed after a generate call?

cc @gante

amyeroberts avatar Jul 01 '24 13:07 amyeroberts

> Hi @tsw123678, to be able to help, you would need to clarify the construction of the loop, i.e. what's being looped over.
>
> Am I right in understanding that the GPU memory is not freed after a generate call?
>
> cc @gante

Yeah, my loop code can be simplified as follows:

import torch

user_prompts = ["prompt 1", "prompt 2", "prompt 3", ..., "prompt n"]

for prompt in user_prompts:
    # tokenize to tensors on the GPU and generate from the ids, not the raw string
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    out = model.generate(input_ids)
    res = tokenizer.decode(out[0], skip_special_tokens=True)
    # some IO operation to save the result

    del input_ids, out, res
    torch.cuda.empty_cache()

I've noticed that deleting the CUDA-resident variables such as input_ids does not prevent the OOM issue. After analyzing it, I believe the accumulation of the KV cache is the cause of the OOM. Thank you very much for your help.
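
One way to confirm this (a rough sketch, assuming a single CUDA device) is to print allocator statistics around the generate call:

import torch

print(f"allocated before: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
out = model.generate(input_ids)
print(f"allocated after:  {torch.cuda.memory_allocated() / 1e9:.2f} GB")
del out
torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
print(f"after cleanup:    {torch.cuda.memory_allocated() / 1e9:.2f} GB")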

TuuSiwei avatar Jul 01 '24 13:07 TuuSiwei

Hi folks, has there been any development on this? I am having an issue, described here: https://discuss.huggingface.co/t/kv-cache-managment/95481

I believe it is down to the static KV cache. Note that I do not seem to experience any OOM, and GPU memory usage stays at 20 GB out of 40 GB. Inference just stops and leaves the process hanging. It would be nice to see more practical documentation around this feature, as it is not clear how to use/manage it beyond typical use.

Edit: in my case I use the text generation pipeline. The problem is present for Mistral 7B and Llama3-7B; the smaller Phi-3 (3.8B) does not run into this issue.

Edit 2: Please disregard my comment, I have narrowed the issue down to Chromadb, so it isn't relevant here. Though I am still interested in seeing an option to clear the cache.

swtb3 avatar Jul 04 '24 09:07 swtb3

Also cc @ArthurZucker re cache

amyeroberts avatar Jul 05 '24 11:07 amyeroberts

Hey! In the generate function we automatically clear the cache here: https://github.com/huggingface/transformers/blob/cd8d08df4be23aa55873ecb58aa1f646d279107c/src/transformers/generation/utils.py#L1444 if that helps!

ArthurZucker avatar Jul 10 '24 12:07 ArthurZucker

You can also just delete the cache, as it is an object returned by generate as well!
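
For example, a rough sketch (assuming a recent transformers version where the cache is part of the output when return_dict_in_generate=True):

import torch

out = model.generate(input_ids, max_new_tokens=64, return_dict_in_generate=True)
text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)

# out.past_key_values holds the KV cache; dropping every reference to the
# output lets Python free it, and empty_cache releases the blocks to the driver.
del out
torch.cuda.empty_cache()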

ArthurZucker avatar Jul 10 '24 12:07 ArthurZucker

@ArthurZucker - Can you please share more instructions for deleting the cache returned by model.generate()? I am experiencing an OOM error with iterative inference. I would like to clear the cache after a certain number of iterations.

prasiyer avatar Sep 27 '24 21:09 prasiyer

Hey! There are actually 2 things you can do:

  1. Set max_new_tokens to a specific number. Once generation is done, you just clear the cache and re-generate. This should not induce a recompile AFAIK.
  2. Use the awesome offloaded static cache, which will automatically offload the cache to the CPU. To avoid OOMs, that's where you should go (see the sketch after this list): https://huggingface.co/docs/transformers/main/en/kv_cache#offloaded-static-cache
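
A minimal sketch of option 2 (assuming a recent transformers version where generate accepts cache_implementation="offloaded_static", and that model, tokenizer and input_ids are set up as in the snippets above):

out = model.generate(
    input_ids,
    max_new_tokens=256,
    # keep the pre-allocated static KV cache in CPU memory and move each
    # layer's cache to the GPU only when needed, trading speed for GPU memory
    cache_implementation="offloaded_static",
)
text = tokenizer.decode(out[0], skip_special_tokens=True)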

ArthurZucker avatar Oct 03 '24 12:10 ArthurZucker