
Prompt caching

AIApprentice101 opened this issue 1 year ago · 6 comments

I saw that other folks have proposed caching overlapping prompts for reuse. For example, when the system prompt includes long few-shot examples, encoding it for every request is inefficient.

This newly released paper may be useful: https://huggingface.co/papers/2311.04934.

AIApprentice101 avatar Nov 13 '23 00:11 AIApprentice101
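To make the request above concrete, here is a minimal sketch of that workload with the offline vLLM API (the model name, few-shot prefix, and queries are placeholders). Without prompt caching, the shared prefix is re-encoded on every call.

```python
from vllm import LLM, SamplingParams

# Long few-shot system prompt shared verbatim across every request (placeholder text).
FEW_SHOT_PREFIX = (
    "You are a sentiment classifier.\n"
    "Review: 'Great battery life.' -> positive\n"
    "Review: 'Screen cracked in a week.' -> negative\n"
    # ... imagine dozens more examples here ...
)

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=8)

queries = ["Review: 'Fast shipping.' ->", "Review: 'Never arrived.' ->"]

# Each call currently re-encodes FEW_SHOT_PREFIX from scratch; a prompt cache
# would let the engine reuse the KV cache computed for the shared prefix.
outputs = llm.generate([FEW_SHOT_PREFIX + q for q in queries], params)
for out in outputs:
    print(out.outputs[0].text)
```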

This paper may be helpful: https://arxiv.org/pdf/2311.04934.pdf

irasin avatar Nov 13 '23 02:11 irasin

@irasin we are sharing the same paper lol~

AIApprentice101 avatar Nov 13 '23 02:11 AIApprentice101

@AIApprentice101 Sorry, didn't notice that

irasin avatar Nov 13 '23 02:11 irasin

I think the vLLM authors are aware of this problem, since they mention it in their paper (the text starting at the end of page 7, and Figure 10). However, I can't find the implementation in their released code.

Related issue: https://github.com/vllm-project/vllm/issues/1627

@WoosukKwon

rayleizhu avatar Nov 14 '23 03:11 rayleizhu

We are aware of this new approach and actively evaluating it.

simon-mo avatar Nov 15 '23 19:11 simon-mo

We are aware of this new approach and actively evaluating it.

I also use few-shot examples in my prompts, but it's really slow right now.

linchen111 avatar Dec 20 '23 10:12 linchen111

Any updates here?

bryanhpchiang avatar Jan 29 '24 07:01 bryanhpchiang

We are aware of this new approach and actively evaluating it.

I also use few-shot examples in my prompts, but it's really slow right now.

This makes sense: Prompt Cache only helps the prompt (prefill) phase, which fills the KV cache and generates the first response token. It cannot help the autoregressive generation phase, which is where most of the response latency comes from.

See how to further improve the efficiency of LLM serving with a long shared system prompt here: RelayAttention for Efficient Large Language Model Serving with Long System Prompts

@simon-mo Would you consider integrating RelayAttention into vLLM? If so, I'd be glad to participate.

rayleizhu avatar Feb 23 '24 04:02 rayleizhu
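As a rough way to see the prefill/decode split described above, one can compare the latency of generating a single token (approximately prefill plus the first token) against a full response. A minimal sketch with the offline API; the model name, prompt, and token counts are placeholders:

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
prompt = "Long few-shot system prompt ... Question: ..."  # placeholder prompt

def timed_generate(max_tokens: int) -> float:
    """Time one request end to end for a given output length."""
    start = time.perf_counter()
    llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=max_tokens))
    return time.perf_counter() - start

prefill_time = timed_generate(1)    # ~ prefill + first token
total_time = timed_generate(256)    # prefill + 256 decode steps
print(f"prefill + first token: {prefill_time:.2f}s")
print(f"full response:         {total_time:.2f}s")
print(f"decode share:          {(total_time - prefill_time) / total_time:.0%}")
```

For long outputs the decode share dominates, which is why prefix/prompt caching mainly improves time to first token rather than total generation time.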

--enable-prefix-caching

hmellor avatar May 31 '24 20:05 hmellor
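For anyone landing here later, a minimal sketch of turning this on via the offline API (the model name and prompts are placeholders; the constructor kwarg mirrors the server flag above):

```python
from vllm import LLM, SamplingParams

# Mirrors --enable-prefix-caching on the API server: KV-cache blocks computed
# for a shared prompt prefix can be reused by later requests.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)  # placeholder model

shared_prefix = "Long few-shot system prompt ...\n"  # placeholder prefix
params = SamplingParams(temperature=0.0, max_tokens=32)

# The first call populates the cache for shared_prefix; the second call can
# skip recomputing those KV blocks during prefill.
llm.generate([shared_prefix + "Question A"], params)
llm.generate([shared_prefix + "Question B"], params)
```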

We are aware of this new approach and actively evaluating it.

I also use few-shot examples in my prompts, but it's really slow right now.

This makes sense: Prompt Cache only helps the prompt (prefill) phase, which fills the KV cache and generates the first response token. It cannot help the autoregressive generation phase, which is where most of the response latency comes from.

See how to further improve the efficiency of LLM serving with a long shared system prompt here: RelayAttention for Efficient Large Language Model Serving with Long System Prompts

@simon-mo Would you consider integrating RelayAttention into vLLM? If so, I'd be glad to participate.

How is this better than just hashing the prefixes, which is the simplest approach?

What's the difference in performance between --enable-prefix-caching and your approach?

AnaRhisT94 avatar Jul 28 '24 11:07 AnaRhisT94
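For reference, the "hashing the prefixes" idea can be sketched roughly as follows. This is a simplified illustration of block-hash-based prefix reuse, not vLLM's internals; the block size and data structures here are made up.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chain a hash over fixed-size token blocks, so two prompts that share a
    prefix produce identical leading hashes and can share the same KV blocks."""
    hashes, prev = [], ""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256((prev + str(block)).encode()).hexdigest()
        hashes.append(prev)
    return hashes

# Maps block hash -> (hypothetical) physical KV block id.
cache: dict[str, int] = {}

def cached_prefix_length(token_ids: list[int]) -> int:
    """Return how many leading tokens already have KV blocks in the cache."""
    hit = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hit += BLOCK_SIZE
    return hit

# Toy demo: request B shares only its first 16-token block with request A.
a = list(range(40))
b = list(range(40)); b[20] = 999
for h in block_hashes(a):
    cache[h] = len(cache)          # pretend the KV blocks for A were stored
print(cached_prefix_length(b))     # -> 16
```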