Prompt caching
I saw that other folks have proposed caching overlapping prompts for reuse. For example, when the system prompt includes long few-shot examples, encoding it on every request is not efficient.
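To make the request pattern concrete, here is a minimal sketch (the prompt text and queries are made up) of many requests sharing the same long few-shot prefix:

```python
# Hypothetical example of the request pattern: every request repeats the same long
# few-shot system prompt, so without caching the engine re-runs prefill over the
# shared prefix for each request.
FEW_SHOT_SYSTEM_PROMPT = (
    "You are a helpful assistant. Follow the style of these examples.\n"
    "Q: ... A: ...\n"  # imagine many long few-shot examples here
    "Q: ... A: ...\n"
)

user_queries = [
    "Summarize this support ticket: ...",
    "Classify this review: ...",
]

# The prefix is identical across all prompts; a prompt/prefix cache would compute
# its KV entries once and reuse them for every request.
prompts = [FEW_SHOT_SYSTEM_PROMPT + q for q in user_queries]
```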
This newly released paper may be useful: https://huggingface.co/papers/2311.04934.
This paper may be helpful: https://arxiv.org/pdf/2311.04934.pdf
@irasin we are sharing the same paper lol~
@AIApprentice101 Sorry, didn't notice that
I think the vLLM authors are aware of this problem because they mention it in their paper (the text starting at the end of page 7, and Figure 10). However, I haven't been able to find the implementation in their released code.
Related issue: https://github.com/vllm-project/vllm/issues/1627
@WoosukKwon
We are aware of this new approach and actively evaluating it.
I also use few-shot examples in my prompt, but it's really slow now.
Any updates here?
This makes sense, as Prompt Cache only helps the prompt phase, which prefills the KV cache and generates the first response token. It cannot help the autoregressive generation phase, which is the hot spot of response generation.
See how to further improve the efficiency of LLM serving with a long shared system prompt here: RelayAttention for Efficient Large Language Model Serving with Long System Prompts
@simon-mo Would you consider integrating RelayAttention into vLLM? If so, I would be glad to participate.
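To see the prefill/decode split described above in practice, here is a rough sketch (not a rigorous benchmark; the model name and prompt are placeholders, and it assumes vLLM's offline `LLM` API): a request with `max_tokens=1` roughly measures the prompt/prefill phase, and the extra time taken by a long completion is the autoregressive decode phase that prompt caching cannot shorten.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
prompt = "<long few-shot system prompt> ... <user question>"  # placeholder prompt

t0 = time.perf_counter()
llm.generate([prompt], SamplingParams(max_tokens=1))    # ~ prefill + first token
t1 = time.perf_counter()
llm.generate([prompt], SamplingParams(max_tokens=256))  # prefill + up to 256 decode steps
t2 = time.perf_counter()

print(f"prefill-dominated request: {t1 - t0:.2f} s")
print(f"decode-heavy request:      {t2 - t1:.2f} s")
```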
--enable-prefix-caching
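For reference, a sketch of where that flag would go (double-check the exact entrypoint and argument name against your installed vLLM version):

```python
# OpenAI-compatible server (CLI flag, as mentioned above):
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-2-7b-hf \
#       --enable-prefix-caching

# Offline inference: the keyword form is assumed to mirror the CLI flag.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
```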
How is this better than just hashing the prefixes, which is the simplest approach?
What's the performance difference between using --enable-prefix-caching and your approach?
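For readers wondering what "hashing the prefixes" refers to, here is a toy sketch (not vLLM's actual implementation; all names are illustrative) of block-level hash-based prefix caching, where each block of token ids is keyed by a hash of the entire prefix up to and including that block, so cached KV blocks are reused only when everything before them matches:

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16
kv_cache: Dict[int, object] = {}  # block hash -> (placeholder for) cached KV block


def compute_kv_block(token_ids: List[int]) -> object:
    """Stand-in for running the model's prefill over one block of tokens."""
    return tuple(token_ids)  # placeholder "KV" payload


def prefill_with_prefix_cache(token_ids: List[int]) -> Tuple[int, int]:
    hits, misses = 0, 0
    for start in range(0, len(token_ids), BLOCK_SIZE):
        # The hash covers the whole prefix, so two blocks match only if all
        # tokens before them match as well.
        prefix_key = hash(tuple(token_ids[: start + BLOCK_SIZE]))
        if prefix_key in kv_cache:
            hits += 1
        else:
            kv_cache[prefix_key] = compute_kv_block(token_ids[start : start + BLOCK_SIZE])
            misses += 1
    return hits, misses


shared = list(range(64))                                  # long shared system prompt
print(prefill_with_prefix_cache(shared + [1000, 1001]))   # (0, 5): all blocks miss
print(prefill_with_prefix_cache(shared + [2000, 2001]))   # (4, 1): shared blocks hit
```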