Prompt caching
I saw that other folks have proposed caching overlapping prompts for reuse. For example, when the system prompt includes long few-shot examples, encoding it on every request is not efficient.
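To make the request pattern concrete, here is a minimal sketch (the prompt text and queries are made up) of many requests sharing the same long few-shot prefix:

```python
# Hypothetical example of the request pattern: every request repeats the same long
# few-shot system prompt, so without caching the engine re-runs prefill over the
# shared prefix for each request.
FEW_SHOT_SYSTEM_PROMPT = (
    "You are a helpful assistant. Follow the style of these examples.\n"
    "Q: ... A: ...\n"  # imagine many long few-shot examples here
    "Q: ... A: ...\n"
)

user_queries = [
    "Summarize this support ticket: ...",
    "Classify this review: ...",
]

# The prefix is identical across all prompts; a prompt/prefix cache would compute
# its KV entries once and reuse them for every request.
prompts = [FEW_SHOT_SYSTEM_PROMPT + q for q in user_queries]
```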
This newly released paper may be useful: https://huggingface.co/papers/2311.04934.
This paper may be helpful: https://arxiv.org/pdf/2311.04934.pdf
@irasin we are sharing the same paper lol~
@AIApprentice101 Sorry, didn't notice that
I think the vLLM authors are aware of this problem because they mention it in their paper (the text starting at the end of page 7, and Figure 10). However, I haven't been able to find the implementation in their released code.
Related issue: https://github.com/vllm-project/vllm/issues/1627
@WoosukKwon
We are aware of this new approach and actively evaluating it.
I also use few-shot examples in my prompt, but it's really slow now.
Any updates here?
This makes sense, as Prompt Cache only helps the prompt phase, which prefills the KV cache and generates the first response token. It cannot help the autoregressive generation phase, which is the hot spot of response generation.
See how to further improve the efficiency of LLM serving with a long shared system prompt here: RelayAttention for Efficient Large Language Model Serving with Long System Prompts
@simon-mo Would you consider integrating RelayAttention into vLLM? If so, I would be glad to participate.
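To see the prefill/decode split described above in practice, here is a rough sketch (not a rigorous benchmark; the model name and prompt are placeholders, and it assumes vLLM's offline `LLM` API): a request with `max_tokens=1` roughly measures the prompt/prefill phase, and the extra time taken by a long completion is the autoregressive decode phase that prompt caching cannot shorten.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
prompt = "<long few-shot system prompt> ... <user question>"  # placeholder prompt

t0 = time.perf_counter()
llm.generate([prompt], SamplingParams(max_tokens=1))    # ~ prefill + first token
t1 = time.perf_counter()
llm.generate([prompt], SamplingParams(max_tokens=256))  # prefill + up to 256 decode steps
t2 = time.perf_counter()

print(f"prefill-dominated request: {t1 - t0:.2f} s")
print(f"decode-heavy request:      {t2 - t1:.2f} s")
```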
--enable-prefix-caching
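For reference, a sketch of where that flag would go (double-check the exact entrypoint and argument name against your installed vLLM version):

```python
# OpenAI-compatible server (CLI flag, as mentioned above):
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-2-7b-hf \
#       --enable-prefix-caching

# Offline inference: the keyword form is assumed to mirror the CLI flag.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
```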
How is this better than just hashing the prefixes, which is the simplest approach?
What's the performance difference between using --enable-prefix-caching and your approach?
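For readers wondering what "hashing the prefixes" refers to, here is a toy sketch (not vLLM's actual implementation; all names are illustrative) of block-level hash-based prefix caching, where each block of token ids is keyed by a hash of the entire prefix up to and including that block, so cached KV blocks are reused only when everything before them matches:

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16
kv_cache: Dict[int, object] = {}  # block hash -> (placeholder for) cached KV block


def compute_kv_block(token_ids: List[int]) -> object:
    """Stand-in for running the model's prefill over one block of tokens."""
    return tuple(token_ids)  # placeholder "KV" payload


def prefill_with_prefix_cache(token_ids: List[int]) -> Tuple[int, int]:
    hits, misses = 0, 0
    for start in range(0, len(token_ids), BLOCK_SIZE):
        # The hash covers the whole prefix, so two blocks match only if all
        # tokens before them match as well.
        prefix_key = hash(tuple(token_ids[: start + BLOCK_SIZE]))
        if prefix_key in kv_cache:
            hits += 1
        else:
            kv_cache[prefix_key] = compute_kv_block(token_ids[start : start + BLOCK_SIZE])
            misses += 1
    return hits, misses


shared = list(range(64))                                  # long shared system prompt
print(prefill_with_prefix_cache(shared + [1000, 1001]))   # (0, 5): all blocks miss
print(prefill_with_prefix_cache(shared + [2000, 2001]))   # (4, 1): shared blocks hit
```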