Prefix Prompt Cache
Workflow
- Prompt cache generation phase
  - Provide an interface that lets users register any number of prefixes
    - def add_prefix_template({prefix_name, prefix_content})
    - vLLM generates the KV cache for the given prefix_content and stores it
- def add_request gains an extra parameter prefix_name (see the usage sketch after this list)
  - If it is set, the request reuses the cached KV of that prefix_content
  - If a prefix_name that was never generated is passed in, raise an error (to be revisited later based on actual usage)
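A minimal usage sketch of the interface proposed above. `add_prefix_template` and the `prefix_name` argument to `add_request` are the names from this proposal, not the API that was finally merged, and the surrounding `LLM`/`SamplingParams` calls are only an assumption about how one would drive it:

```python
# Hypothetical usage of the proposed interface; add_prefix_template and the
# prefix_name argument are the names proposed above, not the merged API.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# 1. Register a prefix once: vLLM runs a prefill over prefix_content and
#    stores the resulting KV cache under prefix_name.
llm.llm_engine.add_prefix_template(
    prefix_name="support_prompt",
    prefix_content="You are a helpful customer-support assistant. ...",
)

# 2. Requests that pass prefix_name reuse the stored KV cache instead of
#    recomputing the prefix; an unknown prefix_name raises an error.
llm.llm_engine.add_request(
    request_id="req-0",
    prompt="My order arrived damaged. What should I do?",
    sampling_params=SamplingParams(max_tokens=128),
    prefix_name="support_prompt",
)
```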
Implementation
- Prompt cache generation phase
  - add_prefix_template lives in llm_engine
  - Add an execute_prefix method in the worker to generate the prefix's KV cache and store it
    - The model forward pass produces the KV in discrete (paged) blocks
    - After the forward pass, gather them into a contiguous copy
  - A global dict maps each prefix_name to a seq_group; the seq_group additionally holds
    - the discrete GPU memory of the KV cache; if prefix_token_id % block_table != 0 (the prefix does not end on a block boundary), a copy is needed
    - the contiguous GPU memory of the KV cache
- add_request:
  - The corresponding KV cache is kept in two copies (see the sketch after this list)
    - a contiguous one: convenient for prompt computation and avoids extra GPU memory allocation where possible
    - a discrete one: convenient for the generation (decode) computation
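A rough sketch of the bookkeeping described above: each registered prefix keeps both a contiguous and a block-scattered (discrete) copy of its KV cache in a global dict. All names here (`PrefixCacheEntry`, `prefix_registry`, `register_prefix`, `lookup_prefix`) are illustrative only; the PR itself attaches this state to a seq_group rather than a standalone registry:

```python
# Illustrative sketch only; the PR stores this state on a seq_group, and the
# names below do not correspond to actual vLLM classes.
from dataclasses import dataclass
from typing import Dict, List

import torch


@dataclass
class PrefixCacheEntry:
    prefix_token_ids: List[int]
    # Contiguous copy: one tensor per layer shaped
    # [2 (k/v), prefix_len, num_kv_heads, head_dim], used for prompt prefill.
    contiguous_kv: List[torch.Tensor]
    # Discrete copy: indices of the paged KV-cache blocks holding the prefix,
    # used by the generation (decode) kernels.
    block_ids: List[int]
    # True when the prefix does not end on a block boundary; the last partial
    # block cannot be shared in place and must be copied per request.
    needs_tail_copy: bool


# Global registry: one entry per prefix_name, shared by all requests.
prefix_registry: Dict[str, PrefixCacheEntry] = {}


def register_prefix(name: str, token_ids: List[int],
                    contiguous_kv: List[torch.Tensor],
                    block_ids: List[int], block_size: int) -> None:
    prefix_registry[name] = PrefixCacheEntry(
        prefix_token_ids=token_ids,
        contiguous_kv=contiguous_kv,
        block_ids=block_ids,
        needs_tail_copy=len(token_ids) % block_size != 0,
    )


def lookup_prefix(name: str) -> PrefixCacheEntry:
    # Mirrors the proposal: a prefix_name that was never generated is an error.
    if name not in prefix_registry:
        raise KeyError(f"prefix '{name}' has not been generated")
    return prefix_registry[name]
```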
Performance test
| prefix_tokens:prompt_tokens = 1:3.05 | qps | prefix_prompt_throughput (tokens/s) | prompt_throughput (tokens/s) | generate_throughput (tokens/s) | prompt_time:generate_time | prompt_tokens:generate_tokens |
|---|---|---|---|---|---|---|
| prompt_cache | 26 | 17008.81 | 12814.70034 | 6360.76 | 66.84:4.87 | 36.72:1 |
| no_prompt_cache | 21.849 | 13307.92 | 13307.92 | 6091 | 84.51:4.28 | 36.72:1 |
With prefix:prompt = 0.32:1, prompt speed improves by about 26% (prompt time drops from 84.51 to 66.84, i.e. 84.51 / 66.84 ≈ 1.26).
Hi @InkdyeHuang, thanks for your contribution. Did you get the same or similar results when using prompt prefix caching? Our team tried your code with llama but obtained incorrect results. Any suggestion is welcome.
I have fixed some bugs, tried with the llama model, and got the same results when using prompt prefix caching. You can pull the latest PR to try again.
Closing this issue since we have merged #1669. Please see our follow-up plan on automated prefix caching in #2614.