Prefix Prompt Cache
Workflow
- Prompt cache generation phase
  - Provide an interface that lets users register any number of prefixes
    - def add_prefix_template({prefix_name, prefix_content})
    - vLLM generates the KV cache for the given prefix_content and stores it
- def add_request gains an extra parameter prefix_name (see the usage sketch after this list)
  - If it is set, the request reuses the cached KV of that prefix_content
  - If a prefix_name that was never generated is passed in, raise an error (to be revisited later based on actual usage)
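A minimal usage sketch of the interface proposed above. `add_prefix_template` and the `prefix_name` argument to `add_request` are the names from this proposal, not the API that was finally merged, and the surrounding `LLM`/`SamplingParams` calls are only an assumption about how one would drive it:

```python
# Hypothetical usage of the proposed interface; add_prefix_template and the
# prefix_name argument are the names proposed above, not the merged API.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# 1. Register a prefix once: vLLM runs a prefill over prefix_content and
#    stores the resulting KV cache under prefix_name.
llm.llm_engine.add_prefix_template(
    prefix_name="support_prompt",
    prefix_content="You are a helpful customer-support assistant. ...",
)

# 2. Requests that pass prefix_name reuse the stored KV cache instead of
#    recomputing the prefix; an unknown prefix_name raises an error.
llm.llm_engine.add_request(
    request_id="req-0",
    prompt="My order arrived damaged. What should I do?",
    sampling_params=SamplingParams(max_tokens=128),
    prefix_name="support_prompt",
)
```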
Implementation
- Prompt cache generation phase
  - add_prefix_template lives in llm_engine
  - Add an execute_prefix method in the worker to generate the prefix's KV cache and store it
    - The model forward pass produces the KV in discrete (paged) blocks
    - After the forward pass, gather them into a contiguous copy
  - A global dict maps each prefix_name to a seq_group; the seq_group additionally holds
    - the discrete GPU memory of the KV cache; if prefix_token_id % block_table != 0 (the prefix does not end on a block boundary), a copy is needed
    - the contiguous GPU memory of the KV cache
- add_request:
  - The corresponding KV cache is kept in two copies (see the sketch after this list)
    - a contiguous one: convenient for prompt computation and avoids extra GPU memory allocation where possible
    - a discrete one: convenient for the generation (decode) computation
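A rough sketch of the bookkeeping described above: each registered prefix keeps both a contiguous and a block-scattered (discrete) copy of its KV cache in a global dict. All names here (`PrefixCacheEntry`, `prefix_registry`, `register_prefix`, `lookup_prefix`) are illustrative only; the PR itself attaches this state to a seq_group rather than a standalone registry:

```python
# Illustrative sketch only; the PR stores this state on a seq_group, and the
# names below do not correspond to actual vLLM classes.
from dataclasses import dataclass
from typing import Dict, List

import torch


@dataclass
class PrefixCacheEntry:
    prefix_token_ids: List[int]
    # Contiguous copy: one tensor per layer shaped
    # [2 (k/v), prefix_len, num_kv_heads, head_dim], used for prompt prefill.
    contiguous_kv: List[torch.Tensor]
    # Discrete copy: indices of the paged KV-cache blocks holding the prefix,
    # used by the generation (decode) kernels.
    block_ids: List[int]
    # True when the prefix does not end on a block boundary; the last partial
    # block cannot be shared in place and must be copied per request.
    needs_tail_copy: bool


# Global registry: one entry per prefix_name, shared by all requests.
prefix_registry: Dict[str, PrefixCacheEntry] = {}


def register_prefix(name: str, token_ids: List[int],
                    contiguous_kv: List[torch.Tensor],
                    block_ids: List[int], block_size: int) -> None:
    prefix_registry[name] = PrefixCacheEntry(
        prefix_token_ids=token_ids,
        contiguous_kv=contiguous_kv,
        block_ids=block_ids,
        needs_tail_copy=len(token_ids) % block_size != 0,
    )


def lookup_prefix(name: str) -> PrefixCacheEntry:
    # Mirrors the proposal: a prefix_name that was never generated is an error.
    if name not in prefix_registry:
        raise KeyError(f"prefix '{name}' has not been generated")
    return prefix_registry[name]
```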
Performance test
| prefix_tokens:prompt_tokens = 1:3.05 | qps | prefix_prompt_throughput (tokens/s) | prompt_throughput (tokens/s) | generate_throughput (tokens/s) | prompt_time:generate_time | prompt_tokens:generate_tokens |
|---|---|---|---|---|---|---|
| prompt_cache | 26 | 17008.81 | 12814.70034 | 6360.76 | 66.84:4.87 | 36.72:1 |
| no_prompt_cache | 21.849 | 13307.92 | 13307.92 | 6091 | 84.51:4.28 | 36.72:1 |
With prefix:prompt = 0.32:1, prompt speed improves by about 26% (prompt time drops from 84.51 to 66.84, i.e. 84.51 / 66.84 ≈ 1.26).
Hi @InkdyeHuang, thanks for your contribution. Did you get the same or similar results when using prompt prefix caching? Our team tried your code with llama but obtained incorrect results. Any suggestion is welcome.
I have fixed some bugs, tried with the llama model, and got the same results when using prompt prefix caching. You can pull the latest PR to try again.
Closing this issue since we have merged #1669. Please see our follow-up plan on automated prefix caching in #2614.