
Prefix Prompt Cache

InkdyeHuang opened this issue · 2 comments

Workflow

1. Prompt-cache generation stage
   - Provide an interface that lets users register one or more prefixes (a sketch of the proposed interface follows this list):
     - `def add_prefix_template(prefix_name, prefix_content)`
   - vLLM generates the KV cache for the given `prefix_content` and stores it.
2. Request stage
   - `add_request` takes an additional parameter, `prefix_name`:
     - If set, the request reuses the cached KV of that `prefix_content`.
     - If a `prefix_name` that was never generated is passed in, raise an error (to be adjusted later based on actual usage).
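A minimal sketch of how this proposed interface could look. The class name `PrefixCacheEngine`, the dict-based registry, and the method bodies are illustrative assumptions, not the actual PR code:

```python
from __future__ import annotations

# Hypothetical sketch of the proposed interface; not the actual PR code.

class PrefixCacheEngine:
    """Toy engine illustrating the proposed prefix-template workflow."""

    def __init__(self) -> None:
        # Global registry: prefix_name -> pre-computed prefix state.
        self._prefix_templates: dict[str, str] = {}

    def add_prefix_template(self, prefix_name: str, prefix_content: str) -> None:
        """Pre-compute and store the KV cache for a reusable prefix."""
        if prefix_name in self._prefix_templates:
            raise ValueError(f"prefix {prefix_name!r} is already registered")
        # In the real design the worker would run a forward pass over
        # prefix_content and keep the resulting KV cache; here we just
        # store the text as a stand-in.
        self._prefix_templates[prefix_name] = prefix_content

    def add_request(self, request_id: str, prompt: str,
                    prefix_name: str | None = None) -> None:
        """Schedule a request, optionally reusing a registered prefix."""
        if prefix_name is not None:
            if prefix_name not in self._prefix_templates:
                # Per the proposal: error out on a never-generated prefix_name.
                raise KeyError(f"prefix {prefix_name!r} was never generated")
            # The real engine would attach the cached KV blocks; here we
            # just prepend the prefix text to show the intended semantics.
            prompt = self._prefix_templates[prefix_name] + prompt
        print(f"scheduling {request_id}: {prompt[:50]}...")


engine = PrefixCacheEngine()
engine.add_prefix_template("sys", "You are a helpful assistant. ")
engine.add_request("req-0", "Summarize this document.", prefix_name="sys")
```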


Implementation

1. Prompt-cache generation stage
   - `add_prefix_template` lives in `llm_engine`:
     - The worker gets an `execute_prefix` method that generates the prefix's KV cache and stores it.
     - The model forward pass produces the KV cache as discrete (paged) blocks.
     - After the computation, gather the blocks into a contiguous buffer.
   - A global dict maps each `prefix_name` to a `seq_group`; the `seq_group` additionally holds:
     1. the discrete (paged) KV-cache memory; if the prefix does not end on a block boundary (`prefix_token_id % block_size != 0`), the partial last block needs a copy;
     2. the contiguous KV-cache memory.
2. Request stage
   - `add_request`: the prefix's KV cache is kept in two copies (see the sketch after this list):
     1. a contiguous copy, convenient for prompt computation and avoiding extra memory allocation;
     2. a discrete (paged) copy, convenient for generation (decode) computation.
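To make the dual-storage idea concrete, here is a small NumPy sketch of keeping a prefix's KV cache both as one contiguous buffer and as fixed-size blocks, including the block-boundary copy rule above. The block size, array shapes, and helper name are assumptions for illustration; the real code operates on GPU KV blocks inside the worker:

```python
# Hypothetical NumPy sketch of the dual KV-cache layout; block size,
# shapes, and names are illustrative, not vLLM internals.
import numpy as np

BLOCK_SIZE = 16   # tokens per KV block in the paged layout
HEAD_DIM = 8      # toy hidden size

def store_prefix_kv(prefix_kv: np.ndarray, block_size: int = BLOCK_SIZE):
    """Keep two copies of a prefix's KV cache.

    - contiguous: one dense buffer, convenient for the prompt forward
      pass (no gather / re-allocation at request time);
    - discrete: fixed-size blocks matching the paged layout used during
      generation (decode).
    """
    num_tokens = prefix_kv.shape[0]
    contiguous = prefix_kv.copy()  # one dense buffer for prompt computation

    # Split into fixed-size blocks for decode. If the prefix does not end
    # exactly on a block boundary (num_tokens % block_size != 0), the last
    # partial block must be copied so a request can fill it with its own
    # tokens without corrupting the shared prefix cache.
    blocks = [prefix_kv[i:i + block_size].copy()
              for i in range(0, num_tokens, block_size)]
    needs_tail_copy = num_tokens % block_size != 0
    return contiguous, blocks, needs_tail_copy

kv = np.random.rand(35, HEAD_DIM)              # a 35-token prefix
contiguous, blocks, needs_tail_copy = store_prefix_kv(kv)
print(len(blocks), needs_tail_copy)            # 3 blocks, True (35 % 16 != 0)
```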

Performance test

Test setup: prefix_tokens : prompt_tokens = 1 : 3.05

|                 | qps    | prefix_prompt_throughput (tokens/s) | prompt_throughput (tokens/s) | generate_throughput (tokens/s) | prompt_time : generate_time | prompt_tokens : generate_tokens |
| --------------- | ------ | ----------------------------------- | ---------------------------- | ------------------------------ | --------------------------- | ------------------------------- |
| prompt_cache    | 26     | 17008.81                            | 12814.70034                  | 6360.76                        | 66.84 : 4.87                | 36.72 : 1                       |
| no_prompt_cache | 21.849 | 13307.92                            | 13307.92                     | 6091                           | 84.51 : 4.28                | 36.72 : 1                       |

With prefix : prompt = 0.32 : 1, prompt processing is about 26% faster (prompt time falls from 84.51 to 66.84; 84.51 / 66.84 ≈ 1.26).

InkdyeHuang · Dec 28, 2023

Hi @InkdyeHuang, thanks for your contribution. Did you get the same or similar results when using prompt prefix caching? Our team tried your code with llama but obtained incorrect results. Any suggestions are welcome.

chenxu2048 · Jan 5, 2024

> Hi @InkdyeHuang, thanks for your contribution. Did you get the same or similar results when using prompt prefix caching? Our team tried your code with llama but obtained incorrect results. Any suggestions are welcome.

I have fixed some bugs and tried again with the llama model; I get the same results when using prompt prefix caching. You can pull the latest PR and try again.

InkdyeHuang · Jan 5, 2024

Closing this issue since we have merged #1669. Please see our follow-up plan for automated prefix caching in #2614.

zhuohan123 · Feb 19, 2024