PyTorch Engine hash-table-based prefix caching
Implementation of https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2044203407
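For context, the core idea from the linked issue, sketched very roughly below: the KV cache is divided into fixed-size token blocks, and a hash of the full token prefix ending at each block boundary keys a table of reusable blocks, so requests sharing a prefix (e.g. a common system prompt) can skip recomputing those blocks. This is only an illustrative sketch; the class, `BLOCK_SIZE`, and the use of Python's built-in `hash` are assumptions, not the code in this PR.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 64  # tokens per KV block; illustrative value, not the PR's


class PrefixCache:
    """Map a hash of each full token prefix (in BLOCK_SIZE chunks)
    to the id of a reusable KV block."""

    def __init__(self) -> None:
        self._table: Dict[int, int] = {}  # prefix hash -> block id

    @staticmethod
    def _block_hashes(token_ids: List[int]) -> List[Tuple[int, int]]:
        # One hash per complete block. Each hash covers the *entire*
        # prefix ending at that block, so a hit implies every earlier
        # block matched as well.
        return [(end, hash(tuple(token_ids[:end])))
                for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE)]

    def match(self, token_ids: List[int]) -> Tuple[int, List[int]]:
        """Return (num cached tokens, cached block ids) for the longest
        cached prefix of token_ids."""
        matched, blocks = 0, []
        for end, h in self._block_hashes(token_ids):
            block_id = self._table.get(h)
            if block_id is None:
                break
            matched = end
            blocks.append(block_id)
        return matched, blocks

    def insert(self, token_ids: List[int], block_ids: List[int]) -> None:
        # Register each complete block of a finished prefill for reuse.
        for (_, h), block_id in zip(self._block_hashes(token_ids), block_ids):
            self._table.setdefault(h, block_id)


cache = PrefixCache()
sys_prompt = list(range(256))  # pretend token ids of a shared system prompt
cache.insert(sys_prompt, block_ids=[0, 1, 2, 3])
matched, blocks = cache.match(sys_prompt + [7, 8, 9])
assert matched == 256 and blocks == [0, 1, 2, 3]
```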
I plan to refactor the S-LoRA implementation so we do not need to change the block size when enabling adapters.
@zhyncs @ispobock
So productive! We will review it ASAP.
Hi @grimoire Could you provide performance benchmarks and evaluation results? ref https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2046702194
LGTM
@zhyncs
llama13b + 128 concurrency + 3000 prompts
prompt = SYSTEM_PROMPT + prompt
w/o caching
concurrency: 128
elapsed_time: 576.054s
number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 1259.459 token/s
token throughput (prompt + completion token): 2575.934 token/s
RPS (request per second): 5.208 req/s
RPM (request per minute): 312.471 req/min
with prefix caching
concurrency: 128
elapsed_time: 531.635s
number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 1364.688 token/s
token throughput (prompt + completion token): 2791.155 token/s
RPS (request per second): 5.643 req/s
RPM (request per minute): 338.578 req/min
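As a quick arithmetic sanity check on the figures above (plain division on the reported numbers, not part of the benchmark script):

```python
# Numbers copied from the w/o-caching run above.
num_prompts = 3000
elapsed_s = 576.054
prompt_tokens = 758_360
completion_tokens = 725_516

print(completion_tokens / elapsed_s)                    # ~1259.459 token/s
print((prompt_tokens + completion_tokens) / elapsed_s)  # ~2575.934 token/s
print(num_prompts / elapsed_s)                          # ~5.208 req/s
print(num_prompts / elapsed_s * 60)                     # ~312.471 req/min
```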
@zhulinJulia24 please perform an evaluation test on the following models:
llama-2-7b, internlm-7b, internlm2-7b, internlm2-20b, qwen-7b, qwen1.5-7b
Datasets should include the following (see the assembly sketch after the list):
```python
from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
from .datasets.SuperGLUE_WSC.SuperGLUE_WSC_gen_7902a7 import WSC_datasets
from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
from .datasets.race.race_gen_69ee4f import race_datasets
from .datasets.crowspairs.crowspairs_gen_381af0 import crowspairs_datasets
```
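In an OpenCompass config these imports are typically combined into the single `datasets` list the runner expects; a sketch, assuming the imports above sit in a `read_base()` context as in standard OpenCompass configs:

```python
# Gather every imported *_datasets list into the `datasets` variable
# that OpenCompass reads (standard config-file idiom).
datasets = sum(
    [v for k, v in locals().items() if k.endswith('_datasets')], [])
```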
@zhyncs
This result differs slightly from the one obtained earlier with https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py, as shown in https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2046702194, but the difference is reasonable.
LGTM
ref https://github.com/InternLM/lmdeploy/pull/1450/files#r1570029820
@zhyncs I made some mistakes when performing the benchmark above.
After updating this line https://github.com/InternLM/lmdeploy/blob/1f72b8f33821051dafa35502f1efc2a60d2440c6/benchmark/profile_restful_api.py#L35, the new results are:
w/o caching
concurrency: 128
elapsed_time: 548.647s
number of prompt tokens: 1007244
number of completion tokens: 722476
token throughput (completion token): 1316.833 token/s
token throughput (prompt + completion token): 3152.704 token/s
RPS (request per second): 5.468 req/s
RPM (request per minute): 328.080 req/min
with caching
concurrency: 128
elapsed_time: 507.408s
number of prompt tokens: 1007244
number of completion tokens: 722476
token throughput (completion token): 1423.856 token/s
token throughput (prompt + completion token): 3408.932 token/s
RPS (request per second): 5.912 req/s
RPM (request per minute): 354.744 req/min
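From the corrected numbers, prefix caching gives roughly an 8% end-to-end gain on this workload; the check below is plain arithmetic on the figures above:

```python
print(548.647 / 507.408 - 1)    # ~0.081 -> ~8.1% less wall time
print(1423.856 / 1316.833 - 1)  # same ratio via completion throughput
```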
| dataset | version | metric | mode | internlm-chat-7b-turbomind | internlm-chat-7b-pytorch | llama-2-7b-chat-turbomind | llama-2-7b-chat-pytorch | internlm2-chat-7b-turbomind | internlm2-chat-7b-pytorch | internlm2-chat-7b-hf | internlm2-chat-20b-turbomind | internlm2-chat-20b-pytorch | qwen-7b-chat-turbomind | qwen1.5-7b-chat-pytorch | qwen1.5-7b-chat-hf | qwen-7b-chat-hf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| --------- Exam --------- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| ceval | - | naive_average | gen | 53.05 | 54.04 | 28.44 | 28.51 | 61.38 | 58.09 | 61.86 | 63.58 | 63.36 | 59.36 | 70.68 | 70.67 | - |
| mmlu | - | naive_average | gen | - | 52.88 | 35.32 | 35.41 | 63.59 | 57.77 | 56.15 | 67.09 | 59.28 | 57.51 | 61.38 | 61.48 | - |
| --------- Language --------- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| WiC | d06864 | accuracy | gen | 52.04 | 52.19 | 0 | 0 | 60.19 | 57.84 | 60.82 | 59.25 | 60.5 | 52.66 | 63.17 | 51.1 | - |
| WSC | 7902a7 | accuracy | gen | 60.58 | 60.58 | 0 | 0 | 68.27 | 55.77 | 65.38 | 50 | 49.04 | 32.69 | 41.35 | 37.5 | - |
| --------- Knowledge --------- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| triviaqa | 2121ce | score | gen | 37.84 | 37.77 | 56.09 | 56.11 | 58.48 | 55.92 | 58.23 | 64.07 | 63.96 | 54.37 | 44.49 | 44.76 | - |
| --------- Reasoning --------- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| gsm8k | 1d7fe4 | accuracy | gen | 34.8 | 34.57 | 28.2 | 27.98 | 71.57 | 37.98 | 45.11 | 75.36 | 68.61 | 55.27 | 48.67 | 55.5 | - |
| --------- Understanding --------- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| race-middle | 9a54b6 | accuracy | gen | 83.43 | 83.64 | 41.57 | 41.64 | 89.97 | 72.08 | 80.99 | 91.64 | 88.37 | 83.5 | 87.53 | 87.33 | - |
| race-high | 9a54b6 | accuracy | gen | 78.82 | 78.79 | 39.62 | 39.62 | 85.59 | 72.53 | 78.82 | 87.94 | 84.59 | 77.1 | 82.68 | 82.53 | - |
@zhulinJulia24 Hi, pr_test failed on the RESTful API tests. Is this failure caused by this PR?
We will merge it after v0.4.1 is released on May 8.
Hi @grimoire It looks like there is a bug in this PR. Here is the command I used:
```bash
lmdeploy serve api_server \
    /path/to/Qwen \
    --server-port 23333 \
    --backend pytorch \
    --cache-max-entry-count 0.95 \
    --enable-prefix-caching \
    --max-batch-size 128 \
    --log-level DEBUG \
    --tp 1
```
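For reference, a minimal client to drive the server above with a shared system prompt (the path prefix caching should hit); this is a sketch assuming the OpenAI-compatible `/v1` endpoint that `api_server` exposes, and the model name and prompts are placeholders:

```python
from openai import OpenAI

# Hypothetical client for the server started above; "Qwen" and the
# questions are placeholders, the shared system prompt is the prefix.
client = OpenAI(base_url="http://0.0.0.0:23333/v1", api_key="dummy")
system = "You are a helpful assistant."
for question in ["What is LMDeploy?", "What is prefix caching?"]:
    resp = client.chat.completions.create(
        model="Qwen",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
    )
    print(resp.choices[0].message.content)
```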
I am doing a code review and trying to fix it.