PyTorch Engine hash-table-based prefix caching
Implementation of https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2044203407
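For context, the core idea from the linked issue, sketched very roughly below: the KV cache is divided into fixed-size token blocks, and a hash of the full token prefix ending at each block boundary keys a table of reusable blocks, so requests sharing a prefix (e.g. a common system prompt) can skip recomputing those blocks. This is only an illustrative sketch; the class, `BLOCK_SIZE`, and the use of Python's built-in `hash` are assumptions, not the code in this PR.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 64  # tokens per KV block; illustrative value, not the PR's


class PrefixCache:
    """Map a hash of each full token prefix (in BLOCK_SIZE chunks)
    to the id of a reusable KV block."""

    def __init__(self) -> None:
        self._table: Dict[int, int] = {}  # prefix hash -> block id

    @staticmethod
    def _block_hashes(token_ids: List[int]) -> List[Tuple[int, int]]:
        # One hash per complete block. Each hash covers the *entire*
        # prefix ending at that block, so a hit implies every earlier
        # block matched as well.
        return [(end, hash(tuple(token_ids[:end])))
                for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE)]

    def match(self, token_ids: List[int]) -> Tuple[int, List[int]]:
        """Return (num cached tokens, cached block ids) for the longest
        cached prefix of token_ids."""
        matched, blocks = 0, []
        for end, h in self._block_hashes(token_ids):
            block_id = self._table.get(h)
            if block_id is None:
                break
            matched = end
            blocks.append(block_id)
        return matched, blocks

    def insert(self, token_ids: List[int], block_ids: List[int]) -> None:
        # Register each complete block of a finished prefill for reuse.
        for (_, h), block_id in zip(self._block_hashes(token_ids), block_ids):
            self._table.setdefault(h, block_id)


cache = PrefixCache()
sys_prompt = list(range(256))  # pretend token ids of a shared system prompt
cache.insert(sys_prompt, block_ids=[0, 1, 2, 3])
matched, blocks = cache.match(sys_prompt + [7, 8, 9])
assert matched == 256 and blocks == [0, 1, 2, 3]
```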
I plan to refactor the S-LoRA implementation so we do not need to change the block size when enabling adapters.
@zhyncs @ispobock
So productive! We will review it ASAP.
Hi @grimoire Could you provide performance benchmarks and evaluation results? ref https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2046702194
LGTM
@zhyncs
llama13b + 128 concurrency + 3000 prompts
prompt = SYSTEM_PROMPT + prompt
w/o caching
concurrency: 128
elapsed_time: 576.054s
number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 1259.459 token/s
token throughput (prompt + completion token): 2575.934 token/s
RPS (request per second): 5.208 req/s
RPM (request per minute): 312.471 req/min
with prefix caching
concurrency: 128
elapsed_time: 531.635s
number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 1364.688 token/s
token throughput (prompt + completion token): 2791.155 token/s
RPS (request per second): 5.643 req/s
RPM (request per minute): 338.578 req/min
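As a quick arithmetic sanity check on the figures above (plain division on the reported numbers, not part of the benchmark script):

```python
# Numbers copied from the w/o-caching run above.
num_prompts = 3000
elapsed_s = 576.054
prompt_tokens = 758_360
completion_tokens = 725_516

print(completion_tokens / elapsed_s)                    # ~1259.459 token/s
print((prompt_tokens + completion_tokens) / elapsed_s)  # ~2575.934 token/s
print(num_prompts / elapsed_s)                          # ~5.208 req/s
print(num_prompts / elapsed_s * 60)                     # ~312.471 req/min
```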
@zhulinJulia24 please perform an evaluation test on the following models:
llama-2-7b, internlm-7b, internlm2-7b, internlm2-20b, qwen-7b, qwen1.5-7b
Datasets should include the following (see the assembly sketch after the list):
```python
from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
from .datasets.SuperGLUE_WSC.SuperGLUE_WSC_gen_7902a7 import WSC_datasets
from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
from .datasets.race.race_gen_69ee4f import race_datasets
from .datasets.crowspairs.crowspairs_gen_381af0 import crowspairs_datasets
```
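In an OpenCompass config these imports are typically combined into the single `datasets` list the runner expects; a sketch, assuming the imports above sit in a `read_base()` context as in standard OpenCompass configs:

```python
# Gather every imported *_datasets list into the `datasets` variable
# that OpenCompass reads (standard config-file idiom).
datasets = sum(
    [v for k, v in locals().items() if k.endswith('_datasets')], [])
```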
@zhyncs
This result differs slightly from the one obtained earlier with https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py, as shown in https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2046702194, but the difference is reasonable.
LGTM
ref https://github.com/InternLM/lmdeploy/pull/1450/files#r1570029820
@zhyncs I made some mistakes when performing the benchmark above.
After updating this line https://github.com/InternLM/lmdeploy/blob/1f72b8f33821051dafa35502f1efc2a60d2440c6/benchmark/profile_restful_api.py#L35, the new results are:
w/o caching
concurrency: 128
elapsed_time: 548.647s
number of prompt tokens: 1007244
number of completion tokens: 722476
token throughput (completion token): 1316.833 token/s
token throughput (prompt + completion token): 3152.704 token/s
RPS (request per second): 5.468 req/s
RPM (request per minute): 328.080 req/min
with caching
concurrency: 128
elapsed_time: 507.408s
number of prompt tokens: 1007244
number of completion tokens: 722476
token throughput (completion token): 1423.856 token/s
token throughput (prompt + completion token): 3408.932 token/s
RPS (request per second): 5.912 req/s
RPM (request per minute): 354.744 req/min
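From the corrected numbers, prefix caching gives roughly an 8% end-to-end gain on this workload; the check below is plain arithmetic on the figures above:

```python
print(548.647 / 507.408 - 1)    # ~0.081 -> ~8.1% less wall time
print(1423.856 / 1316.833 - 1)  # same ratio via completion throughput
```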
| dataset | version | metric | mode | internlm-chat-7b-turbomind | internlm-chat-7b-pytorch | llama-2-7b-chat-turbomind | llama-2-7b-chat-pytorch | internlm2-chat-7b-turbomind | internlm2-chat-7b-pytorch | internlm2-chat-7b-hf | internlm2-chat-20b-turbomind | internlm2-chat-20b-pytorch | qwen-7b-chat-turbomind | qwen1.5-7b-chat-pytorch | qwen1.5-7b-chat-hf | qwen-7b-chat-hf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| --------- Exam --------- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| ceval | - | naive_average | gen | 53.05 | 54.04 | 28.44 | 28.51 | 61.38 | 58.09 | 61.86 | 63.58 | 63.36 | 59.36 | 70.68 | 70.67 | - |
| mmlu | - | naive_average | gen | - | 52.88 | 35.32 | 35.41 | 63.59 | 57.77 | 56.15 | 67.09 | 59.28 | 57.51 | 61.38 | 61.48 | - |
| --------- Language --------- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| WiC | d06864 | accuracy | gen | 52.04 | 52.19 | 0 | 0 | 60.19 | 57.84 | 60.82 | 59.25 | 60.5 | 52.66 | 63.17 | 51.1 | - |
| WSC | 7902a7 | accuracy | gen | 60.58 | 60.58 | 0 | 0 | 68.27 | 55.77 | 65.38 | 50 | 49.04 | 32.69 | 41.35 | 37.5 | - |
| --------- Knowledge --------- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| triviaqa | 2121ce | score | gen | 37.84 | 37.77 | 56.09 | 56.11 | 58.48 | 55.92 | 58.23 | 64.07 | 63.96 | 54.37 | 44.49 | 44.76 | - |
| --------- Reasoning --------- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| gsm8k | 1d7fe4 | accuracy | gen | 34.8 | 34.57 | 28.2 | 27.98 | 71.57 | 37.98 | 45.11 | 75.36 | 68.61 | 55.27 | 48.67 | 55.5 | - |
| --------- Understanding --------- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| race-middle | 9a54b6 | accuracy | gen | 83.43 | 83.64 | 41.57 | 41.64 | 89.97 | 72.08 | 80.99 | 91.64 | 88.37 | 83.5 | 87.53 | 87.33 | - |
| race-high | 9a54b6 | accuracy | gen | 78.82 | 78.79 | 39.62 | 39.62 | 85.59 | 72.53 | 78.82 | 87.94 | 84.59 | 77.1 | 82.68 | 82.53 | - |
@zhulinJulia24 Hi, pr_test failed on the RESTful API tests. Is this failure caused by this PR?
We will merge it after v0.4.1 is released on May 8.
Hi @grimoire It looks like there is a bug in this PR. Here is the command I used:
```bash
lmdeploy serve api_server \
    /path/to/Qwen \
    --server-port 23333 \
    --backend pytorch \
    --cache-max-entry-count 0.95 \
    --enable-prefix-caching \
    --max-batch-size 128 \
    --log-level DEBUG \
    --tp 1
```
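For reference, a minimal client to drive the server above with a shared system prompt (the path prefix caching should hit); this is a sketch assuming the OpenAI-compatible `/v1` endpoint that `api_server` exposes, and the model name and prompts are placeholders:

```python
from openai import OpenAI

# Hypothetical client for the server started above; "Qwen" and the
# questions are placeholders, the shared system prompt is the prefix.
client = OpenAI(base_url="http://0.0.0.0:23333/v1", api_key="dummy")
system = "You are a helpful assistant."
for question in ["What is LMDeploy?", "What is prefix caching?"]:
    resp = client.chat.completions.create(
        model="Qwen",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
    )
    print(resp.choices[0].message.content)
```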
I am doing a code review and trying to fix it.