
PyTorch Engine hash table based prefix caching

grimoire opened this issue 1 year ago · 12 comments

Implementation of https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2044203407

I plan to refactor the S-LoRA implementation so that we do not need to change the block size when adapters are enabled.
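
At a high level, the scheme splits each sequence's KV cache into fixed-size blocks, hashes every full block together with the hash of the blocks before it, and records the result in a hash table so a new request can reuse the cached blocks of any matching prefix (e.g. a shared system prompt). The snippet below is only a minimal sketch of that idea with assumed names and an assumed block size, not the actual PyTorch engine code:

```python
# Illustrative sketch of hash-table based prefix caching; all names and the
# block size are hypothetical, this is NOT the lmdeploy implementation.
from typing import Dict, List, Tuple

BLOCK_SIZE = 64  # tokens per KV-cache block (assumed value)


class PrefixCache:
    """Map a hash of (parent hash, block tokens) to a physical KV block id."""

    def __init__(self) -> None:
        self._table: Dict[int, int] = {}  # block hash -> physical block id
        self._next_block = 0

    @staticmethod
    def _block_hash(parent: int, tokens: Tuple[int, ...]) -> int:
        # Chaining the parent hash makes the key depend on the whole prefix,
        # not just this block's tokens.
        return hash((parent, tokens))

    def match(self, token_ids: List[int]) -> List[int]:
        """Return physical block ids for the longest cached prefix of `token_ids`."""
        matched, parent = [], 0
        for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
            key = self._block_hash(parent, tuple(token_ids[start:start + BLOCK_SIZE]))
            if key not in self._table:
                break
            matched.append(self._table[key])
            parent = key
        return matched

    def insert(self, token_ids: List[int]) -> None:
        """Register every full block of a processed prompt for future reuse."""
        parent = 0
        for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
            key = self._block_hash(parent, tuple(token_ids[start:start + BLOCK_SIZE]))
            if key not in self._table:
                self._table[key] = self._next_block
                self._next_block += 1
            parent = key
```

Chaining the parent hash into each key is what keeps two prompts from sharing a block unless their entire prefixes match.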

@zhyncs @ispobock

grimoire · Apr 12 '24 07:04

So productive! We will review ASAP.

zhyncs · Apr 12 '24 07:04

Hi @grimoire, could you provide the performance benchmark and evaluation results? Ref: https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2046702194

zhyncs · Apr 12 '24 07:04

LGTM

ispobock · Apr 15 '24 05:04

@zhyncs

Setup: llama13b + 128 concurrency + 3000 prompts; prompt = SYSTEM_PROMPT + prompt

w/o caching

concurrency: 128
elapsed_time: 576.054s

number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 1259.459 token/s
token throughput (prompt + completion token): 2575.934 token/s
RPS (request per second): 5.208 req/s
RPM (request per minute): 312.471 req/min

with prefix caching

concurrency: 128
elapsed_time: 531.635s

number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 1364.688 token/s
token throughput (prompt + completion token): 2791.155 token/s
RPS (request per second): 5.643 req/s
RPM (request per minute): 338.578 req/min
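
For reference, enabling prefix caching here cuts the elapsed time from 576.054 s to 531.635 s, i.e. roughly an 8% higher throughput (5.643 / 5.208 ≈ 1.08).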

grimoire · Apr 18 '24 03:04

@zhulinJulia24 please perform an evaluation test on the following models:

llama-2-7b, internlm-7b, internlm2-7b, internlm2-20b, qwen-7b, qwen1.5-7b

Datasets should include the following (see the config sketch after the list):

from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
from .datasets.SuperGLUE_WSC.SuperGLUE_WSC_gen_7902a7 import WSC_datasets
from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
from .datasets.race.race_gen_69ee4f import race_datasets
from .datasets.crowspairs.crowspairs_gen_381af0 import crowspairs_datasets
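
For reference, these imports are normally collected into a single `datasets` list inside an OpenCompass config. The sketch below shows one common way to do that, assuming the standard OpenCompass `configs/` layout; the file name and aggregation line are illustrative, not part of this PR:

```python
# eval_prefix_caching.py -- hypothetical OpenCompass config file name.
from mmengine.config import read_base

with read_base():
    # The eight dataset imports listed above go here, e.g.:
    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
    # ... remaining *_datasets imports ...

# Concatenate every imported *_datasets list into the single `datasets`
# variable that OpenCompass reads from a config.
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
```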

lvhan028 · Apr 18 '24 03:04

> [quotes @grimoire's benchmark results from the comment above]

These results are slightly different from the ones obtained previously with https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py, as shown in https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2046702194, but the difference is reasonable.

zhyncs · Apr 18 '24 06:04

LGTM

zhyncs · Apr 18 '24 06:04

ref https://github.com/InternLM/lmdeploy/pull/1450/files#r1570029820

zhyncs · Apr 18 '24 06:04

@zhyncs I made some mistakes in the benchmark above.

After updating this line: https://github.com/InternLM/lmdeploy/blob/1f72b8f33821051dafa35502f1efc2a60d2440c6/benchmark/profile_restful_api.py#L35

New results:

w/o caching

concurrency: 128
elapsed_time: 548.647s

number of prompt tokens: 1007244
number of completion tokens: 722476
token throughput (completion token): 1316.833 token/s
token throughput (prompt + completion token): 3152.704 token/s
RPS (request per second): 5.468 req/s
RPM (request per minute): 328.080 req/min

with prefix caching

concurrency: 128
elapsed_time: 507.408s

number of prompt tokens: 1007244
number of completion tokens: 722476
token throughput (completion token): 1423.856 token/s
token throughput (prompt + completion token): 3408.932 token/s
RPS (request per second): 5.912 req/s
RPM (request per minute): 354.744 req/min
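
With the corrected prompt accounting, the gain from prefix caching is similar: elapsed time drops from 548.647 s to 507.408 s, again roughly an 8% improvement (5.912 / 5.468 ≈ 1.08).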

grimoire · Apr 18 '24 06:04

| dataset | version | metric | mode | internlm-chat-7b-turbomind | internlm-chat-7b-pytorch | llama-2-7b-chat-turbomind | llama-2-7b-chat-pytorch | internlm2-chat-7b-turbomind | internlm2-chat-7b-pytorch | internlm2-chat-7b-hf | internlm2-chat-20b-turbomind | internlm2-chat-20b-pytorch | qwen-7b-chat-turbomind | qwen1.5-7b-chat-pytorch | qwen1.5-7b-chat-hf | qwen-7b-chat-hf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Exam (考试)** | | | | | | | | | | | | | | | | |
| ceval | - | naive_average | gen | 53.05 | 54.04 | 28.44 | 28.51 | 61.38 | 58.09 | 61.86 | 63.58 | 63.36 | 59.36 | 70.68 | 70.67 | - |
| mmlu | - | naive_average | gen | - | 52.88 | 35.32 | 35.41 | 63.59 | 57.77 | 56.15 | 67.09 | 59.28 | 57.51 | 61.38 | 61.48 | - |
| **Language (语言)** | | | | | | | | | | | | | | | | |
| WiC | d06864 | accuracy | gen | 52.04 | 52.19 | 0 | 0 | 60.19 | 57.84 | 60.82 | 59.25 | 60.5 | 52.66 | 63.17 | 51.1 | - |
| WSC | 7902a7 | accuracy | gen | 60.58 | 60.58 | 0 | 0 | 68.27 | 55.77 | 65.38 | 50 | 49.04 | 32.69 | 41.35 | 37.5 | - |
| **Knowledge (知识)** | | | | | | | | | | | | | | | | |
| triviaqa | 2121ce | score | gen | 37.84 | 37.77 | 56.09 | 56.11 | 58.48 | 55.92 | 58.23 | 64.07 | 63.96 | 54.37 | 44.49 | 44.76 | - |
| **Reasoning (推理)** | | | | | | | | | | | | | | | | |
| gsm8k | 1d7fe4 | accuracy | gen | 34.8 | 34.57 | 28.2 | 27.98 | 71.57 | 37.98 | 45.11 | 75.36 | 68.61 | 55.27 | 48.67 | 55.5 | - |
| **Understanding (理解)** | | | | | | | | | | | | | | | | |
| race-middle | 9a54b6 | accuracy | gen | 83.43 | 83.64 | 41.57 | 41.64 | 89.97 | 72.08 | 80.99 | 91.64 | 88.37 | 83.5 | 87.53 | 87.33 | - |
| race-high | 9a54b6 | accuracy | gen | 78.82 | 78.79 | 39.62 | 39.62 | 85.59 | 72.53 | 78.82 | 87.94 | 84.59 | 77.1 | 82.68 | 82.53 | - |

zhulinJulia24 · Apr 21 '24 03:04

Hi @zhulinJulia24, the pr_test failed on the RESTful API. Is this failure caused by this PR?

RunningLeon · Apr 29 '24 13:04

We will merge it after v0.4.1 is released on May 8.

lvhan028 · Apr 30 '24 01:04

Hi @grimoire, it looks like there is a bug in this PR (error shown in the attached screenshot):

lmdeploy serve api_server \
    /path/to/Qwen \
    --server-port 23333 \
    --backend pytorch \
    --cache-max-entry-count 0.95 \
    --enable-prefix-caching \
    --max-batch-size 128 --log-level DEBUG --tp 1

jjjjohnson · May 08 '24 06:05

I am doing a code review and trying to solve it.

jjjjohnson · May 08 '24 06:05