lmdeploy
[Docs] Compared with vLLM, lmdeploy has lower throughput. Is there any config wrong?
📚 The doc issue
The prompt used in both cases is:
Here is a conversation(date:2024-04-13) where the assistant reminds the user as their product is about to expire:
'''
assistant: 哎,你好,请问你是张三吗?
user: 是的。
assistant: 您好,我这边是ABC公司的,您之前在我们这边申请XX产品的试用,您还记得吗?产品的试用在明天就到期了,我这边来提醒您一下。
user: 呃,什么
'''
First, please extract the value of these keys: ["user_name","company_name", "product_name", "due_date"]. After that, validate the extracted values according to the dialogue. If no value mentioned, set the value to empty string. Finally, format the number and date, and then output the result in json.
Typo: the unit shown in both screenshots should be token/ms.
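For reference, a plausible expected output for this prompt (an illustration inferred from the dialogue, not an actual model response; the due_date follows from the conversation date 2024-04-13 plus "明天"/tomorrow):
{
    "user_name": "张三",
    "company_name": "ABC公司",
    "product_name": "XX产品",
    "due_date": "2024-04-14"
}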
vLLM
import torch
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_name = "Meta-Llama-3-8B-Instruct"
model_path = f"./{model_name}"
tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model_path, dtype="bfloat16")
# warm up
llm.generate("Hello, How are you?"*50)
# generate
prompt = """Here is a conversation(date:2024-04-13) where the assistant reminds the user as their product is about to expire:
'''
assistant: 哎,你好,请问你是张三吗?
user: 是的。
assistant: 您好,我这边是ABC公司的,您之前在我们这边申请XX产品的试用,您还记得吗?产品的试用在明天就到期了,我这边来提醒您一下。
user: 呃,什么
'''
First, please extract the value of these keys:`["user_name","company_name", "product_name", "due_date"]`.
After that, validate the extracted values according to the dialogue.
If no value mentioned, set the value to empty string.
Finally, format the number and date, and then output the result in json."""
messages = [ {"role": "user", "content": prompt} ]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
sampling_params = SamplingParams(top_p=0.7, top_k=20, temperature=0.01, max_tokens=100)
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
vllm_outputs = llm.generate([f"{formatted_prompt}{i}" for i in range(100)], sampling_params=sampling_params)
end_event.record()
torch.cuda.synchronize()  # wait for all CUDA operations to finish
cost_time = start_event.elapsed_time(end_event)  # elapsed time in milliseconds
total_len = sum(len(tokenizer(llm_output.outputs[0].text)["input_ids"]) for llm_output in vllm_outputs)
print(total_len)
print(f"Runtime: {cost_time} ms, {total_len/cost_time} token/ms.")
results:
- bs=20 (20 prompts): cost 3428 ms, max_gpu_utl = 93%
- bs=100: cost 9135 ms, max_gpu_utl = 98%
screenshot at bs=100:
LMDeploy
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline
import torch
from transformers import AutoTokenizer
# load model
model_name = "Meta-Llama-3-8B-Instruct"
model_path = f"./{model_name}"
tokenizer = AutoTokenizer.from_pretrained(model_path)  # used later to count generated tokens
backend_config = TurbomindEngineConfig(model_format="hf", quant_policy=8)
pipe = pipeline(model_path,
                backend_config=backend_config)
# warm up
pipe("Hello, How are you?"*50)
# generate
prompt = """Here is a conversation(date:2024-04-13) where the assistant reminds the user to make a repayment as their loan is about to expire:
'''
assistant: 哎,你好,请问你是张三吗?
user: 是的。
assistant: 您好,我这边是ABC公司的,您之前在我们这边申请XX产品的试用,您还记得吗?产品的试用在明天就到期了,我这边来提醒您一下。
user: 呃,什么
'''
First, please extract the value of these keys:`["user_name","company_name", "product_name", "due_date"]`.
After that, validate the extracted values according to the dialogue.
If no value mentioned, set the value to empty string.
Finally, format the number and date, and then output the result in json."""
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
gen_config = GenerationConfig(top_p=0.7,
                              top_k=20,
                              temperature=0.01,
                              max_new_tokens=100)
resps = pipe([f"{prompt}{i}" for i in range(100)], gen_config=gen_config)
end_event.record()
torch.cuda.synchronize()  # wait for all CUDA operations to finish
cost_time = start_event.elapsed_time(end_event)  # elapsed time in milliseconds
total_len = sum(len(tokenizer(resp.text)["input_ids"]) for resp in resps)
print(total_len)
print(f"Runtime: {cost_time} ms, {total_len/cost_time} token/ms.")
- bs=20 (20 prompts): cost 5278 ms, max_gpu_utl = 87.5%
- bs=100: cost 25142 ms, max_gpu_utl = 99%
screenshot at bs=100:
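As a sanity check on the measurement itself, a wall-clock variant is sketched below (an illustration only; it reuses the pipe, prompt, and gen_config objects created above and relies on the batch call being blocking, which sidesteps the CUDA-event record/synchronize ordering):
import time

prompts = [f"{prompt}{i}" for i in range(100)]
t0 = time.perf_counter()                      # wall-clock start
resps = pipe(prompts, gen_config=gen_config)  # blocking batch call
elapsed_s = time.perf_counter() - t0          # seconds
n_tokens = sum(resp.generate_token_len for resp in resps)
print(f"{n_tokens} tokens in {elapsed_s:.2f} s -> {n_tokens/elapsed_s:.1f} token/s")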
Suggest a potential alternative/fix
No response
I haven't found a 3090 yet, so I checked it on an A100 (80G):
lmdeploy runtime: 23355.064453125 ms, 4.3245438351374705 token/s.
vllm runtime: 36341.265625 ms, 2.7516928285295514 token/s.
I set max_batch_size to 256 for lmdeploy and the number of prompts to 1000.
**The larger the number of prompts, the better lmdeploy performs.**
Please enlarge the number of prompts on your side. I'll get back to you once I finish testing with a 3090.
The code is attached below.
import fire
import torch
prompt = """Here is a conversation(date:2024-04-13) where the assistant reminds the user as their product is about to expire:
'''
assistant: 哎,你好,请问你是张三吗?
user: 是的。
assistant: 您好,我这边是ABC公司的,您之前在我们这边申请XX产品的试用,您还记得吗?产品的试用在明天就到期了,我这边来提醒您一下。
user: 呃,什么
'''
First, please extract the value of these keys:`["user_name","company_name", "product_name", "due_date"]`.
After that, validate the extracted values according to the dialogue.
If no value mentioned, set the value to empty string.
Finally, format the number and date, and then output the result in json."""
def benchmark_lmdeploy(model_path, max_batch_size):
    from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
    pipe = pipeline(model_path=model_path,
                    backend_config=TurbomindEngineConfig(
                        max_batch_size=max_batch_size))
    gen_config = GenerationConfig(top_p=0.7,
                                  top_k=20,
                                  temperature=0.01,
                                  max_new_tokens=100)
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    resps = pipe([f"{prompt}{i}" for i in range(1000)], gen_config=gen_config)
    end_event.record()
    torch.cuda.synchronize()  # wait for all CUDA operations to finish
    cost_time = start_event.elapsed_time(end_event)  # elapsed time in milliseconds
    n_output_tokens = 0
    for resp in resps:
        # print(f'input_token_len: {resp.input_token_len}, '
        #       f'session_id: {resp.session_id}, '
        #       f'generate_token_len {resp.generate_token_len}')
        n_output_tokens += resp.generate_token_len
    print(f"lmdeploy runtime: {cost_time} ms, {n_output_tokens/cost_time} token/s.")
def benchmark_vllm(model_path):
    from vllm import LLM, SamplingParams
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    llm = LLM(model_path, dtype="bfloat16")
    # warm up
    llm.generate("Hello, How are you?"*50)
    messages = [{"role": "user", "content": prompt}]
    formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
    sampling_params = SamplingParams(top_p=0.7, top_k=20, temperature=0.01, max_tokens=100)
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    vllm_outputs = llm.generate([f"{formatted_prompt}{i}" for i in range(1000)],
                                sampling_params=sampling_params)
    end_event.record()
    torch.cuda.synchronize()  # wait for all CUDA operations to finish
    cost_time = start_event.elapsed_time(end_event)  # elapsed time in milliseconds
    n_output_tokens = 0
    for output in vllm_outputs:
        # print(f'input_token_len: {len(output.prompt_token_ids)}, '
        #       f'generate_token_len {len(output.outputs[0].token_ids)}')
        n_output_tokens += len(output.outputs[0].token_ids)
    print(f"vllm runtime: {cost_time} ms, {n_output_tokens/cost_time} token/s.")
def main(model_path: str = '/mnt/140/llama3/Meta-Llama-3-8B-Instruct',
         backend: str = 'lmdeploy',
         max_batch_size: int = 256):
    if backend == 'lmdeploy':
        benchmark_lmdeploy(model_path, max_batch_size)
    elif backend == 'vllm':
        benchmark_vllm(model_path)
    else:
        raise ValueError(f'unknown backend {backend}')

if __name__ == "__main__":
    fire.Fire(main)
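If the script is saved as test.py (the filename used in the commands later in this thread), it can be run through fire, e.g. python test.py --model-path /path/to/Meta-Llama-3-8B-Instruct --backend lmdeploy --max-batch-size 256, or with --backend vllm for the vLLM side.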
pipeline is not appropriate for benchmarking lmdeploy.
Try benchmark/profile_throughput.py instead
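For reference, that script takes a prompt dataset and the model path, roughly python benchmark/profile_throughput.py <dataset.json> <model_path> plus options such as the request concurrency; this invocation is only a sketch, so verify the exact arguments with python benchmark/profile_throughput.py --help for your lmdeploy version.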
The goal is not to run the benchmark script, but to measure the throughput of regular usage, such as the pipeline or the API (not tested). I tested your script (bs=1000) on an RTX 3090 24GB with the following CUDA version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0
and torch==2.1.2+cu121. The result is similar to yesterday's:
LMDeploy: (result screenshot)
vLLM: (result screenshot)
Maybe lmdeploy is optimized for the A100 or GPUs with larger VRAM?
Let me set up the 3090 environment and get back to you if there is any update.
I tried several max_batch_size settings on the 3090. The results are shown below.
root@rg-X299X-AORUS-MASTER:/workspace/lmdeploy# python test.py --model-path workspace/Meta-Llama-3-8B-Instruct/ --max-batch-size 32
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING] gemm_config.in is not found; using default GEMM algo
lmdeploy runtime: 110810.5234375 ms, 0.9114657784011517 token/s.
root@rg-X299X-AORUS-MASTER:/workspace/lmdeploy# python test.py --model-path workspace/Meta-Llama-3-8B-Instruct/ --max-batch-size 64
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING] gemm_config.in is not found; using default GEMM algo
lmdeploy runtime: 91667.5546875 ms, 1.1018075080579475 token/s.
root@rg-X299X-AORUS-MASTER:/workspace/lmdeploy# python test.py --model-path workspace/Meta-Llama-3-8B-Instruct/ --max-batch-size 128
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING] gemm_config.in is not found; using default GEMM algo
lmdeploy runtime: 73957.125 ms, 1.3656561149449766 token/s.
BTW, GPU utilization is at 99-100% the whole time. Memory usage for each setting: bs=32: 22956M; bs=64: 23340M; bs=128: 23724M.
The implementation of batch_infer in the pipeline is not optimized.
It divides the prompt list into batches and processes the batches sequentially.
Between two adjacent batches, the GPU is not fully occupied.
We are going to optimize the API. Stay tuned.
PR #1507 is dealing with it.
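A simplified sketch of the behavior described above (an illustration of sequential chunking, not the actual lmdeploy source; generate_batch stands in for the engine call):
# illustrative pseudo-implementation of the unoptimized batch_infer behavior
def batch_infer_sequential(prompts, max_batch_size, generate_batch):
    outputs = []
    for i in range(0, len(prompts), max_batch_size):
        chunk = prompts[i:i + max_batch_size]
        # every chunk runs to completion before the next one is scheduled, so once the
        # short requests in a chunk finish, the GPU idles until the longest one is done
        outputs.extend(generate_batch(chunk))
    return outputs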
The inference pipeline is optimized in v0.4.1, which was released today. You may try this version for better inference performance of the pipeline.
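For example, assuming a pip-based installation from PyPI, upgrading should be as simple as pip install -U lmdeploy==0.4.1.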