
[Docs] Compared with vLLM, lmdeploy has lower throughput. Is there any config wrong?

Open Green-li opened this issue 1 year ago • 7 comments

📚 The doc issue

The prompt used in both tests is:

Here is a conversation(date:2024-04-13) where the assistant reminds the user as their product is about to expire: ''' assistant: 哎,你好,请问你是张三吗? user: 是的。 assistant: 您好,我这边是ABC公司的,您之前在我们这边申请XX产品的试用,您还记得吗?产品的试用在明天就到期了,我这边来提醒您一下。 user: 呃,什么 ''' First, please extract the value of these keys:["user_name","company_name", "product_name", "due_date"]. After that, validate the extracted values according to the dialogue. If no value mentioned, set the value to empty string. Finally, format the number and date, and then output the result in json.

Typo: the unit in both of the screenshots is actually token/ms, not char/s as printed.
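For context, a minimal sketch of where that unit comes from: torch.cuda.Event.elapsed_time returns milliseconds, so dividing a token count by it yields tokens per millisecond. The values below are hypothetical and only illustrate the conversion.

# Hypothetical values, only to illustrate the unit conversion.
total_len = 9000     # total generated tokens (hypothetical)
cost_time = 9135.0   # elapsed time in ms, as returned by Event.elapsed_time()
tokens_per_ms = total_len / cost_time          # what the scripts below actually print
tokens_per_s = total_len * 1000.0 / cost_time  # the conventional unit
print(f"{tokens_per_ms:.3f} token/ms = {tokens_per_s:.1f} token/s")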

vLLM

import torch
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = "Meta-Llama-3-8B-Instruct"
model_path = f"./{model_name}"
tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model_path, dtype="bfloat16")
# warm up
llm.generate("Hello, How are you?"*50)

# generate 
prompt = """Here is a conversation(date:2024-04-13) where the assistant reminds the user as their product is about to expire:
'''
assistant: 哎,你好,请问你是张三吗?
user: 是的。
assistant: 您好,我这边是ABC公司的,您之前在我们这边申请XX产品的试用,您还记得吗?产品的试用在明天就到期了,我这边来提醒您一下。
user: 呃,什么
'''
First, please extract the value of these keys:`["user_name","company_name", "product_name", "due_date"]`.
After that, validate the extracted values according to the dialogue.
If no value mentioned, set the value to empty string.
Finally, format the number and date, and then output the result in json."""

messages = [{"role": "user", "content": prompt}]
formated_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
sampling_params = SamplingParams(top_p=0.7, top_k=20, temperature=0.01, max_tokens=100)
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
vllm_outputs = llm.generate([f"{formated_prompt}{i}" for i in range(100)], sampling_params=sampling_params)
total_len = sum(len(tokenizer(llm_output.outputs[0].text)["input_ids"]) for llm_output in vllm_outputs)
print(total_len)
torch.cuda.synchronize()  # wait for all CUDA operations to finish
end_event.record()
cost_time = start_event.elapsed_time(end_event)  # milliseconds
print(f"Runtime: {cost_time} ms, {total_len/cost_time} char/s.")

Results:

  • bs=20 (20 prompts): cost 3428 ms, max_gpu_utl=93%
  • bs=100: cost 9135 ms, max_gpu_utl=98%

screenshot on bs=100: vllm

LMDeploy

from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline
import torch
from transformers import AutoTokenizer

# load model
model_name = "Meta-Llama-3-8B-Instruct"
model_path = f"./{model_name}"
tokenizer = AutoTokenizer.from_pretrained(model_path)  # needed below to count generated tokens
backend_config = TurbomindEngineConfig(model_format="hf", quant_policy=8)
pipe = pipeline(model_path,
                backend_config=backend_config)
# warm up
pipe("Hello, How are you?"*50)

# generate 
prompt = """Here is a conversation(date:2024-04-13) where the assistant reminds the user to make a repayment as their loan is about to expire:
'''
assistant: 哎,你好,请问你是张三吗?
user: 是的。
assistant: 您好,我这边是ABC公司的,您之前在我们这边申请XX产品的试用,您还记得吗?产品的试用在明天就到期了,我这边来提醒您一下。
user: 呃,什么
'''
First, please extract the value of these keys:`["user_name","company_name", "product_name", "due_date"]`.
After that, validate the extracted values according to the dialogue.
If no value mentioned, set the value to empty string.
Finally, format the number and date, and then output the result in json."""
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
gen_config = GenerationConfig(top_p=0.7,
                              top_k=20,
                              temperature=0.01,
                              max_new_tokens=100)
resps = pipe([f"{prompt}{i}" for i in range(100)], gen_config=gen_config)
total_len = sum(len(tokenizer(resp.text)["input_ids"]) for resp in resps)
print(total_len)
torch.cuda.synchronize()  # wait for all CUDA operations to finish
end_event.record()
cost_time = start_event.elapsed_time(end_event)  # milliseconds
print(f"Runtime: {cost_time} ms, {total_len/cost_time} char/s.")

Results:

  • bs=20 (20 prompts): cost 5278 ms, max_gpu_utl=87.5%
  • bs=100: cost 25142 ms, max_gpu_utl=99%

screenshot on bs=100: lmdp

Suggest a potential alternative/fix

No response

Green-li avatar Apr 25 '24 06:04 Green-li

I haven't found a 3090 yet, so I checked it with an A100 (80G).

lmdeploy runtime: 23355.064453125 ms, 4.3245438351374705 token/s.
vllm runtime: 36341.265625 ms, 2.7516928285295514 token/s.

I set max_batch_size to 256 for lmdeploy, and the number of prompts is set to 1000.

**The larger the number of prompts, the faster lmdeploy performs.**

Please enlarge the number of prompts on your side. I'll get back to you once I finish testing with a 3090.

The code is attached below.

import fire
import torch

prompt = """Here is a conversation(date:2024-04-13) where the assistant reminds the user as their product is about to expire:
'''
assistant: 哎,你好,请问你是张三吗?
user: 是的。
assistant: 您好,我这边是ABC公司的,您之前在我们这边申请XX产品的试用,您还记得吗?产品的试用在明天就到期了,我这边来提醒您一下。
user: 呃,什么
'''
First, please extract the value of these keys:`["user_name","company_name", "product_name", "due_date"]`.
After that, validate the extracted values according to the dialogue.
If no value mentioned, set the value to empty string.
Finally, format the number and date, and then output the result in json."""

def benchmark_lmdeploy(model_path,
                       max_batch_size):
    from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
    pipe = pipeline(model_path=model_path,
                    backend_config=TurbomindEngineConfig(
                        max_batch_size=max_batch_size
                    ))
    gen_config = GenerationConfig(top_p=0.7,
                              top_k=20,
                              temperature=0.01,
                              max_new_tokens=100)
    
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()

    resps = pipe([f"{prompt}{i}" for i in range(1000)], gen_config=gen_config)
    n_output_tokens = 0
    for resp in resps:
        # print(f'input_token_len: {resp.input_token_len}, '
        #       f'session_id: {resp.session_id}, '
        #       f'generate_token_len {resp.generate_token_len}')
        n_output_tokens += resp.generate_token_len
    
    torch.cuda.synchronize()  # wait for all CUDA operations to finish
    end_event.record()
    cost_time = start_event.elapsed_time(end_event)  # milliseconds
    print(f"lmdeploy runtime: {cost_time} ms, {n_output_tokens/cost_time} token/s.")


def benchmark_vllm(model_path):
    from vllm import LLM, SamplingParams
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    llm = LLM(model_path, dtype="bfloat16")
    # warm up
    llm.generate("Hello, How are you?"*50)

    messages = [ {"role": "user", "content": prompt} ]
    formated_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
    sampling_params = SamplingParams(top_p=0.7, top_k=20, temperature=0.01, max_tokens=100)
    
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()

    vllm_outputs = llm.generate([f"{formated_prompt}{i}" for i in range(1000)], sampling_params=sampling_params)
    n_output_tokens = 0
    for output in vllm_outputs:
        # print(f'input_token_len: {len(output.prompt_token_ids)}, '
        #       f'generate_token_len {len(output.outputs[0].token_ids)}')
        n_output_tokens += len(output.outputs[0].token_ids)
    
    torch.cuda.synchronize()  # wait for all CUDA operations to finish
    end_event.record()
    cost_time = start_event.elapsed_time(end_event)  # milliseconds
    print(f"vllm runtime: {cost_time} ms, {n_output_tokens/cost_time} token/s.")


def main(model_path: str = '/mnt/140/llama3/Meta-Llama-3-8B-Instruct',
         backend: str = 'lmdeploy',
         max_batch_size: int = 256):
    if backend == 'lmdeploy':
        benchmark_lmdeploy(model_path,
                           max_batch_size)
    elif backend == 'vllm':
        benchmark_vllm(model_path)
    else:
        raise ValueError(f'unknown backend {backend}')

if __name__ == "__main__":
    fire.Fire(main)

lvhan028 avatar Apr 25 '24 13:04 lvhan028

pipeline is not appropriate for benchmarking lmdeploy. Try benchmark/profile_throughput.py instead
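For anyone following along, a rough sketch of how that script is typically invoked; the dataset file name and the --num-prompts flag here are assumptions from memory and may differ between lmdeploy versions, so check --help for the real arguments first.

# confirm the actual arguments for your lmdeploy version
python benchmark/profile_throughput.py --help
# hypothetical invocation: ShareGPT-style dataset plus the local model path
python benchmark/profile_throughput.py ShareGPT_V3_unfiltered_cleaned_split.json ./Meta-Llama-3-8B-Instruct --num-prompts 1000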

lvhan028 avatar Apr 25 '24 13:04 lvhan028

> pipeline is not appropriate for benchmarking lmdeploy. Try benchmark/profile_throughput.py instead

The goal is not to run the benchmark script, but to measure the throughput of regular usage, like the pipeline or the API (not tested). I tested your script (bs=1000) on an RTX 3090 24GB with the following CUDA version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

and torch==2.1.2+cu121. The result is similar to yesterday's:

LMDeploy: (screenshot)

vLLM: (screenshot)

Maybe lmdeploy is optimized for the A100 or GPUs with larger VRAM?

Green-li avatar Apr 26 '24 02:04 Green-li

Let me set up the 3090 environment and get back to you if there is any update.

lvhan028 avatar Apr 26 '24 02:04 lvhan028

I tried several max_batch_size settings on the 3090. The results are shown below.

root@rg-X299X-AORUS-MASTER:/workspace/lmdeploy# python test.py --model-path workspace/Meta-Llama-3-8B-Instruct/ --max-batch-size 32

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

[WARNING] gemm_config.in is not found; using default GEMM algo                                                                           

lmdeploy runtime: 110810.5234375 ms, 0.9114657784011517 token/s.

root@rg-X299X-AORUS-MASTER:/workspace/lmdeploy# python test.py --model-path workspace/Meta-Llama-3-8B-Instruct/ --max-batch-size 64

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

[WARNING] gemm_config.in is not found; using default GEMM algo                                                                           

lmdeploy runtime: 91667.5546875 ms, 1.1018075080579475 token/s.

root@rg-X299X-AORUS-MASTER:/workspace/lmdeploy# python test.py --model-path workspace/Meta-Llama-3-8B-Instruct/ --max-batch-size 128

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

[WARNING] gemm_config.in is not found; using default GEMM algo                                                                           

lmdeploy runtime: 73957.125 ms, 1.3656561149449766 token/s.

BTW, GPU utilization is 99-100% all the time. The memory usage for each setting is: bs=32, 22956 MiB; bs=64, 23340 MiB; bs=128, 23724 MiB.

lvhan028 avatar Apr 26 '24 04:04 lvhan028

The implementation of the pipeline's batch_infer is not optimized. It divides the prompt list into batches and processes the batches sequentially. Between two adjacent batches, the GPU is not fully occupied. We are going to optimize the API; stay tuned.
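The dip can be pictured with a short sketch (for illustration only; this is not the actual batch_infer source): the next chunk only starts after the slowest request of the previous chunk finishes, so the GPU is under-occupied around every chunk boundary.

# Illustrative sketch only -- not lmdeploy's real batch_infer implementation.
def chunked_infer(pipe, prompts, gen_config, max_batch_size):
    outputs = []
    for i in range(0, len(prompts), max_batch_size):
        chunk = prompts[i:i + max_batch_size]
        # The next chunk cannot start until the slowest request in this chunk
        # finishes, so shorter requests leave the GPU partially idle near the
        # boundary instead of being backfilled with new prompts.
        outputs.extend(pipe(chunk, gen_config=gen_config))
    return outputs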

lvhan028 avatar Apr 26 '24 09:04 lvhan028

PR #1507 is dealing with it

lvhan028 avatar Apr 26 '24 14:04 lvhan028

The inference pipeline is optimized in v0.4.1, which was released today. You may try this version for better inference performance of the pipeline.
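For anyone who hit the same slowdown, upgrading is the usual pip flow (assuming a pip-based install):

pip install lmdeploy==0.4.1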

lvhan028 avatar May 07 '24 11:05 lvhan028