[Bug] relatively slow speed after deploy InternVL2-26B
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I have searched the related issues. Is the speed relatively slow because lmdeploy does not optimize the vision model, which slows down the whole request? I would like to know whether lmdeploy can process images for the vision model in parallel or in batches.
Reproduction
# start server
lmdeploy serve api_server --cache-max-entry-count 0.6 InternVL2-26B/ --server-port 23333
# 4 concurrency 100 prompts without image url
python profile_restful_api_image.py http://127.0.0.1:23333 InternVL2-26B/ HC3-Chinese/all.jsonl --stream_output true --concurrency 4 --num_prompts 100
# res
--------------------------------------------------
concurrency: 4
elapsed_time: 69.157s
first_token latency(min, max, ave): 0.039s, 0.162s, 0.063s
number of prompt tokens: 1787
number of completion tokens: 10001
token throughput (completion token): 144.612 token/s
token throughput (prompt + completion token): 170.452 token/s
RPS (request per second): 1.446 req/s
RPM (request per minute): 86.759 req/min
--------------------------------------------------
# 4 concurrency 100 prompts with image url
python profile_restful_api_image.py http://127.0.0.1:23333 InternVL2-26B/ HC3-Chinese/all.jsonl --stream_output true --concurrency 4 --num_prompts 100 --use_image true
# res
--------------------------------------------------
concurrency: 4
elapsed_time: 160.245s
first_token latency(min, max, ave): 1.065s, 3.772s, 1.390s
number of prompt tokens: 1787
number of completion tokens: 9932
token throughput (completion token): 61.980 token/s
token throughput (prompt + completion token): 73.132 token/s
RPS (request per second): 0.624 req/s
RPM (request per minute): 37.443 req/min
--------------------------------------------------
Environment
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.5, V12.5.40
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.17.2+cu121
LMDeploy: 0.5.1+
transformers: 4.37.2
gradio: 4.39.0
fastapi: 0.110.0
pydantic: 2.6.3
triton: 2.2.0
NVIDIA Topology:
GPU0 NIC0 NIC1 NIC2 NIC3 NIC4 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE PXB SYS SYS NODE 0-23,48-71 0 N/A
NIC0 NODE X NODE SYS SYS NODE
NIC1 PXB NODE X SYS SYS NODE
NIC2 SYS SYS SYS X NODE SYS
NIC3 SYS SYS SYS NODE X SYS
NIC4 NODE NODE NODE SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_2
NIC1: mlx5_3
NIC2: mlx5_4
NIC3: mlx5_5
NIC4: mlx5_bond_0
Error traceback
No response
You can set --vision-max-batch-size when you start the server. But there is only a small probability of forming a batch unless one request contains multiple images or the interval between two requests is very small. Or maybe we could wait a little longer here.
With the PyTorch engine, the time increases almost linearly as the batch size grows, especially when the ViT model is large.
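For example, a batch does form when a single request carries several images. Here is a minimal sketch using the openai client (the image URLs, the /v1 suffix and the server address are placeholders/assumptions, not taken from this issue):

# Minimal sketch: several images in one request, so the ImageEncoder sees them
# as a single batch. Server address, '/v1' suffix and image URLs are assumptions.
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://127.0.0.1:23333/v1')
model_name = client.models.list().data[0].id

image_urls = [
    'https://example.com/image_1.jpg',  # placeholder
    'https://example.com/image_2.jpg',  # placeholder
]
content = [{'type': 'text', 'text': 'Describe these images.'}]
content += [{'type': 'image_url', 'image_url': {'url': u}} for u in image_urls]

response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': content}],
    max_tokens=256,
)
print(response.choices[0].message.content)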
BTW, could you share the profile_restful_api_image.py file?
@irexyc Thank you for your reply. This is profile_restful_api_image.py; I'll try --vision-max-batch-size later.
import csv
import json
import random
import time
from queue import Queue
from threading import Thread
from typing import List, Optional, Tuple
import fire
import numpy as np
from tqdm import tqdm
from transformers import AutoTokenizer
from lmdeploy.serve.openai.api_client import APIClient
def sample_requests(
dataset_path: str,
num_requests: int,
tokenizer: AutoTokenizer,
) -> List[Tuple[str, int, int]]:
# Load the dataset.
dataset = []
for line in open(dataset_path):
dataset.append(json.loads(line))
# Tokenize the prompts and completions.
prompts = [prompt["question"] for prompt in dataset]
prompt_token_ids = tokenizer(prompts).input_ids
completions = [prompt["chatgpt_answers"][0] for prompt in dataset]
completion_token_ids = tokenizer(completions).input_ids
tokenized_dataset = []
for i in range(len(dataset)):
max_tokens = len(completion_token_ids[i])
prompt_len = len(prompt_token_ids[i])
if prompt_len < 4 or max_tokens < 4:
# Prune too short sequences.
continue
if prompt_len > 1024 or prompt_len + max_tokens > 2048:
# Prune too long sequences.
continue
tokenized_dataset.append((prompts[i], prompt_len, max_tokens))
if len(tokenized_dataset) < num_requests:
b = num_requests // len(tokenized_dataset)
tokenized_dataset += tokenized_dataset * b
sampled_requests = random.sample(tokenized_dataset, num_requests)
random.shuffle(sampled_requests)
return sampled_requests
class Engine:
def __init__(self,
server_addr: str,
tokenzier_path: str,
temperature: float = 0.8,
top_p: float = 1.0,
csv: str = '',
api_key: Optional[str] = None,
model_name: Optional[str] = None,
use_image: Optional[bool] = False,
**kwargs):
self.tokenizer = AutoTokenizer.from_pretrained(tokenzier_path,
trust_remote_code=True)
self.server_addr = server_addr
self.temperature = temperature
self.top_p = top_p
self.csv = csv
self.api_key = api_key
self.use_image = use_image
client = APIClient(self.server_addr, api_key=self.api_key)
if model_name is None:
self.model_name = client.available_models[0]
print(f'using model: {self.model_name}\n')
else:
self.model_name = model_name
self.pbar = None
def _inference(self, req_queue: Queue, res_queue: Queue, session_id: int,
stream_output: bool):
stats = []
client = APIClient(self.server_addr, api_key=self.api_key)
for prompt, input_seqlen, max_tokens in iter(
req_queue.get, [None, None, None]):
timestamps = []
timestamps.append(time.perf_counter())
if self.use_image:
messages = [{"role": "user", 'content': [
{
'type': 'text',
'text': prompt,
},
{
'type': 'image_url',
'image_url': {
'url': "https://img1.baidu.com/it/u=4157744492,1349578166&fm=253&fmt=auto&app=120&f=JPEG?w=500&h=750"
},
}
]}]
else:
messages = [{"role": "user", 'content': [
{
'type': 'text',
'text': prompt,
},
]}]
answer = ""
for output in client.chat_completions_v1(
model=self.model_name,
messages=messages,
temperature=self.temperature,
top_p=self.top_p,
n=1,
stream=stream_output,
max_tokens=max_tokens,
session_id=session_id,
ignore_eos=True):
answer += output["choices"][0]["delta"]["content"]
timestamps.append(time.perf_counter())
output_seqlen = len(self.tokenizer(answer).input_ids)
first_token_latency = np.round(timestamps[1] - timestamps[0], 3)
token_latency = np.round(timestamps[-1] - timestamps[0], 3)
# assert output.pop('finish_reason') == 'length', \
# f'Error. session_id({session_id}) request {output_seqlen} ' \
# f'tokens, but `finish_reason` is not `length`'
total_tokens = input_seqlen + output_seqlen
stats.append([
first_token_latency, output_seqlen, output_seqlen,
total_tokens, token_latency
])
self.pbar.update(1)
res_queue.put((session_id, stats))
def process_request(self,
requests,
concurrency: int = 1,
stream_output: bool = False):
res_queue = Queue()
req_queue = Queue()
threads = []
self.pbar = tqdm(total=len(requests))
# feed request to q
for req in requests:
req_queue.put(req)
for i in range(concurrency):
req_queue.put([None, None, None])
start = time.time()
# start threads
for i in range(concurrency):
t = Thread(target=self._inference,
args=(req_queue, res_queue, i, stream_output))
t.start()
threads.append(t)
# wait for finish
for t in threads:
t.join()
elapsed_time = time.time() - start
stats = []
while not res_queue.empty():
session_id, _stats = res_queue.get()
if len(_stats) != 0:
stats.append(np.array(_stats))
stats = np.concatenate(stats).reshape(-1, 5)
first_token_latency_min = np.min(stats[:, 0], axis=0)
first_token_latency_max = np.max(stats[:, 0], axis=0)
first_token_latency_ave = np.mean(stats[:, 0], axis=0)
completion_tokens = np.sum(stats[:, 1], axis=0)
request_output_tokens = np.sum(stats[:, 2], axis=0)
total_tokens = np.sum(stats[:, 3], axis=0)
prompt_tokens = total_tokens - completion_tokens
completion_token_throughput = completion_tokens / elapsed_time
total_token_throughput = total_tokens / elapsed_time
rps = len(requests) / elapsed_time
rpm = rps * 60
        if not (np.abs(stats[:, 1] - stats[:, 2]) <= 1).min():
print(f'Did not generate requested number of tokens. '
f'Request {request_output_tokens:.0f}, '
f'but got {completion_tokens:.0f}')
print(f'\n{"-" * 50}\nconcurrency: {concurrency}\n'
f'elapsed_time: {elapsed_time:.3f}s\n')
if stream_output:
print(f'first_token latency(min, max, ave): '
f'{first_token_latency_min:.3f}s, '
f'{first_token_latency_max:.3f}s, '
f'{first_token_latency_ave:.3f}s\n')
print(
f'number of prompt tokens: {prompt_tokens:.0f}\n'
f'number of completion tokens: {completion_tokens:.0f}\n'
f'token throughput (completion token): {completion_token_throughput:.3f} token/s\n' # noqa
f'token throughput (prompt + completion token): {total_token_throughput:.3f} token/s\n' # noqa
f'RPS (request per second): {rps:.3f} req/s\n'
f'RPM (request per minute): {rpm:.3f} req/min\n'
f'{"-" * 50}\n')
if self.csv:
with open(self.csv, 'w') as csvfile:
writer = csv.writer(csvfile)
writer.writerow([
'batch', 'num_prompts', 'RPS', 'RPM', 'FTL(ave)(s)',
'FTL(min)(s)', 'FTL(max)(s)', 'throughput(out tok/s)',
'throughput(total tok/s)'
])
writer.writerow([
concurrency,
len(requests), f'{rps:.3f}', f'{rpm:.3f}',
f'{first_token_latency_ave:.3f}' if stream_output else '-',
f'{first_token_latency_min:.3f}' if stream_output else '-',
f'{first_token_latency_max:.3f}' if stream_output else '-',
f'{completion_token_throughput:.3f}',
f'{total_token_throughput:.3f}'
])
def main(server_addr: str,
tokenizer_path: str,
dataset: str,
api_key: Optional[str] = None,
model_name: Optional[str] = None,
concurrency: int = 128,
num_prompts: int = 5000,
top_p: float = 1.0,
temperature: float = 1.0,
stream_output: bool = False,
csv: str = './profile_api_server.csv',
seed: int = 0,
use_image: bool = False):
"""Benchmark the request througput of api server.
Args:
server_addr (str): http url of api_server with format http://0.0.0.0:0
tokenizer_path (str): Path to the tokenizer model in localhost
dataset (str): Path to the dataset
concurrency (int, optional): Number of working threads to process the sampled prompts.
Defaults to 128.
num_prompts (int, optional): Number of prompts to process. Defaults to 5000.
top_p (float, optional): the set of most probable tokens with
probabilities that add up to top_p or higher
are kept for generation. Defaults to 1.0.
temperature (float, optional): The value used to modulate the next token probabilities.
Defaults to 1.0.
stream_output (bool, optional): Indicator for streaming output. Defaults to False.
csv (str, optional): The path to save the result.
seed (int, optional): Seed used in sampling prompts from dataset. Defaults to 0.
use_image (bool, optional): whether to add image parameters. Defaults to False.
""" # noqa
if not server_addr.startswith('http://'):
print(f'[WARNING] server_addr of the api_server should '
f'start with "http://", but got "{server_addr}"')
server_addr = 'http://' + server_addr.strip()
random.seed(seed)
engine = Engine(server_addr,
tokenizer_path,
top_p=top_p,
temperature=temperature,
csv=csv,
api_key=api_key,
model_name=model_name,
use_image=use_image)
requests = sample_requests(dataset, num_prompts, engine.tokenizer)
engine.process_request(requests, concurrency, stream_output)
if __name__ == '__main__':
fire.Fire(main)
@irexyc These are my results after adding '--vision-max-batch-size'; it feels the same as without it.
lmdeploy serve api_server --cache-max-entry-count 0.6 /home/notebook/data/personal/W9088934/InternVL2-26B/ --server-port 23333 --vision-max-batch-size 8
python profile_restful_api_image.py http://127.0.0.1:23333 /home/notebook/data/personal/W9088934/InternVL2-26B /home/notebook/data/personal/W9088934/datasets/HC3-Chinese/all.jsonl --stream_output true --concurrency 4 --num_prompts 100 --use_image true
--------------------------------------------------
concurrency: 4
elapsed_time: 157.052s
first_token latency(min, max, ave): 1.028s, 4.047s, 1.282s
number of prompt tokens: 1787
number of completion tokens: 9956
token throughput (completion token): 63.393 token/s
token throughput (prompt + completion token): 74.771 token/s
RPS (request per second): 0.637 req/s
RPM (request per minute): 38.204 req/min
--------------------------------------------------
For this test script, --vision-max-batch-size has almost no effect, since the threads rarely send requests at the same time except for the first ones.
BTW, it's better to use the base64 format of the image, as it removes the image downloading time. You can refer to this.
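For example, a minimal sketch of building such a data URI (the file name is a placeholder; the modified script below does the same thing):

# Minimal sketch: embed a local image as a base64 data URI so the server does
# not spend time downloading it. 'example.png' is a placeholder path.
import base64

with open('example.png', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode('utf-8')
image_item = {'type': 'image_url',
              'image_url': {'url': f'data:image/png;base64,{b64}'}}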
@irexyc Following your comments, I modified the test script and added --log-level INFO when starting the server.
import asyncio
import base64
import csv
import json
import random
import time
# from queue import Queue
from asyncio import Queue
from typing import List, Optional, Tuple
import fire
import numpy as np
from tqdm import tqdm
from transformers import AutoTokenizer
from openai import AsyncOpenAI
with open("example.png", "rb") as f:
bs64_img = f'data:image/jpeg;base64,{base64.b64encode(f.read()).decode("utf-8")}'
def sample_requests(
dataset_path: str,
num_requests: int,
tokenizer: AutoTokenizer,
) -> List[Tuple[str, int, int]]:
# Load the dataset.
dataset = []
for line in open(dataset_path):
dataset.append(json.loads(line))
# Tokenize the prompts and completions.
prompts = [prompt["question"] for prompt in dataset]
prompt_token_ids = tokenizer(prompts).input_ids
completions = [prompt["chatgpt_answers"][0] for prompt in dataset]
completion_token_ids = tokenizer(completions).input_ids
tokenized_dataset = []
for i in range(len(dataset)):
max_tokens = len(completion_token_ids[i])
prompt_len = len(prompt_token_ids[i])
if prompt_len < 4 or max_tokens < 4:
# Prune too short sequences.
continue
if prompt_len > 1024 or prompt_len + max_tokens > 2048:
# Prune too long sequences.
continue
tokenized_dataset.append((prompts[i], prompt_len, max_tokens))
if len(tokenized_dataset) < num_requests:
b = num_requests // len(tokenized_dataset)
tokenized_dataset += tokenized_dataset * b
sampled_requests = random.sample(tokenized_dataset, num_requests)
random.shuffle(sampled_requests)
return sampled_requests
class Engine:
def __init__(self,
server_addr: str,
tokenzier_path: str,
temperature: float = 0.8,
top_p: float = 1.0,
csv: str = '',
api_key: Optional[str] = None,
model_name: Optional[str] = None,
use_image: Optional[bool] = False,
**kwargs):
self.tokenizer = AutoTokenizer.from_pretrained(tokenzier_path,
trust_remote_code=True)
self.server_addr = server_addr
self.temperature = temperature
self.top_p = top_p
self.csv = csv
self.api_key = api_key
self.use_image = use_image
self.pbar = None
self.model_name = model_name
async def set_model_name(self, model_name):
if model_name is None:
client = AsyncOpenAI(api_key='YOUR_API_KEY', base_url=self.server_addr)
models = await client.models.list()
self.model_name = models.data[0].id
print(f'using model: {self.model_name}\n')
else:
self.model_name = model_name
async def _inference(self, data: Tuple[str, int, int], res_queue: Queue, session_id: int,
stream_output: bool):
stats = []
client = AsyncOpenAI(api_key='YOUR_API_KEY', base_url=self.server_addr)
prompt, input_seqlen, max_tokens = data
timestamps = []
timestamps.append(time.perf_counter())
if self.use_image:
messages = [{"role": "user", 'content': [
{
'type': 'text',
'text': prompt,
},
{
'type': 'image_url',
'image_url': {
'url': bs64_img
},
}
]}]
else:
messages = [{"role": "user", 'content': [
{
'type': 'text',
'text': prompt,
},
]}]
response = await client.chat.completions.create(
model=self.model_name,
messages=messages,
temperature=self.temperature,
top_p=self.top_p,
n=1,
stream=stream_output,
max_tokens=max_tokens,
)
answer = ""
async for i in response:
delta = i.choices[0].delta.content
answer += delta
timestamps.append(time.perf_counter())
output_seqlen = len(self.tokenizer(answer).input_ids)
first_token_latency = np.round(timestamps[1] - timestamps[0], 3)
token_latency = np.round(timestamps[-1] - timestamps[0], 3)
total_tokens = input_seqlen + output_seqlen
stats.append([
first_token_latency, output_seqlen, output_seqlen,
total_tokens, token_latency
])
self.pbar.update(1)
await res_queue.put((session_id, stats))
async def process_request(self,
requests,
concurrency: int = 1,
stream_output: bool = False):
res_queue = Queue()
self.pbar = tqdm(total=len(requests))
start = time.time()
workers = []
for i, data in enumerate(requests, start=1):
workers.append(self._inference(data, res_queue, i % concurrency, stream_output))
if i % concurrency == 0:
await asyncio.gather(*workers)
workers.clear()
elapsed_time = time.time() - start
stats = []
while not res_queue.empty():
session_id, _stats = await res_queue.get()
if len(_stats) != 0:
stats.append(np.array(_stats))
stats = np.concatenate(stats).reshape(-1, 5)
first_token_latency_min = np.min(stats[:, 0], axis=0)
first_token_latency_max = np.max(stats[:, 0], axis=0)
first_token_latency_ave = np.mean(stats[:, 0], axis=0)
completion_tokens = np.sum(stats[:, 1], axis=0)
request_output_tokens = np.sum(stats[:, 2], axis=0)
total_tokens = np.sum(stats[:, 3], axis=0)
prompt_tokens = total_tokens - completion_tokens
completion_token_throughput = completion_tokens / elapsed_time
total_token_throughput = total_tokens / elapsed_time
rps = len(requests) / elapsed_time
rpm = rps * 60
        if not (np.abs(stats[:, 1] - stats[:, 2]) <= 1).min():
print(f'Did not generate requested number of tokens. '
f'Request {request_output_tokens:.0f}, '
f'but got {completion_tokens:.0f}')
print(f'\n{"-" * 50}\nconcurrency: {concurrency}\n'
f'elapsed_time: {elapsed_time:.3f}s\n')
if stream_output:
print(f'first_token latency(min, max, ave): '
f'{first_token_latency_min:.3f}s, '
f'{first_token_latency_max:.3f}s, '
f'{first_token_latency_ave:.3f}s\n')
print(
f'number of prompt tokens: {prompt_tokens:.0f}\n'
f'number of completion tokens: {completion_tokens:.0f}\n'
f'token throughput (completion token): {completion_token_throughput:.3f} token/s\n' # noqa
f'token throughput (prompt + completion token): {total_token_throughput:.3f} token/s\n' # noqa
f'RPS (request per second): {rps:.3f} req/s\n'
f'RPM (request per minute): {rpm:.3f} req/min\n'
f'{"-" * 50}\n')
if self.csv:
with open(self.csv, 'w') as csvfile:
writer = csv.writer(csvfile)
writer.writerow([
'batch', 'num_prompts', 'RPS', 'RPM', 'FTL(ave)(s)',
'FTL(min)(s)', 'FTL(max)(s)', 'throughput(out tok/s)',
'throughput(total tok/s)'
])
writer.writerow([
concurrency,
len(requests), f'{rps:.3f}', f'{rpm:.3f}',
f'{first_token_latency_ave:.3f}' if stream_output else '-',
f'{first_token_latency_min:.3f}' if stream_output else '-',
f'{first_token_latency_max:.3f}' if stream_output else '-',
f'{completion_token_throughput:.3f}',
f'{total_token_throughput:.3f}'
])
async def start(engine, model_name, dataset, num_prompts, concurrency, stream_output):
await engine.set_model_name(model_name)
requests = sample_requests(dataset, num_prompts, engine.tokenizer)
await engine.process_request(requests, concurrency, stream_output)
def main(server_addr: str,
tokenizer_path: str,
dataset: str,
api_key: Optional[str] = None,
model_name: Optional[str] = None,
concurrency: int = 128,
num_prompts: int = 5000,
top_p: float = 1.0,
temperature: float = 1.0,
stream_output: bool = False,
csv: str = './profile_api_server.csv',
seed: int = 0,
use_image: bool = False):
"""Benchmark the request througput of api server.
Args:
server_addr (str): http url of api_server with format http://0.0.0.0:0
tokenizer_path (str): Path to the tokenizer model in localhost
dataset (str): Path to the dataset
concurrency (int, optional): Number of working threads to process the sampled prompts.
Defaults to 128.
num_prompts (int, optional): Number of prompts to process. Defaults to 5000.
top_p (float, optional): the set of most probable tokens with
probabilities that add up to top_p or higher
are kept for generation. Defaults to 1.0.
temperature (float, optional): The value used to modulate the next token probabilities.
Defaults to 1.0.
stream_output (bool, optional): Indicator for streaming output. Defaults to False.
csv (str, optional): The path to save the result.
seed (int, optional): Seed used in sampling prompts from dataset. Defaults to 0.
use_image (bool, optional): whether to add image parameters. Defaults to False.
""" # noqa
if not server_addr.startswith('http://'):
print(f'[WARNING] server_addr of the api_server should '
f'start with "http://", but got "{server_addr}"')
server_addr = 'http://' + server_addr.strip()
random.seed(seed)
engine = Engine(server_addr,
tokenizer_path,
top_p=top_p,
temperature=temperature,
csv=csv,
api_key=api_key,
model_name=model_name,
use_image=use_image)
asyncio.run(start(engine, model_name, dataset, num_prompts, concurrency, stream_output))
if __name__ == '__main__':
fire.Fire(main)
Then I can see the batched image log. It looks like the cost grows roughly linearly with the number of images, so batching is not very effective (see the parsing sketch after the log excerpt below).
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.700s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 2.131s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.441s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 2.028s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.738s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.357s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.437s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.982s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.752s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.369s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.450s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.984s
lmdeploy - INFO - ImageEncoder forward 4 images, cost 2.844s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.155s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 1.577s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.466s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.977s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.759s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.386s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.457s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 2.003s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.443s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.978s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.743s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.353s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.452s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.974s
lmdeploy - INFO - ImageEncoder forward 4 images, cost 2.848s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.133s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 1.571s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.752s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.369s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.731s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.363s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.452s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.989s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.729s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.365s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.444s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.985s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.729s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.362s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.742s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.386s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.730s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.356s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.445s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.984s
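To quantify this, here is a small sketch that parses the ImageEncoder lines above and fits cost against batch size; 'server.log' is a placeholder for wherever the server log was saved:

# Minimal sketch: parse "ImageEncoder forward N images, cost Xs" lines and fit
# cost = a * num_images + b to check how linear the scaling is.
import re

import numpy as np

pattern = re.compile(r'ImageEncoder forward (\d+) images, cost ([\d.]+)s')
counts, costs = [], []
with open('server.log') as f:  # placeholder log path
    for line in f:
        m = pattern.search(line)
        if m:
            counts.append(int(m.group(1)))
            costs.append(float(m.group(2)))

a, b = np.polyfit(counts, costs, 1)
print(f'~{a:.3f}s per extra image, {b:.3f}s fixed overhead '
      f'over {len(counts)} batches')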
For best performance, I think it's better to split the vision and LLM models and serve the vision model with a TensorRT backend.
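For anyone who wants to try that route, here is a rough sketch of one possible conversion path; the model.vision_model attribute, the 448x448 input size and the opset are assumptions, and the ONNX export of this model may need extra work (e.g. a thin wrapper if the forward returns a ModelOutput):

# Rough sketch: export the vision tower to ONNX, then build a TensorRT engine
# with trtexec. Attribute names, shapes and opset are assumptions.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('InternVL2-26B/', trust_remote_code=True,
                                  torch_dtype=torch.float16)
vit = model.vision_model.cuda().eval()  # assumed attribute name

dummy = torch.randn(1, 3, 448, 448, dtype=torch.float16, device='cuda')
torch.onnx.export(vit, dummy, 'intern_vit.onnx',
                  input_names=['pixel_values'], output_names=['vit_embeds'],
                  dynamic_axes={'pixel_values': {0: 'batch'},
                                'vit_embeds': {0: 'batch'}},
                  opset_version=17)

# Then, for example:
#   trtexec --onnx=intern_vit.onnx --saveEngine=intern_vit.plan --fp16 \
#     --minShapes=pixel_values:1x3x448x448 \
#     --optShapes=pixel_values:8x3x448x448 \
#     --maxShapes=pixel_values:16x3x448x448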
@irexyc Do you have plans to optimize this in the future?
We don't have plans to optimize the vision model yet. If you are interested, you can convert the vision model to TensorRT and test the performance.
@irexyc Okay, thanks for your reply. I'll try deploying the vision model with TensorRT.
@irexyc If I split the vision model from the language model, how should I feed the vision model's output into the language model's prompt? Looking at the source code, input_embeddings and input_embedding_ranges carry the image features. How can I pass this information into a request through openai.client?
Passing them through openai.client would be rather troublesome; it requires changing not only the server interface but also the AsyncEngine interface. I think a slightly simpler way is to modify the vit model and load and run the ViT with the TensorRT Python API.
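A rough sketch of what such a replacement could look like, assuming a prebuilt TensorRT 8.x engine with a single dynamic-batch input (binding 0) and a single output (binding 1); the binding order and dtypes are assumptions:

# Rough sketch: load a prebuilt TensorRT engine and run the ViT forward pass
# on a batch of preprocessed images. Assumes TensorRT 8.x, binding 0 = input,
# binding 1 = output, and fp16 contiguous CUDA tensors.
import tensorrt as trt
import torch


class TRTViT:

    def __init__(self, engine_path: str):
        logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, 'rb') as f, trt.Runtime(logger) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

    @torch.no_grad()
    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        pixel_values = pixel_values.half().cuda().contiguous()
        self.context.set_binding_shape(0, tuple(pixel_values.shape))
        out = torch.empty(tuple(self.context.get_binding_shape(1)),
                          dtype=torch.float16, device='cuda')
        stream = torch.cuda.current_stream()
        self.context.execute_async_v2(
            [pixel_values.data_ptr(), out.data_ptr()], stream.cuda_stream)
        stream.synchronize()
        return out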
Has the TensorRT model been converted? Do you have any speed benchmark results?
Not yet; I'm still testing. While testing I ran into this question: looking at the lmdeploy source code, there is no entry point for passing input_embeddings and input_embedding_ranges. I'll finish the conversion first and benchmark the vision model's performance.
Hi, I also ran into slow ViT inference and looked into converting it to a trt_engine. I found that for this model, the get_visual_features implementation in trt_llm produces a shape of [1, 256, 6144], while in lmdeploy, after dynamic_preprocess, self.model.extract_feature(pixel_values) produces [13, 256, 6144]. If I use TensorRT for ViT inference, how should I handle this so that the two sides are aligned?