[Bug] relatively slow speed after deploy InternVL2-26B
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I have searched the related issues. Is the speed relatively slow because lmdeploy does not optimize the vision model, which slows down the whole request? I would like to know whether lmdeploy can process images for the vision model in parallel or in batches.
Reproduction
# start server
lmdeploy serve api_server --cache-max-entry-count 0.6 InternVL2-26B/ --server-port 23333
# 4 concurrency 100 prompts without image url
python profile_restful_api_image.py http://127.0.0.1:23333 InternVL2-26B/ HC3-Chinese/all.jsonl --stream_output true --concurrency 4 --num_prompts 100
# res
--------------------------------------------------
concurrency: 4
elapsed_time: 69.157s
first_token latency(min, max, ave): 0.039s, 0.162s, 0.063s
number of prompt tokens: 1787
number of completion tokens: 10001
token throughput (completion token): 144.612 token/s
token throughput (prompt + completion token): 170.452 token/s
RPS (request per second): 1.446 req/s
RPM (request per minute): 86.759 req/min
--------------------------------------------------
# 4 concurrency 100 prompts with image url
python profile_restful_api_image.py http://127.0.0.1:23333 InternVL2-26B/ HC3-Chinese/all.jsonl --stream_output true --concurrency 4 --num_prompts 100 --use_image true
# res
--------------------------------------------------
concurrency: 4
elapsed_time: 160.245s
first_token latency(min, max, ave): 1.065s, 3.772s, 1.390s
number of prompt tokens: 1787
number of completion tokens: 9932
token throughput (completion token): 61.980 token/s
token throughput (prompt + completion token): 73.132 token/s
RPS (request per second): 0.624 req/s
RPM (request per minute): 37.443 req/min
--------------------------------------------------
Environment
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.5, V12.5.40
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.17.2+cu121
LMDeploy: 0.5.1+
transformers: 4.37.2
gradio: 4.39.0
fastapi: 0.110.0
pydantic: 2.6.3
triton: 2.2.0
NVIDIA Topology:
GPU0 NIC0 NIC1 NIC2 NIC3 NIC4 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE PXB SYS SYS NODE 0-23,48-71 0 N/A
NIC0 NODE X NODE SYS SYS NODE
NIC1 PXB NODE X SYS SYS NODE
NIC2 SYS SYS SYS X NODE SYS
NIC3 SYS SYS SYS NODE X SYS
NIC4 NODE NODE NODE SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_2
NIC1: mlx5_3
NIC2: mlx5_4
NIC3: mlx5_5
NIC4: mlx5_bond_0
Error traceback
No response
You can set --vision-max-batch-size when you start the server. But there is only a small probability of forming a batch unless one request contains multiple images or the interval between two requests is very small. Or maybe we could wait a little longer here.
With the PyTorch engine, the time increases almost linearly as the batch size grows, especially when the ViT model is large.
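For example, a batch does form when a single request carries several images. Here is a minimal sketch using the openai client (the image URLs, the /v1 suffix and the server address are placeholders/assumptions, not taken from this issue):

# Minimal sketch: several images in one request, so the ImageEncoder sees them
# as a single batch. Server address, '/v1' suffix and image URLs are assumptions.
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://127.0.0.1:23333/v1')
model_name = client.models.list().data[0].id

image_urls = [
    'https://example.com/image_1.jpg',  # placeholder
    'https://example.com/image_2.jpg',  # placeholder
]
content = [{'type': 'text', 'text': 'Describe these images.'}]
content += [{'type': 'image_url', 'image_url': {'url': u}} for u in image_urls]

response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': content}],
    max_tokens=256,
)
print(response.choices[0].message.content)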
BTW, could you share the profile_restful_api_image.py file?
@irexyc Thank you for your reply. This is profile_restful_api_image.py; I'll try --vision-max-batch-size later.
import csv
import json
import random
import time
from queue import Queue
from threading import Thread
from typing import List, Optional, Tuple
import fire
import numpy as np
from tqdm import tqdm
from transformers import AutoTokenizer
from lmdeploy.serve.openai.api_client import APIClient
def sample_requests(
dataset_path: str,
num_requests: int,
tokenizer: AutoTokenizer,
) -> List[Tuple[str, int, int]]:
# Load the dataset.
dataset = []
for line in open(dataset_path):
dataset.append(json.loads(line))
# Tokenize the prompts and completions.
prompts = [prompt["question"] for prompt in dataset]
prompt_token_ids = tokenizer(prompts).input_ids
completions = [prompt["chatgpt_answers"][0] for prompt in dataset]
completion_token_ids = tokenizer(completions).input_ids
tokenized_dataset = []
for i in range(len(dataset)):
max_tokens = len(completion_token_ids[i])
prompt_len = len(prompt_token_ids[i])
if prompt_len < 4 or max_tokens < 4:
# Prune too short sequences.
continue
if prompt_len > 1024 or prompt_len + max_tokens > 2048:
# Prune too long sequences.
continue
tokenized_dataset.append((prompts[i], prompt_len, max_tokens))
if len(tokenized_dataset) < num_requests:
b = num_requests // len(tokenized_dataset)
tokenized_dataset += tokenized_dataset * b
sampled_requests = random.sample(tokenized_dataset, num_requests)
random.shuffle(sampled_requests)
return sampled_requests
class Engine:
def __init__(self,
server_addr: str,
tokenzier_path: str,
temperature: float = 0.8,
top_p: float = 1.0,
csv: str = '',
api_key: Optional[str] = None,
model_name: Optional[str] = None,
use_image: Optional[bool] = False,
**kwargs):
self.tokenizer = AutoTokenizer.from_pretrained(tokenzier_path,
trust_remote_code=True)
self.server_addr = server_addr
self.temperature = temperature
self.top_p = top_p
self.csv = csv
self.api_key = api_key
self.use_image = use_image
client = APIClient(self.server_addr, api_key=self.api_key)
if model_name is None:
self.model_name = client.available_models[0]
print(f'using model: {self.model_name}\n')
else:
self.model_name = model_name
self.pbar = None
def _inference(self, req_queue: Queue, res_queue: Queue, session_id: int,
stream_output: bool):
stats = []
client = APIClient(self.server_addr, api_key=self.api_key)
for prompt, input_seqlen, max_tokens in iter(
req_queue.get, [None, None, None]):
timestamps = []
timestamps.append(time.perf_counter())
if self.use_image:
messages = [{"role": "user", 'content': [
{
'type': 'text',
'text': prompt,
},
{
'type': 'image_url',
'image_url': {
'url': "https://img1.baidu.com/it/u=4157744492,1349578166&fm=253&fmt=auto&app=120&f=JPEG?w=500&h=750"
},
}
]}]
else:
messages = [{"role": "user", 'content': [
{
'type': 'text',
'text': prompt,
},
]}]
answer = ""
for output in client.chat_completions_v1(
model=self.model_name,
messages=messages,
temperature=self.temperature,
top_p=self.top_p,
n=1,
stream=stream_output,
max_tokens=max_tokens,
session_id=session_id,
ignore_eos=True):
answer += output["choices"][0]["delta"]["content"]
timestamps.append(time.perf_counter())
output_seqlen = len(self.tokenizer(answer).input_ids)
first_token_latency = np.round(timestamps[1] - timestamps[0], 3)
token_latency = np.round(timestamps[-1] - timestamps[0], 3)
# assert output.pop('finish_reason') == 'length', \
# f'Error. session_id({session_id}) request {output_seqlen} ' \
# f'tokens, but `finish_reason` is not `length`'
total_tokens = input_seqlen + output_seqlen
stats.append([
first_token_latency, output_seqlen, output_seqlen,
total_tokens, token_latency
])
self.pbar.update(1)
res_queue.put((session_id, stats))
def process_request(self,
requests,
concurrency: int = 1,
stream_output: bool = False):
res_queue = Queue()
req_queue = Queue()
threads = []
self.pbar = tqdm(total=len(requests))
# feed request to q
for req in requests:
req_queue.put(req)
for i in range(concurrency):
req_queue.put([None, None, None])
start = time.time()
# start threads
for i in range(concurrency):
t = Thread(target=self._inference,
args=(req_queue, res_queue, i, stream_output))
t.start()
threads.append(t)
# wait for finish
for t in threads:
t.join()
elapsed_time = time.time() - start
stats = []
while not res_queue.empty():
session_id, _stats = res_queue.get()
if len(_stats) != 0:
stats.append(np.array(_stats))
stats = np.concatenate(stats).reshape(-1, 5)
first_token_latency_min = np.min(stats[:, 0], axis=0)
first_token_latency_max = np.max(stats[:, 0], axis=0)
first_token_latency_ave = np.mean(stats[:, 0], axis=0)
completion_tokens = np.sum(stats[:, 1], axis=0)
request_output_tokens = np.sum(stats[:, 2], axis=0)
total_tokens = np.sum(stats[:, 3], axis=0)
prompt_tokens = total_tokens - completion_tokens
completion_token_throughput = completion_tokens / elapsed_time
total_token_throughput = total_tokens / elapsed_time
rps = len(requests) / elapsed_time
rpm = rps * 60
        if not (np.abs(stats[:, 1] - stats[:, 2]) <= 1).min():
print(f'Did not generate requested number of tokens. '
f'Request {request_output_tokens:.0f}, '
f'but got {completion_tokens:.0f}')
print(f'\n{"-" * 50}\nconcurrency: {concurrency}\n'
f'elapsed_time: {elapsed_time:.3f}s\n')
if stream_output:
print(f'first_token latency(min, max, ave): '
f'{first_token_latency_min:.3f}s, '
f'{first_token_latency_max:.3f}s, '
f'{first_token_latency_ave:.3f}s\n')
print(
f'number of prompt tokens: {prompt_tokens:.0f}\n'
f'number of completion tokens: {completion_tokens:.0f}\n'
f'token throughput (completion token): {completion_token_throughput:.3f} token/s\n' # noqa
f'token throughput (prompt + completion token): {total_token_throughput:.3f} token/s\n' # noqa
f'RPS (request per second): {rps:.3f} req/s\n'
f'RPM (request per minute): {rpm:.3f} req/min\n'
f'{"-" * 50}\n')
if self.csv:
with open(self.csv, 'w') as csvfile:
writer = csv.writer(csvfile)
writer.writerow([
'batch', 'num_prompts', 'RPS', 'RPM', 'FTL(ave)(s)',
'FTL(min)(s)', 'FTL(max)(s)', 'throughput(out tok/s)',
'throughput(total tok/s)'
])
writer.writerow([
concurrency,
len(requests), f'{rps:.3f}', f'{rpm:.3f}',
f'{first_token_latency_ave:.3f}' if stream_output else '-',
f'{first_token_latency_min:.3f}' if stream_output else '-',
f'{first_token_latency_max:.3f}' if stream_output else '-',
f'{completion_token_throughput:.3f}',
f'{total_token_throughput:.3f}'
])
def main(server_addr: str,
tokenizer_path: str,
dataset: str,
api_key: Optional[str] = None,
model_name: Optional[str] = None,
concurrency: int = 128,
num_prompts: int = 5000,
top_p: float = 1.0,
temperature: float = 1.0,
stream_output: bool = False,
csv: str = './profile_api_server.csv',
seed: int = 0,
use_image: bool = False):
"""Benchmark the request througput of api server.
Args:
server_addr (str): http url of api_server with format http://0.0.0.0:0
tokenizer_path (str): Path to the tokenizer model in localhost
dataset (str): Path to the dataset
concurrency (int, optional): Number of working threads to process the sampled prompts.
Defaults to 128.
num_prompts (int, optional): Number of prompts to process. Defaults to 5000.
top_p (float, optional): the set of most probable tokens with
probabilities that add up to top_p or higher
are kept for generation. Defaults to 1.0.
temperature (float, optional): The value used to modulate the next token probabilities.
Defaults to 1.0.
stream_output (bool, optional): Indicator for streaming output. Defaults to False.
csv (str, optional): The path to save the result.
seed (int, optional): Seed used in sampling prompts from dataset. Defaults to 0.
use_image (bool, optional): whether to add image parameters. Defaults to False.
""" # noqa
if not server_addr.startswith('http://'):
print(f'[WARNING] server_addr of the api_server should '
f'start with "http://", but got "{server_addr}"')
server_addr = 'http://' + server_addr.strip()
random.seed(seed)
engine = Engine(server_addr,
tokenizer_path,
top_p=top_p,
temperature=temperature,
csv=csv,
api_key=api_key,
model_name=model_name,
use_image=use_image)
requests = sample_requests(dataset, num_prompts, engine.tokenizer)
engine.process_request(requests, concurrency, stream_output)
if __name__ == '__main__':
fire.Fire(main)
@irexyc These are my results after adding '--vision-max-batch-size'; it feels the same as without it.
lmdeploy serve api_server --cache-max-entry-count 0.6 /home/notebook/data/personal/W9088934/InternVL2-26B/ --server-port 23333 --vision-max-batch-size 8
python profile_restful_api_image.py http://127.0.0.1:23333 /home/notebook/data/personal/W9088934/InternVL2-26B /home/notebook/data/personal/W9088934/datasets/HC3-Chinese/all.jsonl --stream_output true --concurrency 4 --num_prompts 100 --use_image true
--------------------------------------------------
concurrency: 4
elapsed_time: 157.052s
first_token latency(min, max, ave): 1.028s, 4.047s, 1.282s
number of prompt tokens: 1787
number of completion tokens: 9956
token throughput (completion token): 63.393 token/s
token throughput (prompt + completion token): 74.771 token/s
RPS (request per second): 0.637 req/s
RPM (request per minute): 38.204 req/min
--------------------------------------------------
For this test script, --vision-max-batch-size has almost no effect, since the threads rarely send requests at the same time except for the first ones.
BTW, it's better to use the base64 format of the image, as it removes the image downloading time. You can refer to this.
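For example, a minimal sketch of building such a data URI (the file name is a placeholder; the modified script below does the same thing):

# Minimal sketch: embed a local image as a base64 data URI so the server does
# not spend time downloading it. 'example.png' is a placeholder path.
import base64

with open('example.png', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode('utf-8')
image_item = {'type': 'image_url',
              'image_url': {'url': f'data:image/png;base64,{b64}'}}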
@irexyc Following your comments, I modified the test script and added --log-level INFO when starting the server.
import asyncio
import base64
import csv
import json
import random
import time
# from queue import Queue
from asyncio import Queue
from typing import List, Optional, Tuple
import fire
import numpy as np
from tqdm import tqdm
from transformers import AutoTokenizer
from openai import AsyncOpenAI
with open("example.png", "rb") as f:
bs64_img = f'data:image/jpeg;base64,{base64.b64encode(f.read()).decode("utf-8")}'
def sample_requests(
dataset_path: str,
num_requests: int,
tokenizer: AutoTokenizer,
) -> List[Tuple[str, int, int]]:
# Load the dataset.
dataset = []
for line in open(dataset_path):
dataset.append(json.loads(line))
# Tokenize the prompts and completions.
prompts = [prompt["question"] for prompt in dataset]
prompt_token_ids = tokenizer(prompts).input_ids
completions = [prompt["chatgpt_answers"][0] for prompt in dataset]
completion_token_ids = tokenizer(completions).input_ids
tokenized_dataset = []
for i in range(len(dataset)):
max_tokens = len(completion_token_ids[i])
prompt_len = len(prompt_token_ids[i])
if prompt_len < 4 or max_tokens < 4:
# Prune too short sequences.
continue
if prompt_len > 1024 or prompt_len + max_tokens > 2048:
# Prune too long sequences.
continue
tokenized_dataset.append((prompts[i], prompt_len, max_tokens))
if len(tokenized_dataset) < num_requests:
b = num_requests // len(tokenized_dataset)
tokenized_dataset += tokenized_dataset * b
sampled_requests = random.sample(tokenized_dataset, num_requests)
random.shuffle(sampled_requests)
return sampled_requests
class Engine:
def __init__(self,
server_addr: str,
tokenzier_path: str,
temperature: float = 0.8,
top_p: float = 1.0,
csv: str = '',
api_key: Optional[str] = None,
model_name: Optional[str] = None,
use_image: Optional[bool] = False,
**kwargs):
self.tokenizer = AutoTokenizer.from_pretrained(tokenzier_path,
trust_remote_code=True)
self.server_addr = server_addr
self.temperature = temperature
self.top_p = top_p
self.csv = csv
self.api_key = api_key
self.use_image = use_image
self.pbar = None
self.model_name = model_name
async def set_model_name(self, model_name):
if model_name is None:
client = AsyncOpenAI(api_key='YOUR_API_KEY', base_url=self.server_addr)
models = await client.models.list()
self.model_name = models.data[0].id
print(f'using model: {self.model_name}\n')
else:
self.model_name = model_name
async def _inference(self, data: Tuple[str, int, int], res_queue: Queue, session_id: int,
stream_output: bool):
stats = []
client = AsyncOpenAI(api_key='YOUR_API_KEY', base_url=self.server_addr)
prompt, input_seqlen, max_tokens = data
timestamps = []
timestamps.append(time.perf_counter())
if self.use_image:
messages = [{"role": "user", 'content': [
{
'type': 'text',
'text': prompt,
},
{
'type': 'image_url',
'image_url': {
'url': bs64_img
},
}
]}]
else:
messages = [{"role": "user", 'content': [
{
'type': 'text',
'text': prompt,
},
]}]
response = await client.chat.completions.create(
model=self.model_name,
messages=messages,
temperature=self.temperature,
top_p=self.top_p,
n=1,
stream=stream_output,
max_tokens=max_tokens,
)
answer = ""
async for i in response:
delta = i.choices[0].delta.content
answer += delta
timestamps.append(time.perf_counter())
output_seqlen = len(self.tokenizer(answer).input_ids)
first_token_latency = np.round(timestamps[1] - timestamps[0], 3)
token_latency = np.round(timestamps[-1] - timestamps[0], 3)
total_tokens = input_seqlen + output_seqlen
stats.append([
first_token_latency, output_seqlen, output_seqlen,
total_tokens, token_latency
])
self.pbar.update(1)
await res_queue.put((session_id, stats))
async def process_request(self,
requests,
concurrency: int = 1,
stream_output: bool = False):
res_queue = Queue()
self.pbar = tqdm(total=len(requests))
start = time.time()
workers = []
for i, data in enumerate(requests, start=1):
workers.append(self._inference(data, res_queue, i % concurrency, stream_output))
if i % concurrency == 0:
await asyncio.gather(*workers)
workers.clear()
elapsed_time = time.time() - start
stats = []
while not res_queue.empty():
session_id, _stats = await res_queue.get()
if len(_stats) != 0:
stats.append(np.array(_stats))
stats = np.concatenate(stats).reshape(-1, 5)
first_token_latency_min = np.min(stats[:, 0], axis=0)
first_token_latency_max = np.max(stats[:, 0], axis=0)
first_token_latency_ave = np.mean(stats[:, 0], axis=0)
completion_tokens = np.sum(stats[:, 1], axis=0)
request_output_tokens = np.sum(stats[:, 2], axis=0)
total_tokens = np.sum(stats[:, 3], axis=0)
prompt_tokens = total_tokens - completion_tokens
completion_token_throughput = completion_tokens / elapsed_time
total_token_throughput = total_tokens / elapsed_time
rps = len(requests) / elapsed_time
rpm = rps * 60
        if not (np.abs(stats[:, 1] - stats[:, 2]) <= 1).min():
print(f'Did not generate requested number of tokens. '
f'Request {request_output_tokens:.0f}, '
f'but got {completion_tokens:.0f}')
print(f'\n{"-" * 50}\nconcurrency: {concurrency}\n'
f'elapsed_time: {elapsed_time:.3f}s\n')
if stream_output:
print(f'first_token latency(min, max, ave): '
f'{first_token_latency_min:.3f}s, '
f'{first_token_latency_max:.3f}s, '
f'{first_token_latency_ave:.3f}s\n')
print(
f'number of prompt tokens: {prompt_tokens:.0f}\n'
f'number of completion tokens: {completion_tokens:.0f}\n'
f'token throughput (completion token): {completion_token_throughput:.3f} token/s\n' # noqa
f'token throughput (prompt + completion token): {total_token_throughput:.3f} token/s\n' # noqa
f'RPS (request per second): {rps:.3f} req/s\n'
f'RPM (request per minute): {rpm:.3f} req/min\n'
f'{"-" * 50}\n')
if self.csv:
with open(self.csv, 'w') as csvfile:
writer = csv.writer(csvfile)
writer.writerow([
'batch', 'num_prompts', 'RPS', 'RPM', 'FTL(ave)(s)',
'FTL(min)(s)', 'FTL(max)(s)', 'throughput(out tok/s)',
'throughput(total tok/s)'
])
writer.writerow([
concurrency,
len(requests), f'{rps:.3f}', f'{rpm:.3f}',
f'{first_token_latency_ave:.3f}' if stream_output else '-',
f'{first_token_latency_min:.3f}' if stream_output else '-',
f'{first_token_latency_max:.3f}' if stream_output else '-',
f'{completion_token_throughput:.3f}',
f'{total_token_throughput:.3f}'
])
async def start(engine, model_name, dataset, num_prompts, concurrency, stream_output):
await engine.set_model_name(model_name)
requests = sample_requests(dataset, num_prompts, engine.tokenizer)
await engine.process_request(requests, concurrency, stream_output)
def main(server_addr: str,
tokenizer_path: str,
dataset: str,
api_key: Optional[str] = None,
model_name: Optional[str] = None,
concurrency: int = 128,
num_prompts: int = 5000,
top_p: float = 1.0,
temperature: float = 1.0,
stream_output: bool = False,
csv: str = './profile_api_server.csv',
seed: int = 0,
use_image: bool = False):
"""Benchmark the request througput of api server.
Args:
server_addr (str): http url of api_server with format http://0.0.0.0:0
tokenizer_path (str): Path to the tokenizer model in localhost
dataset (str): Path to the dataset
concurrency (int, optional): Number of working threads to process the sampled prompts.
Defaults to 128.
num_prompts (int, optional): Number of prompts to process. Defaults to 5000.
top_p (float, optional): the set of most probable tokens with
probabilities that add up to top_p or higher
are kept for generation. Defaults to 1.0.
temperature (float, optional): The value used to modulate the next token probabilities.
Defaults to 1.0.
stream_output (bool, optional): Indicator for streaming output. Defaults to False.
csv (str, optional): The path to save the result.
seed (int, optional): Seed used in sampling prompts from dataset. Defaults to 0.
use_image (bool, optional): whether to add image parameters. Defaults to False.
""" # noqa
if not server_addr.startswith('http://'):
print(f'[WARNING] server_addr of the api_server should '
f'start with "http://", but got "{server_addr}"')
server_addr = 'http://' + server_addr.strip()
random.seed(seed)
engine = Engine(server_addr,
tokenizer_path,
top_p=top_p,
temperature=temperature,
csv=csv,
api_key=api_key,
model_name=model_name,
use_image=use_image)
asyncio.run(start(engine, model_name, dataset, num_prompts, concurrency, stream_output))
if __name__ == '__main__':
fire.Fire(main)
Then I can see the batched image log. It looks like the cost grows roughly linearly with the number of images, so batching is not very effective (see the parsing sketch after the log excerpt below).
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.700s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 2.131s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.441s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 2.028s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.738s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.357s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.437s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.982s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.752s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.369s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.450s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.984s
lmdeploy - INFO - ImageEncoder forward 4 images, cost 2.844s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.155s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 1.577s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.466s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.977s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.759s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.386s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.457s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 2.003s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.443s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.978s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.743s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.353s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.452s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.974s
lmdeploy - INFO - ImageEncoder forward 4 images, cost 2.848s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.133s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 1.571s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.752s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.369s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.731s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.363s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.452s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.989s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.729s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.365s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.444s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.985s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.729s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.362s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.742s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.386s
lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.730s
lmdeploy - INFO - ImageEncoder forward 3 images, cost 2.356s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.445s
lmdeploy - INFO - ImageEncoder forward 2 images, cost 1.984s
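To quantify this, here is a small sketch that parses the ImageEncoder lines above and fits cost against batch size; 'server.log' is a placeholder for wherever the server log was saved:

# Minimal sketch: parse "ImageEncoder forward N images, cost Xs" lines and fit
# cost = a * num_images + b to check how linear the scaling is.
import re

import numpy as np

pattern = re.compile(r'ImageEncoder forward (\d+) images, cost ([\d.]+)s')
counts, costs = [], []
with open('server.log') as f:  # placeholder log path
    for line in f:
        m = pattern.search(line)
        if m:
            counts.append(int(m.group(1)))
            costs.append(float(m.group(2)))

a, b = np.polyfit(counts, costs, 1)
print(f'~{a:.3f}s per extra image, {b:.3f}s fixed overhead '
      f'over {len(counts)} batches')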
For best performance, I think it's better to split the vision and LLM models and serve the vision model with a TensorRT backend.
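For anyone who wants to try that route, here is a rough sketch of one possible conversion path; the model.vision_model attribute, the 448x448 input size and the opset are assumptions, and the ONNX export of this model may need extra work (e.g. a thin wrapper if the forward returns a ModelOutput):

# Rough sketch: export the vision tower to ONNX, then build a TensorRT engine
# with trtexec. Attribute names, shapes and opset are assumptions.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('InternVL2-26B/', trust_remote_code=True,
                                  torch_dtype=torch.float16)
vit = model.vision_model.cuda().eval()  # assumed attribute name

dummy = torch.randn(1, 3, 448, 448, dtype=torch.float16, device='cuda')
torch.onnx.export(vit, dummy, 'intern_vit.onnx',
                  input_names=['pixel_values'], output_names=['vit_embeds'],
                  dynamic_axes={'pixel_values': {0: 'batch'},
                                'vit_embeds': {0: 'batch'}},
                  opset_version=17)

# Then, for example:
#   trtexec --onnx=intern_vit.onnx --saveEngine=intern_vit.plan --fp16 \
#     --minShapes=pixel_values:1x3x448x448 \
#     --optShapes=pixel_values:8x3x448x448 \
#     --maxShapes=pixel_values:16x3x448x448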
@irexyc Do you have plans to optimize this in the future?
We don't have plans to optimize the vision model yet. If you are interested, you can convert the vision model to TensorRT and test the performance.
@irexyc Okay, thanks for your reply. I'll try deploying the vision model with TensorRT.
@irexyc If I split the vision model from the language model, how should I feed the vision model's output into the language model's prompt? Looking at the source code, input_embeddings and input_embedding_ranges carry the image features. How can I pass this information into a request through openai.client?
Passing them through openai.client would be rather troublesome; it requires changing not only the server interface but also the AsyncEngine interface. I think a slightly simpler way is to modify the vit model and load and run the ViT with the TensorRT Python API.
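A rough sketch of what such a replacement could look like, assuming a prebuilt TensorRT 8.x engine with a single dynamic-batch input (binding 0) and a single output (binding 1); the binding order and dtypes are assumptions:

# Rough sketch: load a prebuilt TensorRT engine and run the ViT forward pass
# on a batch of preprocessed images. Assumes TensorRT 8.x, binding 0 = input,
# binding 1 = output, and fp16 contiguous CUDA tensors.
import tensorrt as trt
import torch


class TRTViT:

    def __init__(self, engine_path: str):
        logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, 'rb') as f, trt.Runtime(logger) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

    @torch.no_grad()
    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        pixel_values = pixel_values.half().cuda().contiguous()
        self.context.set_binding_shape(0, tuple(pixel_values.shape))
        out = torch.empty(tuple(self.context.get_binding_shape(1)),
                          dtype=torch.float16, device='cuda')
        stream = torch.cuda.current_stream()
        self.context.execute_async_v2(
            [pixel_values.data_ptr(), out.data_ptr()], stream.cuda_stream)
        stream.synchronize()
        return out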
Has the TensorRT model been converted? Do you have any speed benchmark results?
Not yet; I'm still testing. While testing I ran into this question: looking at the lmdeploy source code, there is no entry point for passing input_embeddings and input_embedding_ranges. I'll finish the conversion first and benchmark the vision model's performance.
Hi, I also ran into slow ViT inference and looked into converting it to a trt_engine. I found that for this model, the get_visual_features implementation in trt_llm produces a shape of [1, 256, 6144], while in lmdeploy, after dynamic_preprocess, self.model.extract_feature(pixel_values) produces [13, 256, 6144]. If I use TensorRT for ViT inference, how should I handle this so that the two sides are aligned?