
InternVL2-40B-AWQ+lmdeploy video infer speed is very slow

AmazDeng opened this issue 1 year ago · 16 comments

I am using the InternVL2-40B-AWQ model and performing video inference following the multi-image inference paradigm. Each video is sampled into 24 frames, and the prompt is shown below. My questions are:

1. When performing video inference, is the video processed according to the multi-image paradigm? Is this method correct, and is there a specific prompt format for video inference? In the official documentation (https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html#) I only see inference for single images and multiple images, but I haven't found inference code specifically for videos.

2. During video inference (following the multi-image paradigm) with num_segments=24, I encounter an error: lmdeploy - ERROR - Truncate max_new_tokens to 128 when session_len=8192. Through repeated testing, I found that session_len=65536 is necessary for correct inference. How many tokens does a single image actually consume before being fed into the language model? According to this issue (https://github.com/OpenGVLab/InternVL/issues/381), the model's max_seq_length is at most 8192, so why can I set session_len=65536? Related link: https://github.com/InternLM/lmdeploy/issues/2382.

3. During video inference (following the multi-image paradigm) with num_segments=24, inference with lmdeploy takes about 60 seconds, while transformers takes around 9 seconds. Why is that, and is there room for optimization? Why is lmdeploy significantly slower than the PyTorch implementation?

code demo

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy import GenerationConfig
import time


target_images=["2.jpg","6.jpg","11.jpg","15.jpg","20.jpg","24.jpg","28.jpg","33.jpg","37.jpg","41.jpg","46.jpg","50.jpg","54.jpg","59.jpg","63.jpg","68.jpg","72.jpg","76.jpg","81.jpg","85.jpg","90.jpg","94.jpg","98.jpg","103.jpg"]

# 2,   6,  11,  15,  20,  24,  28,  33,  37,  41,  46,  50,  54,
#         59,  63,  68,  72,  76,  81,  85,  90,  94,  98, 103

pixel_values_list=[]
for ele in target_images:
    img=load_image(f"./image/{ele}")
    pixel_values_list.append(img)
    
question="Frame1: <img>{IMAGE_TOKEN}</img>\nFrame2: <img>{IMAGE_TOKEN}</img>\nFrame3: <img>{IMAGE_TOKEN}</img>\nFrame4: <img>{IMAGE_TOKEN}</img>\nFrame5: <img>{IMAGE_TOKEN}</img>\nFrame6: <img>{IMAGE_TOKEN}</img>\nFrame7: <img>{IMAGE_TOKEN}</img>\nFrame8: <img>{IMAGE_TOKEN}</img>\nFrame9: <img>{IMAGE_TOKEN}</img>\nFrame10: <img>{IMAGE_TOKEN}</img>\nFrame11: <img>{IMAGE_TOKEN}</img>\nFrame12: <img>{IMAGE_TOKEN}</img>\nFrame13: <img>{IMAGE_TOKEN}</img>\nFrame14: <img>{IMAGE_TOKEN}</img>\nFrame15: <img>{IMAGE_TOKEN}</img>\nFrame16: <img>{IMAGE_TOKEN}</img>\nFrame17: <img>{IMAGE_TOKEN}</img>\nFrame18: <img>{IMAGE_TOKEN}</img>\nFrame19: <img>{IMAGE_TOKEN}</img>\nFrame20: <img>{IMAGE_TOKEN}</img>\nFrame21: <img>{IMAGE_TOKEN}</img>\nFrame22: <img>{IMAGE_TOKEN}</img>\nFrame23: <img>{IMAGE_TOKEN}</img>\nFrame24: <img>{IMAGE_TOKEN}</img>\nIs a person in the video?Answer Yes or No in one word."


model_path = '/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-40B-AWQ'
#model_path = '/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-1B'
model = pipeline(model_path, backend_config=TurbomindEngineConfig(session_len=65536))
start=time.time()
response_l = model([(question, pixel_values_list)],gen_config=GenerationConfig(max_new_tokens=1024,top_p=1))
print(response_l[0].text)
print(f"run time is {time.time()-start}")

Images: image.zip

AmazDeng avatar Aug 30 '24 07:08 AmazDeng

@czczup @whai362 @ErfeiCui @hjh0119 @lvhan028 @Adushar @Weiyun1025 @cg1177 @opengvlab-admin @qishisuren123 @dlutwy Could you please take a look at this issue?

AmazDeng avatar Aug 30 '24 07:08 AmazDeng

@AmazDeng

For video understanding with InternVL2, you can refer to these docs (the video multi-round conversation part).

To see how many input tokens are used, you can turn on logging with pipe = pipeline('...', log_level='INFO'). For InternVL2, a single image with one patch uses 256 tokens. By default, a single image gets 1~13 patches depending on its aspect ratio. Reducing max_dynamic_patch will speed up inference; the official video demo also uses one patch per frame of a video clip.
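
For example, a minimal sketch of those two knobs (the model path and frame file are placeholders): enable INFO logging and cap a frame at a single 256-token patch via max_dynamic_patch:

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl.constants import IMAGE_TOKEN
from lmdeploy.vl.utils import encode_image_base64
from PIL import Image

# Placeholder model path; log_level='INFO' prints the per-image encoder cost.
pipe = pipeline('/path/to/InternVL2-2B',
                backend_config=TurbomindEngineConfig(session_len=16384),
                log_level='INFO')

img = Image.open('frame_0.jpg')  # placeholder frame
content = [
    {'type': 'text', 'text': f'Frame1: <img>{IMAGE_TOKEN}</img>\nIs a person visible? Answer Yes or No.'},
    # max_dynamic_patch=1 keeps this frame at one patch, i.e. 256 tokens.
    {'type': 'image_url', 'image_url': {'max_dynamic_patch': 1,
                                        'url': f'data:image/jpeg;base64,{encode_image_base64(img)}'}},
]
out = pipe([dict(role='user', content=content)], gen_config=GenerationConfig(top_k=1))
print(out.text)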

related issue https://github.com/InternLM/lmdeploy/issues/2260

irexyc avatar Aug 30 '24 08:08 irexyc

@irexyc

The video multi-round conversation example and the code I provided are quite similar. I also ran tests following the video multi-round conversation example with InternVL2-1B, and the inference time was still approximately 40 seconds. This indicates that inference with lmdeploy is significantly slower than with transformers.

I've uploaded the video and code, you can also test it on your own machine.

test code

import numpy as np
from lmdeploy import pipeline, GenerationConfig
from decord import VideoReader, cpu
from lmdeploy.vl.constants import IMAGE_TOKEN
from lmdeploy.vl.utils import encode_image_base64
from PIL import Image
import time
from lmdeploy import TurbomindEngineConfig

pipe = pipeline('/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-1B',backend_config=TurbomindEngineConfig(session_len=65536))


def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices


def load_video(video_path, bound=None, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    pixel_values_list, num_patches_list = [], []
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    imgs = []
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        imgs.append(img)
    return imgs


video_path = '/media/star/8T/tmp/1.mp4'
imgs = load_video(video_path, num_segments=24)

question = ''
for i in range(len(imgs)):
    question = question + f'Frame{i+1}: <img>{IMAGE_TOKEN}</img>\n'

question += 'Is there a person in the video?Answer Yes or No in one word.'
print(f"question={question}")

content = [{'type': 'text', 'text': question}]
for img in imgs:
    content.append({'type': 'image_url', 'image_url': {'max_dynamic_patch': 1, 'url': f'data:image/jpeg;base64,{encode_image_base64(img)}'}})

messages = [dict(role='user', content=content)]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='Describe this video in detail. Don\'t repeat.'))
start=time.time()
for index in range(5):
    out = pipe(messages, gen_config=GenerationConfig(top_k=1))
    print(f"out={out}")
print(f"average run time is {(time.time()-start)/5} seconds")

https://github.com/user-attachments/assets/25bbb29a-28c0-498c-9ff9-78c050b9c3c6

AmazDeng avatar Aug 30 '24 11:08 AmazDeng

In my test, InternVL2-1B (which cannot use the turbomind backend and falls back to the pytorch backend) takes an average of 5.4 s. InternVL2-2B takes an average of 1.8 s.

irexyc avatar Sep 03 '24 03:09 irexyc

In my test, InternVL2-1B (which cannot use the turbomind backend and falls back to the pytorch backend) takes an average of 5.4 s. InternVL2-2B takes an average of 1.8 s.

@irexyc May I ask whether you tested with the code and images I provided earlier? The InternVL2-1B run I tested takes 50 seconds. What could be the reason for this?

test code

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy import GenerationConfig
import time

target_images = ["2.jpg", "6.jpg", "11.jpg", "15.jpg", "20.jpg", "24.jpg", "28.jpg", "33.jpg", "37.jpg", "41.jpg",
                 "46.jpg", "50.jpg", "54.jpg", "59.jpg", "63.jpg", "68.jpg", "72.jpg", "76.jpg", "81.jpg", "85.jpg",
                 "90.jpg", "94.jpg", "98.jpg", "103.jpg"]

pixel_values_list = []
for ele in target_images:
    img = load_image(f"./image/{ele}")
    pixel_values_list.append(img)

question = "Frame1: <img>{IMAGE_TOKEN}</img>\nFrame2: <img>{IMAGE_TOKEN}</img>\nFrame3: <img>{IMAGE_TOKEN}</img>\nFrame4: <img>{IMAGE_TOKEN}</img>\nFrame5: <img>{IMAGE_TOKEN}</img>\nFrame6: <img>{IMAGE_TOKEN}</img>\nFrame7: <img>{IMAGE_TOKEN}</img>\nFrame8: <img>{IMAGE_TOKEN}</img>\nFrame9: <img>{IMAGE_TOKEN}</img>\nFrame10: <img>{IMAGE_TOKEN}</img>\nFrame11: <img>{IMAGE_TOKEN}</img>\nFrame12: <img>{IMAGE_TOKEN}</img>\nFrame13: <img>{IMAGE_TOKEN}</img>\nFrame14: <img>{IMAGE_TOKEN}</img>\nFrame15: <img>{IMAGE_TOKEN}</img>\nFrame16: <img>{IMAGE_TOKEN}</img>\nFrame17: <img>{IMAGE_TOKEN}</img>\nFrame18: <img>{IMAGE_TOKEN}</img>\nFrame19: <img>{IMAGE_TOKEN}</img>\nFrame20: <img>{IMAGE_TOKEN}</img>\nFrame21: <img>{IMAGE_TOKEN}</img>\nFrame22: <img>{IMAGE_TOKEN}</img>\nFrame23: <img>{IMAGE_TOKEN}</img>\nFrame24: <img>{IMAGE_TOKEN}</img>\nIs a person in the video?Answer Yes or No in one word."

model_path = '/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-1B'
model = pipeline(model_path, backend_config=TurbomindEngineConfig(session_len=65536))
start = time.time()
inter_num=10
for _ in range(inter_num):
    response_l = model([(question, pixel_values_list)], gen_config=GenerationConfig(max_new_tokens=1024, top_p=1))
    print(response_l[0].text)
print(f"run time is {round((time.time() - start)/10,2)} seconds")


image.zip

AmazDeng avatar Sep 03 '24 05:09 AmazDeng

@AmazDeng

The test_code.py is different from the code you posted before. In your previous code you set max_dynamic_patch; I tested InternVL2-1B and InternVL2-2B with that previous code.

I am not sure of the actual number of patches for each image. If the patch count is large, the vision model may take a long time. You can enable logging with pipe = pipeline(..., log_level='INFO') and check the logs.

There will be lines like

2024-09-05 13:04:11,044 - lmdeploy - INFO - ImageEncoder received 24 images, left 24 images.
2024-09-05 13:04:11,044 - lmdeploy - INFO - ImageEncoder process 1 images, left 23 images.
2024-09-05 13:04:11,149 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.105s
2024-09-05 13:04:11,149 - lmdeploy - INFO - ImageEncoder process 1 images, left 22 images.
2024-09-05 13:04:11,251 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.102s
2024-09-05 13:04:11,251 - lmdeploy - INFO - ImageEncoder process 1 images, left 21 images.
2024-09-05 13:04:11,345 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.094s
2024-09-05 13:04:11,346 - lmdeploy - INFO - ImageEncoder process 1 images, left 20 images.
2024-09-05 13:04:11,455 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.109s
2024-09-05 13:04:11,455 - lmdeploy - INFO - ImageEncoder process 1 images, left 19 images.
2024-09-05 13:04:11,554 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.099s
2024-09-05 13:04:11,554 - lmdeploy - INFO - ImageEncoder process 1 images, left 18 images.

irexyc avatar Sep 05 '24 13:09 irexyc

@AmazDeng

The test_code.py is different from the code you posted before. In your previous code you set max_dynamic_patch; I tested InternVL2-1B and InternVL2-2B with that previous code.

I am not sure of the actual number of patches for each image. If the patch count is large, the vision model may take a long time. You can enable logging with pipe = pipeline(..., log_level='INFO') and check the logs.

@irexyc I carefully tested the two pieces of code I uploaded. The first performs inference according to the multi-frame paradigm from InternVL2's official documentation (https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html), while the second follows the video paradigm described in lmdeploy's official documentation (https://lmdeploy.readthedocs.io/en/latest/multi_modal/internvl.html).

  1. The multi-frame paradigm takes a long time, up to 50 seconds, while the video paradigm's inference time is normal and similar to your test results; for 40B-AWQ, inference on a single video takes about 7 seconds.

  2. Checking the log files, I found that the ImageEncoder's per-image processing time in the multi-frame paradigm is more than twice that of the video paradigm. In addition, the multi-frame paradigm requires session_len=65536 (not needed in the video paradigm), so decoding in lmdeploy runs far more steps and takes longer.

multi-frame paradigm

2024-09-06 11:00:50,221 - lmdeploy - INFO - ImageEncoder received 24 images, left 24 images.
2024-09-06 11:00:50,221 - lmdeploy - INFO - ImageEncoder process 1 images, left 23 images.
2024-09-06 11:00:50,362 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.140s
2024-09-06 11:00:50,362 - lmdeploy - INFO - ImageEncoder process 1 images, left 22 images.
2024-09-06 11:00:50,501 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.139s
2024-09-06 11:00:50,502 - lmdeploy - INFO - ImageEncoder process 1 images, left 21 images.

[TM][INFO] ------------------------- step = 55620 -------------------------
[TM][INFO] ------------------------- step = 55630 -------------------------
[TM][INFO] ------------------------- step = 55640 -------------------------
[TM][INFO] ------------------------- step = 55650 -------------------------
[TM][INFO] ------------------------- step = 55660 -------------------------
[TM][INFO] ------------------------- step = 55670 -------------------------
[TM][INFO] ------------------------- step = 55680 -------------------------
[TM][INFO] ------------------------- step = 55690 -------------------------

video paradigm

2024-09-06 11:36:34,196 - lmdeploy - INFO - ImageEncoder received 24 images, left 24 images.
2024-09-06 11:36:34,196 - lmdeploy - INFO - ImageEncoder process 1 images, left 23 images.
2024-09-06 11:36:34,268 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.071s
2024-09-06 11:36:34,268 - lmdeploy - INFO - ImageEncoder process 1 images, left 22 images.
2024-09-06 11:36:34,338 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.070s
2024-09-06 11:36:34,339 - lmdeploy - INFO - ImageEncoder process 1 images, left 21 images.
2024-09-06 11:36:34,411 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.071s
2024-09-06 11:36:34,411 - lmdeploy - INFO - ImageEncoder process 1 images, left 20 images.
2024-09-06 11:36:34,479 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.068s
2024-09-06 11:36:34,480 - lmdeploy - INFO - ImageEncoder process 1 images, left 19 images.
2024-09-06 11:36:34,551 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.071s

[TM][INFO] ------------------------- step = 6370 -------------------------
[TM][INFO] ------------------------- step = 6380 -------------------------
[TM][INFO] ------------------------- step = 6390 -------------------------
[TM][INFO] ------------------------- step = 6400 -------------------------
[TM][INFO] ------------------------- step = 6410 -------------------------
[TM][INFO] ------------------------- step = 6420 -------------------------
[TM][INFO] ------------------------- step = 6430 -------------------------
[TM][INFO] ------------------------- step = 6440 -------------------------
  3. I also noticed that in lmdeploy the ImageEncoder seems to process images serially. If that is correct, this part of the logic could be parallelized, e.g. with multi-threading in a C++ shared object (SO). Of course, I still need to check the code to confirm this assumption.

  4. I feel that the official InternVL2 instructions might mislead users, especially for video inference. Moreover, when I used the multi-frame paradigm the result was incorrect (there is only one person in the video!), while the video paradigm produced the correct result. If that is accurate, the InternVL2 team might consider revising the instructions.

Prompt: Describe this video in detail.

multi-frame paradigm

The video features two women standing in an indoor setting, each holding a sign with text in Chinese. The first woman is wearing a green jacket and has long dark hair, while the second woman is wearing a yellow jacket and has long black hair. Both women are looking directly at the camera with serious expressions.

The first woman is holding a sign with the text "鸭鸭男装官方旗舰店" (Duckman's Men's Clothing Official Store) and the second woman is holding a sign with the text "鸭鸭男装官方旗舰店" (Duckman's Men's Clothing Official Store) as well. The signs have a similar design, with the text in Chinese characters and a logo at the top.

The background of the video shows a modern, minimalist interior with large, gray marble columns and some green plants hanging from the ceiling. The lighting is bright and artificial, creating a clean and professional atmosphere. The women appear to be standing in a lobby or entrance area of a building, possibly a shopping center or a department store.

The video seems to be a promotional or advertising video for the Duckman's Men's Clothing Official Store, showcasing the store's branding and the products they offer. The women are likely trying to convey a sense of professionalism and reliability, as they are both dressed in business attire and holding signs that clearly state the store's name.

video paradigm

The video features a woman standing in a modern, minimalist building with a large, rectangular sign in front of her. The sign is predominantly white with black and red text, and it reads "鸭鸭男装官方旗舰店" in both Chinese and English. The woman is holding the sign with both hands, and she is wearing a green jacket with a hood. She has long, dark hair and is looking directly at the camera with a serious expression.

The background of the video is a modern, sleek interior with large, gray marble walls and a few plants hanging from the ceiling. The lighting is bright and evenly distributed, creating a clean and professional atmosphere. The woman appears to be standing in a spacious area, possibly a lobby or a waiting area, as there are no other people visible in the shot.

Throughout the video, the woman remains in the same position, holding the sign and looking directly at the camera. She does not move or interact with anything in the scene. The video seems to be a promotional or advertising shot for the "鸭鸭男装官方旗舰店," which is likely a clothing store or brand. The overall tone of the video is professional and focused on presenting the store's brand and offerings clearly and effectively

I am now planning to study the source code to see whether I can further accelerate inference. I have two questions:

1. Is the lmdeploy code fully open-source, or only partially open-source, similar to TensorRT-LLM?

2. Can you provide an example of batch inference? This page (https://lmdeploy.readthedocs.io/en/latest/multi_modal/internvl.html) does not show how to do batch inference. I tried duplicating the elements in 'messages' myself, but it still gives an error.

wrong batch inference code

"""
https://lmdeploy.readthedocs.io/en/latest/multi_modal/internvl.html
"""
import numpy as np
from lmdeploy import pipeline, GenerationConfig
from decord import VideoReader, cpu
from lmdeploy.vl.constants import IMAGE_TOKEN
from lmdeploy.vl.utils import encode_image_base64
from PIL import Image
import time
from lmdeploy import TurbomindEngineConfig

pipe = pipeline('/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-2B', backend_config=TurbomindEngineConfig(session_len=8192), log_level='ERROR')


def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices


def load_video(video_path, bound=None, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    pixel_values_list, num_patches_list = [], []
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    imgs = []
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        imgs.append(img)
    return imgs


video_path = '/media/star/8T/tmp/1.mp4'
imgs = load_video(video_path, num_segments=24)

question = ''
for i in range(len(imgs)):
    question = question + f'Frame{i+1}: <img>{IMAGE_TOKEN}</img>\n'

question += 'Is there a person in the video?Answer Yes or No in one word.'
print(f"question={question}")

content = [{'type': 'text', 'text': question}]
for img in imgs:
    content.append({'type': 'image_url', 'image_url': {'max_dynamic_patch': 1, 'url': f'data:image/jpeg;base64,{encode_image_base64(img)}'}})

messages = [dict(role='user', content=content)]*2 #duplicate


start=time.time()
for index in range(5):
    out = pipe(messages, gen_config=GenerationConfig(top_k=1))
    print(f"out={out}")
print(f"average run time is {(time.time()-start)/5} seconds")

AmazDeng avatar Sep 06 '24 06:09 AmazDeng

1 & 2

The internvl docs don't set max_dynamic_patch. One patch of an image costs 256 tokens, so an image costs 256 * 13 tokens if it has 13 patches, and when input_tokens is large the prefill stage takes a lot of time. In your log you can see the large step = 55620 because input_tokens is large.
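
As a rough illustration (the 13-patch figure is the worst case; the actual patch count per frame depends on its aspect ratio):

# Rough token budget for 24 frames; numbers are illustrative only.
tokens_per_patch = 256
frames = 24

worst_case_multi_frame = frames * 13 * tokens_per_patch  # 79,872 tokens if every frame got 13 patches
video_paradigm = frames * 1 * tokens_per_patch           # 6,144 tokens with max_dynamic_patch=1

# The observed step ~= 55,620 corresponds to roughly
# 55_620 / (24 * 256) ~= 9 patches per frame on average.
print(worst_case_multi_frame, video_paradigm)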

3

By default, the vision part runs with a batch of 1 image. You may enlarge the default value (pipe = pipeline(..., vision_config=VisionConfig(max_batch_size=4))), but currently we find that the benefit of batching the vision model is not significant with the pytorch backend.

1. Is the lmdeploy code fully open-source, or only partially open-source, similar to TensorRT-LLM?

Fully open-source.

2. Can you provide an example of batch inference?

out = pipe([messages, messages], gen_config=GenerationConfig(top_k=1)) will do batched inference. You may see logs like [TM][INFO] [Forward] [0, 2), dc=1, pf=1, sum_q=5433, sum_k=5432, max_q=5432, max_k=5451 (the batch is two).
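
For example, reusing the pipe and content objects from the script above, a minimal sketch of batching two independent conversations (rather than duplicating the dict inside a single conversation) could look like this:

from lmdeploy import GenerationConfig

# Each conversation is its own list of message dicts; 'content' is built the
# same way as in the earlier script (a text item plus image_url items).
batch = [
    [dict(role='user', content=content)],
    [dict(role='user', content=content)],
]
outs = pipe(batch, gen_config=GenerationConfig(top_k=1))  # one Response per conversation
for out in outs:
    print(out.text)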

But there is something you should know. The full VLM pipeline is vision + LLM, and the vision part is time consuming when one request has many image inputs. For two requests to be batched in the LLM part, the first request must not finish its LLM part before the second request finishes its vision part.

I recommend using multiple threads to do batched inference with the pipeline.stream_infer API, with each thread handling one request; the engine will automatically batch the LLM part if the above condition is met. The reason I recommend this is that different requests take different amounts of time to complete, and if you submit two requests in one function call you have to wait for both to finish.

irexyc avatar Sep 06 '24 08:09 irexyc

I recommend using multiple threads to do batched inference with the pipeline.stream_infer API, with each thread handling one request; the engine will automatically batch the LLM part if the above condition is met. The reason I recommend this is that different requests take different amounts of time to complete, and if you submit two requests in one function call you have to wait for both to finish.

@irexyc Does this mean:

  1. Using multithreading to send requests to a service concurrently?
  2. Starting a thread pool in a service, with each thread handling one request (batch=1)? For instance, 8 video inferences arrive in one request, the model processes them in 8 threads, and the final results are combined. Is the model's handling of multithreaded inference thread-safe?

Which interpretation is correct?

Here is sample code I wrote. I'm not sure whether it's correct; if not, could you provide sample code to illustrate?

example code

from typing import Dict
from concurrent.futures import ThreadPoolExecutor
from lmdeploy import pipeline, TurbomindEngineConfig


def infer_video(request_json_data: Dict, model):
    res = []
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(infer_single_video, video_url, prompt, model)
                   for (video_url, prompt) in request_json_data["video_urls_prompt"]]
        for future in futures:
            res.append(future.result())
    return res


def infer_single_video(video_url: str, prompt, model):
    # infer a single video
    model.stream_infer(...)


if __name__ == '__main__':
    request_json_data = {...}
    model = pipeline(model_path, backend_config=TurbomindEngineConfig(session_len=8192), log_level='ERROR')

    res = infer_video(request_json_data, model)

AmazDeng avatar Sep 06 '24 11:09 AmazDeng

In my opinion, using multithreading or a thread pool amounts to the same thing; the point is that each thread processes one request.

It is worth noting that your code does not take advantage of the streaming feature of the stream_infer API. If you don't need streaming output, you can just call model(...).

irexyc avatar Sep 06 '24 12:09 irexyc

In my opinion, using multithreading or a thread pool amounts to the same thing; the point is that each thread processes one request.

It is worth noting that your code does not take advantage of the streaming feature of the stream_infer API. If you don't need streaming output, you can just call model(...).

@irexyc Alright, I will test the multithreaded code later. If any questions come up, I will follow up with you. Thank you for your consistent support and detailed answers.

AmazDeng avatar Sep 09 '24 02:09 AmazDeng

@irexyc I want to do offline video batch inference with InternVL2-40B-AWQ + lmdeploy, similar to @AmazDeng. Based on the conversation above, may I assume that if the streaming feature is not used, sending requests in parallel with multithreading is close in speed to native batch inference?

hkunzhe avatar Sep 23 '24 09:09 hkunzhe

@hkunzhe

The streaming feature has little impact on performance.

Using multiple threads has one advantage: since each request has a different input/output token length, each takes a different amount of time. When a request finishes, we can start a new thread and process the next request. With the batch API, you have to wait for all requests to finish before sending the next batch.

irexyc avatar Sep 23 '24 09:09 irexyc

@irexyc Thanks for your quick reply! BTW, when using InternVL2-40B-AWQ+lmdeploy, I noticed that

Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. No dtype was provided, you should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator.

Will this affect the speed of inference? My environment:

lmdeploy                  0.6.0
torch                     2.3.1+cu118
transformers              4.37.2

hkunzhe avatar Sep 23 '24 09:09 hkunzhe

@hkunzhe For the vision part of VLM models, we reuse the transformers code and run inference in torch.float16. For the LLM part, we don't use the flash-attn package. You can ignore this warning.

irexyc avatar Sep 23 '24 10:09 irexyc

Multi-threading is clearly faster than native batch inference and also gives higher GPU utilization. @AmazDeng provided a nice demo code, but you need to add VisionConfig(thread_safe=True) when initializing the pipeline and use concurrent.futures.as_completed to handle the asynchronous tasks.
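
A minimal sketch of that combination (the model path, frame files and prompts are placeholders, and it uses the blocking pipe(...) call per irexyc's earlier note instead of stream_infer):

from concurrent.futures import ThreadPoolExecutor, as_completed
from lmdeploy import pipeline, TurbomindEngineConfig, VisionConfig, GenerationConfig
from lmdeploy.vl import load_image

pipe = pipeline('/path/to/InternVL2-40B-AWQ',                   # placeholder path
                backend_config=TurbomindEngineConfig(session_len=16384),
                vision_config=VisionConfig(thread_safe=True),   # per the note above
                log_level='ERROR')

def infer_one(prompt, frame_paths):
    imgs = [load_image(p) for p in frame_paths]
    return pipe([(prompt, imgs)], gen_config=GenerationConfig(top_k=1))[0].text

# Placeholder request list: (prompt, sampled frame files) per video.
requests = [('Is a person in the video? Answer Yes or No.', ['v0_f0.jpg', 'v0_f1.jpg']),
            ('Is a person in the video? Answer Yes or No.', ['v1_f0.jpg', 'v1_f1.jpg'])]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(infer_one, p, f): i for i, (p, f) in enumerate(requests)}
    for fut in as_completed(futures):           # collect results as each video finishes
        print(futures[fut], fut.result())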

hkunzhe avatar Sep 24 '24 01:09 hkunzhe

I want to do offline video batch inference with InternVL2-40B-AWQ + lmdeploy, similar to @AmazDeng. Based on the conversation above, may I assume that if the streaming feature is not used, sending requests in parallel with multithreading is close in speed to native batch inference?

May I ask roughly how much GPU memory is needed to run InternVL2-40B-AWQ?

FAFUuser avatar Dec 04 '24 06:12 FAFUuser

May I ask roughly how much GPU memory is needed to run InternVL2-40B-AWQ?

It needs an A100 80G; with that it can be launched.

AmazDeng avatar Dec 04 '24 06:12 AmazDeng