InternVL2-40B-AWQ + lmdeploy video inference speed is very slow
I am using the InternVL2-40B-AWQ model and performing video inference according to the multi-image inference paradigm. Each video is sampled into 24 frames, and the prompt is shown below. My questions are:
1. When performing video inference, should the video be processed according to the multi-image paradigm? Is this method correct? Is there a specific prompt format for video inference? On the official website (https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html#), I only see inference for single images and multiple images; I haven't found inference code specifically for videos.
2. During video inference (following the multi-image paradigm) with num_segments=24, I encounter an error: lmdeploy - ERROR - Truncate max_new_tokens to 128 when session_len=8192. Through repeated testing, I found that session_len=65536 is necessary for correct inference. How many tokens does a single image actually consume before being fed into the language model? According to this issue (https://github.com/OpenGVLab/InternVL/issues/381), the model's max_seq_length is at most 8192, so why can I set session_len=65536? Related link: https://github.com/InternLM/lmdeploy/issues/2382.
3. During video inference (following the multi-image paradigm) with num_segments=24, I noticed that inference with lmdeploy takes about 60 seconds, while transformers inference takes around 9 seconds. Why is this? Is there room for optimization? Why is lmdeploy significantly slower than the PyTorch (transformers) path?
code demo
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy import GenerationConfig
import time
target_images=["2.jpg","6.jpg","11.jpg","15.jpg","20.jpg","24.jpg","28.jpg","33.jpg","37.jpg","41.jpg","46.jpg","50.jpg","54.jpg","59.jpg","63.jpg","68.jpg","72.jpg","76.jpg","81.jpg","85.jpg","90.jpg","94.jpg","98.jpg","103.jpg"]
# 2, 6, 11, 15, 20, 24, 28, 33, 37, 41, 46, 50, 54,
# 59, 63, 68, 72, 76, 81, 85, 90, 94, 98, 103
pixel_values_list=[]
for ele in target_images:
    img = load_image(f"./image/{ele}")
    pixel_values_list.append(img)
question="Frame1: <img>{IMAGE_TOKEN}</img>\nFrame2: <img>{IMAGE_TOKEN}</img>\nFrame3: <img>{IMAGE_TOKEN}</img>\nFrame4: <img>{IMAGE_TOKEN}</img>\nFrame5: <img>{IMAGE_TOKEN}</img>\nFrame6: <img>{IMAGE_TOKEN}</img>\nFrame7: <img>{IMAGE_TOKEN}</img>\nFrame8: <img>{IMAGE_TOKEN}</img>\nFrame9: <img>{IMAGE_TOKEN}</img>\nFrame10: <img>{IMAGE_TOKEN}</img>\nFrame11: <img>{IMAGE_TOKEN}</img>\nFrame12: <img>{IMAGE_TOKEN}</img>\nFrame13: <img>{IMAGE_TOKEN}</img>\nFrame14: <img>{IMAGE_TOKEN}</img>\nFrame15: <img>{IMAGE_TOKEN}</img>\nFrame16: <img>{IMAGE_TOKEN}</img>\nFrame17: <img>{IMAGE_TOKEN}</img>\nFrame18: <img>{IMAGE_TOKEN}</img>\nFrame19: <img>{IMAGE_TOKEN}</img>\nFrame20: <img>{IMAGE_TOKEN}</img>\nFrame21: <img>{IMAGE_TOKEN}</img>\nFrame22: <img>{IMAGE_TOKEN}</img>\nFrame23: <img>{IMAGE_TOKEN}</img>\nFrame24: <img>{IMAGE_TOKEN}</img>\nIs a person in the video?Answer Yes or No in one word."
model_path = '/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-40B-AWQ'
#model_path = '/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-1B'
model = pipeline(model_path, backend_config=TurbomindEngineConfig(session_len=65536))
start=time.time()
response_l = model([(question, pixel_values_list)],gen_config=GenerationConfig(max_new_tokens=1024,top_p=1))
print(response_l[0].text)
print(f"run time is {time.time()-start}")
Images: image.zip
@czczup @whai362 @ErfeiCui @hjh0119 @lvhan028 @Adushar @Weiyun1025 @cg1177 @opengvlab-admin @qishisuren123 @dlutwy Could you please take a look at this issue?
@AmazDeng
For video understanding with InternVL2, you can refer to these docs (the video multi-round conversation part).
To see how many input tokens are used, you can turn on logging with pipe = pipeline('...', log_level='INFO'). For InternVL2, a single image with one patch uses 256 tokens. By default, a single image will have 1~13 patches depending on its aspect ratio. Reducing max_dynamic_patch will speed up inference, and the official video demo also uses one patch per image in a video clip; see the rough arithmetic below.
related issue https://github.com/InternLM/lmdeploy/issues/2260
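For illustration, a rough back-of-the-envelope token budget under the numbers above (256 tokens per patch, 1~13 patches per image); these are estimates, not exact prompt lengths:

```python
frames = 24
tokens_per_patch = 256

# With max_dynamic_patch=1: 24 * 1 * 256 = 6144 vision tokens (plus text tokens),
# which fits within session_len=8192.
print(frames * 1 * tokens_per_patch)

# Without limiting patches, each frame may be split into up to 13 patches:
# 24 * 13 * 256 = 79872 vision tokens, which is why a much larger session_len is needed.
print(frames * 13 * tokens_per_patch)
```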
@irexyc
The video multi-round conversation example and the code I provided are quite similar. I also ran tests following the video multi-round conversation example with InternVL2-1B, and the inference time was still approximately 40 seconds. This indicates that inference with lmdeploy is significantly slower than with transformers.
I've uploaded the video and code, so you can also test it on your own machine.
test code
import numpy as np
from lmdeploy import pipeline, GenerationConfig
from decord import VideoReader, cpu
from lmdeploy.vl.constants import IMAGE_TOKEN
from lmdeploy.vl.utils import encode_image_base64
from PIL import Image
import time
from lmdeploy import TurbomindEngineConfig
pipe = pipeline('/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-1B',backend_config=TurbomindEngineConfig(session_len=65536))
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    imgs = []
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        imgs.append(img)
    return imgs
video_path = '/media/star/8T/tmp/1.mp4'
imgs = load_video(video_path, num_segments=24)
question = ''
for i in range(len(imgs)):
    question = question + f'Frame{i+1}: <img>{IMAGE_TOKEN}</img>\n'
question += 'Is there a person in the video?Answer Yes or No in one word.'
print(f"question={question}")
content = [{'type': 'text', 'text': question}]
for img in imgs:
    content.append({'type': 'image_url', 'image_url': {'max_dynamic_patch': 1, 'url': f'data:image/jpeg;base64,{encode_image_base64(img)}'}})
messages = [dict(role='user', content=content)]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='Describe this video in detail. Don\'t repeat.'))
start = time.time()
for index in range(5):
    out = pipe(messages, gen_config=GenerationConfig(top_k=1))
    print(f"out={out}")
print(f"average run time is {(time.time()-start)/5} seconds")
https://github.com/user-attachments/assets/25bbb29a-28c0-498c-9ff9-78c050b9c3c6
In my test, InternVL2-1B (which cannot use the turbomind backend and falls back to the pytorch backend) takes an average of 5.4 s. InternVL2-2B takes an average of 1.8 s.
@irexyc May I ask whether you tested with the code and images I provided earlier? The InternVL2-1B run I tested takes 50 seconds. What could be the reason? Test code:
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy import GenerationConfig
import time
target_images = ["2.jpg", "6.jpg", "11.jpg", "15.jpg", "20.jpg", "24.jpg", "28.jpg", "33.jpg", "37.jpg", "41.jpg",
"46.jpg", "50.jpg", "54.jpg", "59.jpg", "63.jpg", "68.jpg", "72.jpg", "76.jpg", "81.jpg", "85.jpg",
"90.jpg", "94.jpg", "98.jpg", "103.jpg"]
pixel_values_list = []
for ele in target_images:
    img = load_image(f"./image/{ele}")
    pixel_values_list.append(img)
question = "Frame1: <img>{IMAGE_TOKEN}</img>\nFrame2: <img>{IMAGE_TOKEN}</img>\nFrame3: <img>{IMAGE_TOKEN}</img>\nFrame4: <img>{IMAGE_TOKEN}</img>\nFrame5: <img>{IMAGE_TOKEN}</img>\nFrame6: <img>{IMAGE_TOKEN}</img>\nFrame7: <img>{IMAGE_TOKEN}</img>\nFrame8: <img>{IMAGE_TOKEN}</img>\nFrame9: <img>{IMAGE_TOKEN}</img>\nFrame10: <img>{IMAGE_TOKEN}</img>\nFrame11: <img>{IMAGE_TOKEN}</img>\nFrame12: <img>{IMAGE_TOKEN}</img>\nFrame13: <img>{IMAGE_TOKEN}</img>\nFrame14: <img>{IMAGE_TOKEN}</img>\nFrame15: <img>{IMAGE_TOKEN}</img>\nFrame16: <img>{IMAGE_TOKEN}</img>\nFrame17: <img>{IMAGE_TOKEN}</img>\nFrame18: <img>{IMAGE_TOKEN}</img>\nFrame19: <img>{IMAGE_TOKEN}</img>\nFrame20: <img>{IMAGE_TOKEN}</img>\nFrame21: <img>{IMAGE_TOKEN}</img>\nFrame22: <img>{IMAGE_TOKEN}</img>\nFrame23: <img>{IMAGE_TOKEN}</img>\nFrame24: <img>{IMAGE_TOKEN}</img>\nIs a person in the video?Answer Yes or No in one word."
model_path = '/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-1B'
model = pipeline(model_path, backend_config=TurbomindEngineConfig(session_len=65536))
start = time.time()
inter_num=10
for _ in range(inter_num):
    response_l = model([(question, pixel_values_list)], gen_config=GenerationConfig(max_new_tokens=1024, top_p=1))
    print(response_l[0].text)
print(f"run time is {round((time.time() - start)/10,2)} seconds")
@AmazDeng
The test_code.py is different from the code you posted before; in your previous code you set max_dynamic_patch. I tested InternVL2-1B and InternVL2-2B with your previous code.
I am not sure how many patches each image is actually split into. If the patch count is large, the vision model can take a long time. You can enable logging with pipe = pipeline(..., log_level='INFO') and check the logs.
There will be lines like
2024-09-05 13:04:11,044 - lmdeploy - INFO - ImageEncoder received 24 images, left 24 images.
2024-09-05 13:04:11,044 - lmdeploy - INFO - ImageEncoder process 1 images, left 23 images.
2024-09-05 13:04:11,149 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.105s
2024-09-05 13:04:11,149 - lmdeploy - INFO - ImageEncoder process 1 images, left 22 images.
2024-09-05 13:04:11,251 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.102s
2024-09-05 13:04:11,251 - lmdeploy - INFO - ImageEncoder process 1 images, left 21 images.
2024-09-05 13:04:11,345 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.094s
2024-09-05 13:04:11,346 - lmdeploy - INFO - ImageEncoder process 1 images, left 20 images.
2024-09-05 13:04:11,455 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.109s
2024-09-05 13:04:11,455 - lmdeploy - INFO - ImageEncoder process 1 images, left 19 images.
2024-09-05 13:04:11,554 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.099s
2024-09-05 13:04:11,554 - lmdeploy - INFO - ImageEncoder process 1 images, left 18 images.
@irexyc I carefully tested the two pieces of code I uploaded. The first performs inference according to the multi-frame paradigm from InternVL2's official documentation (https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html), while the second follows the video paradigm from lmdeploy's official documentation (https://lmdeploy.readthedocs.io/en/latest/multi_modal/internvl.html).
- The multi-frame paradigm takes a long time for inference, up to 50 seconds. The video paradigm inference time is normal and similar to your test results; for 40B-AWQ, inference on a single video takes about 7 seconds.
- By checking the log files, I found that the ImageEncoder processing time per image in the multi-frame paradigm is more than twice that of the video paradigm. Additionally, the multi-frame paradigm requires setting session_len=65536 (which is not needed in the video paradigm), causing the steps in lmdeploy to take longer.
multi-frame paradigm
2024-09-06 11:00:50,221 - lmdeploy - INFO - ImageEncoder received 24 images, left 24 images.
2024-09-06 11:00:50,221 - lmdeploy - INFO - ImageEncoder process 1 images, left 23 images.
2024-09-06 11:00:50,362 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.140s
2024-09-06 11:00:50,362 - lmdeploy - INFO - ImageEncoder process 1 images, left 22 images.
2024-09-06 11:00:50,501 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.139s
2024-09-06 11:00:50,502 - lmdeploy - INFO - ImageEncoder process 1 images, left 21 images.
[TM][INFO] ------------------------- step = 55620 -------------------------
[TM][INFO] ------------------------- step = 55630 -------------------------
[TM][INFO] ------------------------- step = 55640 -------------------------
[TM][INFO] ------------------------- step = 55650 -------------------------
[TM][INFO] ------------------------- step = 55660 -------------------------
[TM][INFO] ------------------------- step = 55670 -------------------------
[TM][INFO] ------------------------- step = 55680 -------------------------
[TM][INFO] ------------------------- step = 55690 -------------------------
video paradigm
2024-09-06 11:36:34,196 - lmdeploy - INFO - ImageEncoder received 24 images, left 24 images.
2024-09-06 11:36:34,196 - lmdeploy - INFO - ImageEncoder process 1 images, left 23 images.
2024-09-06 11:36:34,268 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.071s
2024-09-06 11:36:34,268 - lmdeploy - INFO - ImageEncoder process 1 images, left 22 images.
2024-09-06 11:36:34,338 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.070s
2024-09-06 11:36:34,339 - lmdeploy - INFO - ImageEncoder process 1 images, left 21 images.
2024-09-06 11:36:34,411 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.071s
2024-09-06 11:36:34,411 - lmdeploy - INFO - ImageEncoder process 1 images, left 20 images.
2024-09-06 11:36:34,479 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.068s
2024-09-06 11:36:34,480 - lmdeploy - INFO - ImageEncoder process 1 images, left 19 images.
2024-09-06 11:36:34,551 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 0.071s
[TM][INFO] ------------------------- step = 6370 -------------------------
[TM][INFO] ------------------------- step = 6380 -------------------------
[TM][INFO] ------------------------- step = 6390 -------------------------
[TM][INFO] ------------------------- step = 6400 -------------------------
[TM][INFO] ------------------------- step = 6410 -------------------------
[TM][INFO] ------------------------- step = 6420 -------------------------
[TM][INFO] ------------------------- step = 6430 -------------------------
[TM][INFO] ------------------------- step = 6440 -------------------------
- I also noticed that in lmdeploy, the ImageEncoder seems to process images serially. If my assumption is correct, this part of the logic could be optimized with multi-threading in a C++ shared object (.so). Of course, I need to check the code to confirm whether my assumption is correct.
- I feel that the official instructions provided by InternVL2 might mislead users, especially for video inference. Moreover, when I used the multi-frame paradigm, the results were incorrect (there is only one person in the video!), while the video paradigm produced the correct results. If what I'm saying is accurate, the InternVL2 team might consider revising the instructions.
Prompt: Describe this video in detail.
multi-frame paradigm
The video features two women standing in an indoor setting, each holding a sign with text in Chinese. The first woman is wearing a green jacket and has long dark hair, while the second woman is wearing a yellow jacket and has long black hair. Both women are looking directly at the camera with serious expressions.
The first woman is holding a sign with the text "鸭鸭男装官方旗舰店" (Duckman's Men's Clothing Official Store) and the second woman is holding a sign with the text "鸭鸭男装官方旗舰店" (Duckman's Men's Clothing Official Store) as well. The signs have a similar design, with the text in Chinese characters and a logo at the top.
The background of the video shows a modern, minimalist interior with large, gray marble columns and some green plants hanging from the ceiling. The lighting is bright and artificial, creating a clean and professional atmosphere. The women appear to be standing in a lobby or entrance area of a building, possibly a shopping center or a department store.
The video seems to be a promotional or advertising video for the Duckman's Men's Clothing Official Store, showcasing the store's branding and the products they offer. The women are likely trying to convey a sense of professionalism and reliability, as they are both dressed in business attire and holding signs that clearly state the store's name.
video paradigm
The video features a woman standing in a modern, minimalist building with a large, rectangular sign in front of her. The sign is predominantly white with black and red text, and it reads "鸭鸭男装官方旗舰店" in both Chinese and English. The woman is holding the sign with both hands, and she is wearing a green jacket with a hood. She has long, dark hair and is looking directly at the camera with a serious expression.

The background of the video is a modern, sleek interior with large, gray marble walls and a few plants hanging from the ceiling. The lighting is bright and evenly distributed, creating a clean and professional atmosphere. The woman appears to be standing in a spacious area, possibly a lobby or a waiting area, as there are no other people visible in the shot.

Throughout the video, the woman remains in the same position, holding the sign and looking directly at the camera. She does not move or interact with anything in the scene. The video seems to be a promotional or advertising shot for the "鸭鸭男装官方旗舰店," which is likely a clothing store or brand. The overall tone of the video is professional and focused on presenting the store's brand and offerings clearly and effectively.
I am now planning to study the source code to see if I can further accelerate the inference process. I have two questions: 1. Is the lmdeploy code fully open-source, or only partially open-source, similar to TensorRT-LLM? 2. Can you provide an example of batch inference? This page (https://lmdeploy.readthedocs.io/en/latest/multi_modal/internvl.html) does not provide a method for batch inference. I tried duplicating the elements in 'messages' myself, but it still gives me an error.
Incorrect batch inference code:
"""
https://lmdeploy.readthedocs.io/en/latest/multi_modal/internvl.html
"""
import numpy as np
from lmdeploy import pipeline, GenerationConfig
from decord import VideoReader, cpu
from lmdeploy.vl.constants import IMAGE_TOKEN
from lmdeploy.vl.utils import encode_image_base64
from PIL import Image
import time
from lmdeploy import TurbomindEngineConfig
pipe = pipeline('/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-2B', backend_config=TurbomindEngineConfig(session_len=8192), log_level='ERROR')
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    imgs = []
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        imgs.append(img)
    return imgs
video_path = '/media/star/8T/tmp/1.mp4'
imgs = load_video(video_path, num_segments=24)
question = ''
for i in range(len(imgs)):
    question = question + f'Frame{i+1}: <img>{IMAGE_TOKEN}</img>\n'
question += 'Is there a person in the video?Answer Yes or No in one word.'
print(f"question={question}")
content = [{'type': 'text', 'text': question}]
for img in imgs:
    content.append({'type': 'image_url', 'image_url': {'max_dynamic_patch': 1, 'url': f'data:image/jpeg;base64,{encode_image_base64(img)}'}})
messages = [dict(role='user', content=content)] * 2  # duplicate
start = time.time()
for index in range(5):
    out = pipe(messages, gen_config=GenerationConfig(top_k=1))
    print(f"out={out}")
print(f"average run time is {(time.time()-start)/5} seconds")
1 & 2
The InternVL docs don't set max_dynamic_patch. One patch per image costs 256 tokens, so an image with 13 patches costs 256 * 13 tokens. If the input token count is large, the prefill stage takes a long time. In your log you can see a large step = 55620 because the input is that long.
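As a rough sanity check on the numbers above (assuming 256 tokens per patch and the 24 frames from the log), the observed step count implies roughly 9 patches per frame:

```python
# step ≈ 55620 prompt tokens; 55620 / 24 frames / 256 tokens-per-patch ≈ 9 patches per frame,
# i.e. the frames were split into many dynamic patches because max_dynamic_patch was not limited.
print(55620 / 24 / 256)  # ~9.05
```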
3
By default, the vision part runs with a batch of 1 image. You may enlarge the default value (pipe = pipeline(..., vision_config=VisionConfig(max_batch_size=4))), but currently we found that the benefits of batching the vision model are not significant with the pytorch backend.
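A minimal sketch of that pipeline setup with an enlarged vision batch (the model path and session_len are just the values used earlier in this thread, not recommendations):

```python
from lmdeploy import pipeline, TurbomindEngineConfig, VisionConfig

pipe = pipeline(
    '/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-40B-AWQ',
    backend_config=TurbomindEngineConfig(session_len=65536),
    vision_config=VisionConfig(max_batch_size=4),  # forward up to 4 images per vision batch
    log_level='INFO',                              # prints the ImageEncoder timing lines shown above
)
```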
1. Is the lmdeploy code fully open-source, or is it only partially open-source, similar to TensorRT-LLM?
Fully open-source.
2. Can you provide an example of batch inference?
out = pipe([messages, messages], gen_config=GenerationConfig(top_k=1)) will do batched inference. You may see logs like [TM][INFO] [Forward] [0, 2), dc=1, pf=1, sum_q=5433, sum_k=5432, max_q=5432, max_k=5451 (the batch size is two).
But there is something you should know. The full VLM pipeline is vision + LLM. The vision part is time-consuming if one request has many image inputs. If you want to batch two requests in the LLM, the first request should not finish its LLM part before the second request finishes its vision part.
I recommend using multi-threading to do batched inference with the pipeline.stream_infer API, with each thread handling one request; the engine will automatically batch the LLM part if the above condition is met. The reason I recommend this approach is that different requests may take different amounts of time to complete, and if you submit two requests in one function call, you have to wait for both to finish.
@irexyc Does this sentence mean:
- Using multi-threading to send requests to a service concurrently?
- Starting a thread pool inside the service, with each thread handling one request (batch=1)? For instance, 8 video inferences are sent in one request, the model processes each of them in 8 threads, and the final results are combined. Is the model's handling of multi-threaded inference thread-safe?
Which interpretation is correct?
This is sample code I wrote. I'm not sure if it's correct; if it's not, could you provide sample code to illustrate?
example code
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

from lmdeploy import pipeline, TurbomindEngineConfig

def infer_video(request_json_data: Dict, model):
    res = []
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(infer_single_video, video_url, prompt, model)
                   for (video_url, prompt) in request_json_data["video_urls_prompt"]]
        for future in futures:
            res.append(future.result())
    return res

def infer_single_video(video_url: str, prompt, model):
    # infer a single video
    model.stream_infer(...)

if __name__ == '__main__':
    request_json_data = {...}
    model = pipeline(model_path, backend_config=TurbomindEngineConfig(session_len=8192), log_level='ERROR')
    res = infer_video(request_json_data, model)
In my opinion, using multi-threading or a thread pool is the same thing; either way, each thread processes one request.
It is worth noting that your code does not take advantage of the streaming feature of the stream_infer API. If you don't need streaming output, you can just call model(...).
@irexyc Alright, I will test the multi-threaded code later; if I have any further questions, I will follow up with you. Thank you for your consistent support and detailed answers.
@irexyc I want to do offline video batch inference with InternVL2-40B-AWQ + lmdeploy, similar to @AmazDeng. Based on your conversation, may I assume that if the streaming feature is not used, sending requests in parallel with multi-threading is close in speed to native batch inference?
@hkunzhe
The streaming feature has little impact on performance.
Using multi-threading has one advantage: since each request has a different input/output token length, each takes a different amount of time. When one request finishes, we can start a new thread and process the next request. But if you use the batch API, you have to wait for all requests to finish before you can send the next batch.
@irexyc Thanks for your quick reply! BTW, when using InternVL2-40B-AWQ + lmdeploy, I noticed the following warning:
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. No dtype was provided, you should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator.
Will this affect the speed of inference? My environment:
lmdeploy 0.6.0
torch 2.3.1+cu118
transformers 4.37.2
@hkunzhe For the vision part of VLM models, we reuse the transformers code and run inference in torch.float16. For the LLM part, we don't use the flash-attn package. You can ignore this warning.
Multi-threading is clearly faster than native batch inference and also gives higher GPU utilization. @AmazDeng provided a nice demo, but you need to add VisionConfig(thread_safe=True) when initializing the pipeline and use concurrent.futures.as_completed to handle the asynchronous tasks; a sketch is below.
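A minimal sketch of that setup, assuming a build_messages helper (hypothetical, standing in for the earlier video-paradigm message construction) and the model path used earlier in the thread:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

from lmdeploy import GenerationConfig, TurbomindEngineConfig, VisionConfig, pipeline

pipe = pipeline(
    '/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-40B-AWQ',
    backend_config=TurbomindEngineConfig(session_len=8192),
    vision_config=VisionConfig(thread_safe=True),  # required for concurrent requests
)

def infer_single_video(video_url, prompt):
    # Build the messages for one video exactly as in the video-paradigm code above
    # (frame extraction + per-frame image_url entries), then run one batch=1 request;
    # the engine batches the LLM part of concurrent requests automatically.
    messages = build_messages(video_url, prompt)  # hypothetical helper
    return pipe(messages, gen_config=GenerationConfig(top_k=1))

def infer_videos(video_urls_prompt):
    results = []
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(infer_single_video, url, prompt): url
                   for url, prompt in video_urls_prompt}
        for future in as_completed(futures):  # collect each result as soon as it is ready
            results.append((futures[future], future.result()))
    return results
```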
I want to do offline video batch inference with InternVL2-40B-AWQ + lmdeploy, similar to @AmazDeng. Based on your conversation, may I assume that if the streaming feature is not used, sending requests in parallel with multi-threading is close in speed to native batch inference?
May I ask roughly how much GPU memory is needed to launch InternVL2-40B-AWQ?
You need an A100 80GB; with that it can be launched.