InternLM-XComposer

Inference time

Open vietpho opened this issue 1 year ago • 4 comments

Hi,

Thanks a lot for sharing this amazing work. I'm quite new to LLMs and VLMs, and I tried running your 'InternLM-XComposer-2.5' model. I followed the provided 'InternLM-XComposer-2.5 with Transformers, Multi-Image Multi-Turn Dialog' code exactly, using the example with four car images. However, it took a very long time to get the inference results.

My system specs are: 128GB RAM and an Nvidia GeForce RTX 3090 GPU with 24GB VRAM, and I am running the code on WSL. I noticed that the GPU's dedicated memory was almost fully utilized, that nearly 50GB of shared GPU memory was in use, and that my RAM usage was around 30-50GB.

Given my system specifications, I'm wondering if it's normal for inference to take this long. The Hugging Face demo you provided outputs results much faster. Could you share the specifications used for that demo?

Thanks!

vietpho avatar Jul 24 '24 06:07 vietpho

Our online demo runs on a single A100 GPU with 80GB of memory. We also tested the Multi-Image Multi-Turn Dialog scenario on the same hardware using the script below.

import time

import nvidia_smi
import torch
from transformers import AutoModel, AutoTokenizer

# initialize NVML so GPU memory can be queried after inference
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True)
model.tokenizer = tokenizer

# first turn: three images in a single query
query = 'Image1 <ImageHere>; Image2 <ImageHere>; Image3 <ImageHere>; I want to buy a car from the three given cars, analyze their advantages and weaknesses one by one'
image = ['./examples/cars1.jpg',
         './examples/cars2.jpg',
         './examples/cars3.jpg',]

start_time = time.time()
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, his = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
first_time = time.time()

# second turn: append a fourth image and reuse the history from the first turn
query = 'Image4 <ImageHere>; How about the car in Image4'
image.append('./examples/cars4.jpg')
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, history=his, use_meta=True)
print(response)
second_time = time.time()

# query GPU memory after both turns, then release NVML
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print("Total memory:", info.total / 1024 / 1024 / 1024)
print("Free memory:", info.free / 1024 / 1024 / 1024)
print("Used memory:", info.used / 1024 / 1024 / 1024)
nvidia_smi.nvmlShutdown()

print(f'Time of the first infer: {first_time - start_time:.3f}')
print(f'Time of the second infer: {second_time - first_time:.3f}')

The measured GPU memory usage (in GB) and inference times (in seconds) are as follows:

Total memory: 80.0
Free memory: 26.06011962890625
Used memory: 53.93988037109375
Time of the first infer: 9.063
Time of the second infer: 35.399

We found that the version of the transformers package can affect memory usage and processing time. Please install version 4.33.1 with the following command:

pip install transformers==4.33.1
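
If it is useful, a quick sanity check (not part of the official instructions) to confirm which version is actually active in your environment:

import transformers

# should report 4.33.1 after the pip install above
print(transformers.__version__)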

yhcao6 avatar Jul 24 '24 06:07 yhcao6

Thanks a lot for your response! I have two more questions:

  1. Do you know why the second inference takes longer than the first one?
  2. What is the maximum image size that the model can accept as input? Does the model automatically resize images internally?

vietpho avatar Jul 24 '24 08:07 vietpho

  1. The second inference takes longer because of the accumulated context (it carries over the history ("his") from the first query), the additional image processing (a fourth image is added), and potential GPU memory management overhead; a rough sketch of how to gauge the carried-over context follows after this list.
  2. The model can support 4K resolution and beyond; the practical limit depends on your GPU memory. For detailed information on the dynamic resolution mechanism, please refer to the paper: https://arxiv.org/pdf/2407.03320
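
As a rough illustration of the first point only (this is not from the authors' code, and it assumes "his" is a list of (query, response) string pairs, which may not match the exact structure returned by this model's chat method), one could count how many text tokens the first turn carries into the second turn's prompt:

# Hypothetical sketch: estimate the text context carried into the second turn.
# Assumes `his` is a list of (query, response) string pairs; the structure
# actually returned by model.chat may differ.
def approx_history_tokens(tokenizer, history):
    total = 0
    for query, response in history:
        total += len(tokenizer(query).input_ids)
        total += len(tokenizer(response).input_ids)
    return total

print('approx. text tokens carried into turn 2:',
      approx_history_tokens(tokenizer, his))

Note that the fourth image's visual tokens come on top of this text context, so the effective prompt for the second turn is considerably longer than the first.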

yhcao6 avatar Jul 24 '24 08:07 yhcao6

Hi, sorry to bother you again! My GPU has only 24GB of memory, so I want to use the 4-bit model. I have read your instructions and tried to run inference with multi-image input using the 4-bit model, based on the following two code blocks, but I encountered an error. (I have downloaded all the checkpoints and necessary files locally, and both the images and the model checkpoints load correctly.)

Could you please provide example inference code that works with multi-image input using the 4-bit model? Also, is it possible to use multi-turn dialog with the 4-bit model?

Here are the two code blocks I referenced:

from lmdeploy import TurbomindEngineConfig, pipeline
from lmdeploy.vl import load_image
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline('internlm/internlm-xcomposer2d5-7b-4bit', backend_config=engine_config)
image = load_image('examples/dubai.png')
response = pipe(('describe this image', image))
print(response.text)

from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl.constants import IMAGE_TOKEN
from lmdeploy.vl import load_image

query = f'Image1 {IMAGE_TOKEN}; Image2 {IMAGE_TOKEN}; Image3 {IMAGE_TOKEN}; I want to buy a car from the three given cars, analyze their advantages and weaknesses one by one'

urls = [
    'https://raw.githubusercontent.com/InternLM/InternLM-XComposer/main/examples/cars1.jpg',
    'https://raw.githubusercontent.com/InternLM/InternLM-XComposer/main/examples/cars2.jpg',
    'https://raw.githubusercontent.com/InternLM/InternLM-XComposer/main/examples/cars3.jpg'
]
images = [load_image(url) for url in urls]

pipe = pipeline('internlm/internlm-xcomposer2d5-7b', log_level='INFO')
output = pipe((query, images), gen_config=GenerationConfig(top_k=0, top_p=0.8, random_seed=89247526689433939))

Here is the code I wrote based on the above two code blocks:

from lmdeploy import TurbomindEngineConfig, pipeline, GenerationConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN
import os

os.chdir(os.path.dirname(os.path.abspath(__file__)))

model_path = '/workspace/InternLM-XComposer/xcomp_4bit_files'

engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline(model_path, backend_config=engine_config)

img_paths = [
    './examples/cars1.jpg',
    './examples/cars2.jpg',
    './examples/cars3.jpg'
]

images = [load_image(img_path) for img_path in img_paths]

query = f'Image1 {IMAGE_TOKEN}; Image2 {IMAGE_TOKEN}; Image3 {IMAGE_TOKEN}; I want to buy a car from the three given cars, analyze their advantages and weaknesses one by one'

output = pipe((query, images), gen_config=GenerationConfig(top_k=0, top_p=0.8, random_seed=89247526689433939))

print(output.text)

Here is the error I get when running the above code:

Traceback (most recent call last):
  File "/workspace/InternLM-XComposer/test_lmdeploy.py", line 42, in <module>
    output = pipe((query, images), gen_config=GenerationConfig(top_k=0, top_p=0.8, random_seed=89247526689433939))
  File "/usr/local/venv/lib/python3.9/site-packages/lmdeploy/serve/vl_async_engine.py", line 123, in __call__
    return super().__call__(prompts, **kwargs)
  File "/usr/local/venv/lib/python3.9/site-packages/lmdeploy/serve/async_engine.py", line 304, in __call__
    return self.batch_infer(prompts,
  File "/usr/local/venv/lib/python3.9/site-packages/lmdeploy/serve/vl_async_engine.py", line 109, in batch_infer
    return super().batch_infer(prompts, **kwargs)
  File "/usr/local/venv/lib/python3.9/site-packages/lmdeploy/serve/async_engine.py", line 428, in batch_infer
    _get_event_loop().run_until_complete(gather())
  File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/usr/local/venv/lib/python3.9/site-packages/lmdeploy/serve/async_engine.py", line 425, in gather
    await asyncio.gather(
  File "/usr/local/venv/lib/python3.9/site-packages/lmdeploy/serve/async_engine.py", line 410, in _inner_call
    async for out in generator:
  File "/usr/local/venv/lib/python3.9/site-packages/lmdeploy/serve/async_engine.py", line 571, in generate
    prompt_input = await self._get_prompt_input(prompt,
  File "/usr/local/venv/lib/python3.9/site-packages/lmdeploy/serve/vl_async_engine.py", line 59, in _get_prompt_input
    segs = decorated.split(IMAGE_TOKEN)
AttributeError: 'NoneType' object has no attribute 'split'

Thank you for your help!

vietpho avatar Jul 25 '24 08:07 vietpho