
[Feature] Support Quantized InternLM-XComposer2-VL model

Open 9bow opened this issue 9 months ago

Motivation

I'd like to use the InternLM-XComposer2-VL-7B-4bit model in lmdeploy.

The 4-bit quantized model should use less VRAM and run faster, which makes it a natural fit for lmdeploy.

Related resources

  • InternLM-XComposer2-VL-7B: https://huggingface.co/internlm/internlm-xcomposer2-vl-7b
  • InternLM-XComposer2-VL-7B-4bit: https://huggingface.co/internlm/internlm-xcomposer2-vl-7b-4bit

Additional context

At first glance, it looked like lmdeploy expects a layer-to-shard index JSON file named pytorch_model.bin.index.json. So I saved the quantized model with model.save_quantized() in AutoGPTQ using use_safetensors=False, and then created the pytorch_model.bin.index.json file by reading the layer keys from the model.

But then I got an error like KeyError: 'model.layers.0.feed_forward.w1.weight'. It looks like deeper changes are needed on the lmdeploy side, such as skipping the original (unquantized) weight names while loading (as far as I can tell).
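For context, building such an index file from a single-shard checkpoint looks roughly like this (a minimal sketch; the paths and file names are illustrative, not the exact script I used):

# Rough sketch: build a pytorch_model.bin.index.json for a single-shard
# AutoGPTQ checkpoint. Paths and file names are illustrative.
import json
import os

import torch

ckpt_dir = "/local/path/to/internlm-xcomposer2-vl-7b-4bit"
ckpt_file = "pytorch_model.bin"

# Load only to enumerate the tensor keys and sizes.
state_dict = torch.load(os.path.join(ckpt_dir, ckpt_file), map_location="cpu")

index = {
    "metadata": {
        "total_size": sum(t.numel() * t.element_size() for t in state_dict.values())
    },
    # Every tensor key points at the single shard that contains it.
    "weight_map": {key: ckpt_file for key in state_dict},
}

with open(os.path.join(ckpt_dir, "pytorch_model.bin.index.json"), "w") as f:
    json.dump(index, f, indent=2)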

9bow avatar Apr 26 '24 01:04 9bow

Currently, LMDeploy does not guarantee VL model quantization. We will support this in the next version, in May.

AllentDan avatar Apr 28 '24 02:04 AllentDan

Thanks for letting us know. The InternLM-XComposer2-VL-7B model is working well for me :) I'll wait for the next version that supports the quantized models.

9bow avatar Apr 28 '24 03:04 9bow

The latest lmdeploy supports the feature.

AllentDan avatar May 28 '24 03:05 AllentDan

Hi @AllentDan,

Firstly, thanks a lot for the amazing work you are doing with LMDeploy!

When I try to use https://huggingface.co/internlm/internlm-xcomposer2-vl-7b-4bit with lmdeploy 0.4.2, I get the error referenced in the OP's initial message.

from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl import load_image

gen_config = GenerationConfig(
    temperature=0.8,
    top_p=1.0,
    top_k=1,  # top_k is an integer, not a float
    random_seed=4114
)

pipe = pipeline("/local/path/to/internlm/internlm-xcomposer2-vl-7b-4bit/")

sample = (
    "Please write a detailed description of this image.", 
    load_image('/path/to/toy/img.png')
) 
response = pipe(sample, gen_config=gen_config)

print(response)
The output is the following:

Set max length to 4096
Position interpolate from 24x24 to 35x35
Position interpolate from 24x24 to 35x35
Traceback (most recent call last):
  File "/path/to/my/awesome/code/scratch_pad.py", line 71, in <module>
    pipe = pipeline("/path/to/models/huggingface/internlm/internlm-xcomposer2-vl-7b-4bit/")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/api.py", line 94, in pipeline
    return pipeline_class(model_path,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/serve/vl_async_engine.py", line 20, in __init__
    self.vl_encoder = ImageEncoder(model_path, vision_config)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/vl/engine.py", line 69, in __init__
    self.model = load_vl_model(model_path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/vl/model/builder.py", line 40, in load_vl_model
    return Xcomposer2VisionModel(model_path, with_llm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/vl/model/xcomposer2.py", line 42, in __init__
    self.build_model()
  File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/vl/model/xcomposer2.py", line 89, in build_model
    load_checkpoint_and_dispatch(
  File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/accelerate/big_modeling.py", line 607, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 1637, in load_checkpoint_in_model
    raise ValueError(
ValueError: /path/to/models/huggingface/internlm/internlm-xcomposer2-vl-7b-4bit/ is not a folder containing a `.index.json` file or a pytorch_model.bin or a model.safetensors file

Could you please point me towards what I am doing wrong? I am also running this on a V100. Thanks!

danieltudosiu avatar Jun 26 '24 19:06 danieltudosiu

I was able to run a forward pass with the following, but it consumes 30.9 GB of VRAM:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
from lmdeploy.vl import load_image

gen_config = GenerationConfig(
    temperature=0.8,
    top_p=1.0,
    top_k=1,  # top_k is an integer, not a float
    random_seed=4114
)

te_config = TurbomindEngineConfig(quant_policy=4)

pipe = pipeline(
    "/local/path/to/internlm/internlm-xcomposer2-vl-7b-4bit/",
    backend_config=te_config
)

sample = (
    "Please write a detailed description of this image.", 
    load_image('/path/to/toy/img.png')
) 
response = pipe(sample, gen_config=gen_config)

print(response)

The output is the following:

You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
Set max length to 4096
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
Device does not support bf16.
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                                                                                                                                                          
Response(text='The image captures a lively scene on a bustling city street. Three men, dressed in vibrant red robes and matching hats, are the main subjects of the image. They are pushing a cart filled with oranges, adding a splash of color to the scene. The men appear to be walking down the street, possibly selling their oranges to passersby.', generate_token_len=74, input_token_len=1419, session_id=0, finish_reason='stop', token_ids=[918, 2321, 39909, 395, 47529, 6262, 519, 395, 20988, 2880, 3446, 8725, 281, 14650, 3118, 328, 25802, 435, 33083, 2674, 1064, 9561, 454, 12730, 43974, 328, 657, 410, 2036, 15007, 446, 410, 2321, 281, 2533, 657, 17601, 395, 7552, 10336, 579, 607, 5676, 328, 7980, 395, 34651, 446, 2044, 442, 410, 6262, 281, 707, 3118, 5153, 442, 517, 11584, 1641, 410, 8725, 328, 10915, 11387, 998, 607, 5676, 442, 1640, 518, 1844, 281], logprobs=None)

If I do not pass backend_config, the code above uses the same 30.9 GB of VRAM.
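For what it's worth, most of that VRAM is probably the preallocated KV cache rather than the weights; the cache fraction can be lowered via TurbomindEngineConfig (a minimal sketch, assuming cache_max_entry_count behaves as described in the docs; the value is illustrative and I have not verified it on the 4-bit model):

# Sketch: shrink the KV-cache preallocation, which by default claims a large
# fraction of the free GPU memory after the weights are loaded.
from lmdeploy import TurbomindEngineConfig

te_config = TurbomindEngineConfig(
    quant_policy=4,
    cache_max_entry_count=0.2,  # illustrative fraction of free VRAM for the KV cache
)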

danieltudosiu avatar Jun 26 '24 19:06 danieltudosiu

https://huggingface.co/internlm/internlm-xcomposer2-vl-7b-4bit is a GPTQ-quantized model, which is not supported by lmdeploy. Please refer to our documentation for w4a16 inference: https://lmdeploy.readthedocs.io/en/v0.4.2/quantization/w4a16.html
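Roughly, the documented path is to quantize the original FP16 model with lmdeploy's own AWQ tooling and then point TurboMind at the resulting weights (a minimal sketch based on the v0.4.2 w4a16 guide; the work-dir path is illustrative):

# Step 1 (CLI): quantize the FP16 model with lmdeploy's AWQ tooling, e.g.
#   lmdeploy lite auto_awq internlm/internlm-xcomposer2-vl-7b --work-dir ./xcomposer2-vl-7b-4bit-awq
# Step 2 (Python): serve the quantized weights with model_format='awq'.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "./xcomposer2-vl-7b-4bit-awq",  # work-dir produced in step 1
    backend_config=TurbomindEngineConfig(model_format="awq"),
)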

AllentDan avatar Jun 27 '24 00:06 AllentDan