[Feature] Support Quantized InternLM-XComposer2-VL model
Motivation
I'd like to use the InternLM-XComposer2-VL-7B-4bit model in lmdeploy.
I think the 4-bit quantized model uses less VRAM and computes faster, which makes it a good fit for lmdeploy.
Related resources
- InternLM-XComposer2-VL-7B: https://huggingface.co/internlm/internlm-xcomposer2-vl-7b
- InternLM-XComposer2-VL-7B-4bit: https://huggingface.co/internlm/internlm-xcomposer2-vl-7b-4bit
Additional context
At first glance, it looked like lmdeploy expects a per-layer weight index file named pytorch_model.bin.index.json. So I used model.save_quantized() in AutoGPTQ with use_safetensors=False, and then created the pytorch_model.bin.index.json file by reading the layer keys from the model. But I got an error like KeyError: 'model.layers.0.feed_forward.w1.weight'.
It looks like deeper changes are needed on the lmdeploy side, such as excluding the unquantized weight keys while loading (as far as I can tell).
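A minimal sketch of how such an index file could be assembled from the single .bin checkpoint (the checkpoint filename below is a placeholder, not necessarily what AutoGPTQ writes):

import json
import torch

# Placeholder name; use whatever file model.save_quantized(use_safetensors=False) produced.
ckpt_file = "gptq_model-4bit-128g.bin"
state_dict = torch.load(ckpt_file, map_location="cpu")

index = {
    # Total byte size of all tensors in the checkpoint.
    "metadata": {"total_size": sum(t.numel() * t.element_size() for t in state_dict.values())},
    # Every weight key maps to the single checkpoint file.
    "weight_map": {key: ckpt_file for key in state_dict},
}

with open("pytorch_model.bin.index.json", "w") as f:
    json.dump(index, f, indent=2)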
Currently, LMDeploy does not guarantee VL model quantization. We will support this in the next version, in May.
Thanks for letting me know. The InternLM-XComposer2-VL-7B model is working well for me :)
I'll wait for the next version that supports the quantized models.
The latest lmdeploy supports the feature.
Hi @AllentDan,
Firstly, thanks a lot for the amazing work you are doing with LMDeploy!
When I try to use https://huggingface.co/internlm/internlm-xcomposer2-vl-7b-4bit with lmdeploy 0.4.2 I get the error referenced in the OP's initial message.
from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl import load_image
from os.path import join

gen_config = GenerationConfig(
    temperature=0.8,
    top_p=1.0,
    top_k=1.0,
    random_seed=4114
)
pipe = pipeline("/local/path/to/internlm/internlm-xcomposer2-vl-7b-4bit/")
sample = (
    "Please write a detailed description of this image.",
    load_image('/path/to/toy/img.png')
)
response = pipe(sample, gen_config=gen_config)
print(response)
Set max length to 4096
Position interpolate from 24x24 to 35x35
Position interpolate from 24x24 to 35x35
Traceback (most recent call last):
File "/path/to/my/awesome/code/scratch_pad.py", line 71, in <module>
pipe = pipeline("/path/to/models/huggingface/internlm/internlm-xcomposer2-vl-7b-4bit/")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/api.py", line 94, in pipeline
return pipeline_class(model_path,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/serve/vl_async_engine.py", line 20, in __init__
self.vl_encoder = ImageEncoder(model_path, vision_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/vl/engine.py", line 69, in __init__
self.model = load_vl_model(model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/vl/model/builder.py", line 40, in load_vl_model
return Xcomposer2VisionModel(model_path, with_llm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/vl/model/xcomposer2.py", line 42, in __init__
self.build_model()
File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/lmdeploy/vl/model/xcomposer2.py", line 89, in build_model
load_checkpoint_and_dispatch(
File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/accelerate/big_modeling.py", line 607, in load_checkpoint_and_dispatch
load_checkpoint_in_model(
File "/path/to/anaconda3/envs/myenv/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 1637, in load_checkpoint_in_model
raise ValueError(
ValueError: /path/to/models/huggingface/internlm/internlm-xcomposer2-vl-7b-4bit/ is not a folder containing a `.index.json` file or a pytorch_model.bin or a model.safetensors file
Could you please point me towards what I am doing wrong? I am running the code on a V100. Thanks!
I was able to run a forward pass if I do the following, but it consumes 30.9 GB of VRAM:
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
from lmdeploy.vl import load_image
from os.path import join

gen_config = GenerationConfig(
    temperature=0.8,
    top_p=1.0,
    top_k=1.0,
    random_seed=4114
)
te_config = TurbomindEngineConfig(quant_policy=4)
pipe = pipeline(
    "/local/path/to/internlm/internlm-xcomposer2-vl-7b-4bit/",
    backend_config=te_config
)
sample = (
    "Please write a detailed description of this image.",
    load_image('/path/to/toy/img.png')
)
response = pipe(sample, gen_config=gen_config)
print(response)
The output is the following:
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
Set max length to 4096
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
Device does not support bf16.
[WARNING] gemm_config.in is not found; using default GEMM algo
Response(text='The image captures a lively scene on a bustling city street. Three men, dressed in vibrant red robes and matching hats, are the main subjects of the image. They are pushing a cart filled with oranges, adding a splash of color to the scene. The men appear to be walking down the street, possibly selling their oranges to passersby.', generate_token_len=74, input_token_len=1419, session_id=0, finish_reason='stop', token_ids=[918, 2321, 39909, 395, 47529, 6262, 519, 395, 20988, 2880, 3446, 8725, 281, 14650, 3118, 328, 25802, 435, 33083, 2674, 1064, 9561, 454, 12730, 43974, 328, 657, 410, 2036, 15007, 446, 410, 2321, 281, 2533, 657, 17601, 395, 7552, 10336, 579, 607, 5676, 328, 7980, 395, 34651, 446, 2044, 442, 410, 6262, 281, 707, 3118, 5153, 442, 517, 11584, 1641, 410, 8725, 328, 10915, 11387, 998, 607, 5676, 442, 1640, 518, 1844, 281], logprobs=None)
If I do not pass backend_config, the code above uses the same 30.9 GB of VRAM.
https://huggingface.co/internlm/internlm-xcomposer2-vl-7b-4bit is a GPTQ-quantized model, which is not supported by lmdeploy. Please refer to our documentation for w4a16 inference: https://lmdeploy.readthedocs.io/en/v0.4.2/quantization/w4a16.html
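For reference, a minimal sketch of the w4a16 route from the linked guide, assuming the original fp16 model has already been quantized with the lmdeploy lite auto_awq command described there and the output written to a local directory (the paths below are placeholders; check the guide for the exact options in your version):

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# model_format='awq' tells the TurboMind backend to load the 4-bit AWQ weights.
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline(
    "/local/path/to/internlm-xcomposer2-vl-7b-4bit-awq/",  # placeholder: output dir of auto_awq
    backend_config=engine_config
)
response = pipe((
    "Please write a detailed description of this image.",
    load_image('/path/to/toy/img.png')
))
print(response)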