[Feature] Support VL models quantization
- [x] deepseek vl
- [x] llava
- [x] internvl
- [x] xcomposer (PLoRA weights not quantized)
- [x] minigemini
- [x] yi
- [x] qwen
- [x] internvl-llava
When quantizing xcomposer2, the weight_type is int4, so LlamaLinear.h needs to be changed; otherwise only forwardInt4 is executed and the PLoRA path is skipped.
If the xcomposer2 quantization turns out to be complicated, I suggest handling it in a separate PR. Otherwise it may slow down the review and may also conflict with other PRs.
It works now, but I ran into something strange; see the comment above.
Please resolve the conflicts.
Below are the MMBench results for the turbomind VL models and their turbomind AWQ counterparts (quantized without search scale).
| Category | llava-v1.6-vicuna-7b | llava-v1.6-vicuna-7b (awq) | InternVL-Chat-V1-5 | InternVL-Chat-V1-5 (awq) | xcomposer2-vl-7b | xcomposer2-vl-7b (awq) |
| --- | --- | --- | --- | --- | --- | --- |
| Average | 55.8 | 55.3 | 78.8 | 79.2 | 77.3 | 74.7 |
| attribute_reasoning | 57.8 | 60.3 | 79.9 | 81.4 | 80.4 | 77.4 |
| coarse_perception | 69.9 | 69.6 | 85.1 | 86.5 | 86.5 | 85.8 |
| finegrained_perception (cross-instance) | 45.5 | 45.5 | 68.5 | 62.9 | 64.3 | 64.3 |
| finegrained_perception (instance-level) | 57.7 | 54.3 | 85.0 | 86.3 | 80.9 | 76.5 |
| logic_reasoning | 31.4 | 28.8 | 54.2 | 56.8 | 55.1 | 45.8 |
| relation_reasoning | 48.7 | 52.2 | 82.6 | 81.7 | 78.3 | 79.1 |
| action_recognition | 88.9 | 79.6 | 92.6 | 92.6 | 85.2 | 87.0 |
| attribute_comparison | 25.0 | 31.8 | 72.7 | 65.9 | 72.7 | 68.2 |
| attribute_recognition | 77.0 | 68.9 | 97.3 | 97.3 | 94.6 | 89.2 |
| celebrity_recognition | 71.7 | 69.7 | 91.9 | 93.9 | 85.9 | 82.8 |
| function_reasoning | 64.6 | 64.6 | 86.1 | 86.1 | 83.5 | 78.5 |
| future_prediction | 40.0 | 35.0 | 47.5 | 50.0 | 50.0 | 52.5 |
| identity_reasoning | 84.4 | 88.9 | 97.8 | 97.8 | 97.8 | 97.8 |
| image_emotion | 76.0 | 76.0 | 78.0 | 82.0 | 82.0 | 80.0 |
| image_quality | 39.6 | 41.5 | 56.6 | 58.5 | 71.7 | 73.6 |
| image_scene | 77.9 | 75.0 | 97.1 | 97.1 | 95.2 | 96.2 |
| image_style | 81.1 | 79.2 | 92.5 | 94.3 | 84.9 | 83.0 |
| image_topic | 66.7 | 72.2 | 91.7 | 91.7 | 91.7 | 86.1 |
| nature_relation | 39.6 | 41.7 | 77.1 | 77.1 | 62.5 | 64.6 |
| object_localization | 23.5 | 19.8 | 70.4 | 70.4 | 70.4 | 63.0 |
| ocr | 56.4 | 59.0 | 74.4 | 79.5 | 64.1 | 64.1 |
| physical_property_reasoning | 34.7 | 38.7 | 62.7 | 66.7 | 66.7 | 64.0 |
| physical_relation | 4.2 | 8.3 | 75.0 | 70.8 | 79.2 | 79.2 |
| social_relation | 83.7 | 88.4 | 93.0 | 93.0 | 95.3 | 95.3 |
| spatial_relationship | 13.3 | 17.8 | 35.6 | 24.4 | 31.1 | 33.3 |
| structuralized_imagetext_understanding | 26.9 | 25.6 | 57.7 | 60.3 | 57.7 | 42.3 |
Added 4-bit quantization for the following models; the regular chat and VL image-recognition test cases pass (see the inference sketch after the list):
- Qwen/Qwen-VL-Chat
- liuhaotian/llava-v1.5-7b
- liuhaotian/llava-v1.5-13b
- liuhaotian/llava-v1.6-vicuna-7b
- 01-ai/Yi-VL-6B
- deepseek-ai/deepseek-vl-1.3b-chat
- OpenGVLab/InternVL-Chat-V1-5
- internlm/internlm-xcomposer2-vl-7b
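For reference, here is a hedged sketch of how one of these quantized work dirs can be sanity-checked with the turbomind backend. The work-dir path and image URL are placeholders; it assumes the weights were produced by `lmdeploy lite auto_awq` and are loaded with `model_format='awq'`:

```python
# Hedged sketch, not part of this PR: load a 4-bit work dir produced by
# `lmdeploy lite auto_awq` with the turbomind backend and run one VL query.
# The work-dir path and the image URL below are placeholders.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline(
    './llava-v1.5-7b-4bit',  # placeholder: the --work-dir passed to auto_awq
    backend_config=TurbomindEngineConfig(model_format='awq'))

image = load_image('https://example.com/tiger.jpeg')  # placeholder image
print(pipe(('describe this image', image)))
```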
Where can the quantized 4-bit models be downloaded?
Performance tested OK.
@AllentDan Quantizing internlm/internlm-xcomposer2-4khd-7b with the latest code and the following command produces this error.
Command:
lmdeploy lite auto_awq internlm/internlm-xcomposer2-4khd-7b --work-dir /data/quant/internlm-xcomposer2-4khd-7b-4bit
can't find model from local_path internlm/internlm-xcomposer2-4khd-7b, try to download from remote
Fetching 22 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [00:00<00:00, 102641.48it/s]
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm2. This is not supported for all configurations of models and can yield errors.
Set max length to 16384
Dummy Resized
Move model.tok_embeddings to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.layers.28 to CPU.
Move model.layers.29 to CPU.
Move model.layers.30 to CPU.
Move model.layers.31 to CPU.
Move model.norm to GPU.
Move output to CPU.
Move vit to GPU.
Move vision_proj to GPU.
Loading calibrate dataset ...
Traceback (most recent call last):
File "/opt/py38/bin/lmdeploy", line 11, in <module>
load_entry_point('lmdeploy', 'console_scripts', 'lmdeploy')()
File "/opt/lmdeploy/lmdeploy/cli/entrypoint.py", line 37, in run
args.run(args)
File "/opt/lmdeploy/lmdeploy/cli/lite.py", line 137, in auto_awq
auto_awq(**kwargs)
File "/opt/lmdeploy/lmdeploy/lite/apis/auto_awq.py", line 96, in auto_awq
vl_model, model, tokenizer, work_dir = calibrate(model,
File "/opt/lmdeploy/lmdeploy/lite/apis/calibrate.py", line 235, in calibrate
calib_ctx.calibrate(all_data)
File "/opt/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 315, in calibrate
_ = model(data.to(self.device))
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/a2c222ebd3a723c3dff00232e4f5cc6429f472d1/modeling_internlm2.py", line 958, in forward
layer_outputs = decoder_layer(
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 195, in _forward
out = self._ori_forwards[mod](*batch_args[i],
File "/root/.cache/huggingface/modules/transformers_modules/a2c222ebd3a723c3dff00232e4f5cc6429f472d1/modeling_internlm2.py", line 659, in forward
hidden_states, self_attn_weights, present_key_value = self.attention(
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/a2c222ebd3a723c3dff00232e4f5cc6429f472d1/modeling_internlm2.py", line 361, in forward
qkv_states = self.wqkv(hidden_states, im_mask)
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/a2c222ebd3a723c3dff00232e4f5cc6429f472d1/build_mlp.py", line 204, in forward
res[:1] += self.Plora_B(self.Plora_A(
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1557, in _call_impl
args_result = hook(self, args)
File "/opt/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 125, in _input_hook
obs.observe(inp[0])
File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/lmdeploy/lmdeploy/lite/quantization/activation/observer.py", line 104, in observe
assert len(x.shape) == 3
AssertionError
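The `assert len(x.shape) == 3` in observer.py suggests the calibration hooks expect 3-D activations (presumably [batch, seq_len, hidden]), and the hooked Plora_A linear apparently receives a tensor of a different rank. A minimal illustration of that failure mode, not lmdeploy's actual observer code:

```python
# Minimal illustration, not lmdeploy's actual observer: the calibration hook
# asserts that a hooked Linear sees 3-D activations (presumably
# [batch, seq_len, hidden]); a tensor of any other rank hits the same error.
import torch

def observe(x: torch.Tensor):
    assert len(x.shape) == 3, f'expected a 3-D activation, got {tuple(x.shape)}'
    # ...statistics collection would happen here...

observe(torch.randn(1, 32, 4096))      # passes: regular decoder-layer input
try:
    observe(torch.randn(4, 4096))      # a non-3-D input reproduces the error
except AssertionError as e:
    print('AssertionError:', e)
```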
It seems we have not tested this model yet. I will add support for it later. @sshuair
@sshuair I added a new PR #1666 to support it. You may give it a try.