
[Feature] Support vl models quantization

Open · AllentDan opened this issue 1 year ago • 7 comments

  • [x] deepseek vl
  • [x] llava
  • [x] internvl
  • [x] xcomposer (plora was not quantized)
  • [x] minigemini
  • [x] yi
  • [x] qwen
  • [x] internvl-llava

AllentDan · May 07 '24

When quantizing xcomposer2, the weight_type is int4, so LlamaLinear.h needs to be changed; otherwise the forward pass only goes through forwardInt4 and never through the plora path. (A sketch of the intended behavior follows.)
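
To illustrate, here is a minimal PyTorch sketch of the dispatch described above; it is not LMDeploy's C++ implementation, and the names (`base_linear`, `plora_a`, `plora_b`, `im_mask`) are assumptions for the example:

```python
import torch
import torch.nn as nn

def plora_linear_forward(x: torch.Tensor,
                         im_mask: torch.Tensor,
                         base_linear: nn.Module,
                         plora_a: nn.Linear,
                         plora_b: nn.Linear) -> torch.Tensor:
    """Base (possibly int4-quantized) projection plus a PLoRA delta on image tokens.

    If only the int4 kernel path runs (the bug described above), the masked
    branch below is skipped and image tokens lose their PLoRA correction.
    """
    out = base_linear(x)  # stands in for forwardInt4 in LlamaLinear.h
    if im_mask.any():
        # Apply the low-rank adapter only at image-token positions.
        out[im_mask] = out[im_mask] + plora_b(plora_a(x[im_mask]))
    return out

# Toy usage: one sequence of 8 tokens, hidden size 16, first 4 are image tokens.
hidden = torch.randn(1, 8, 16)
mask = torch.zeros(1, 8, dtype=torch.bool)
mask[:, :4] = True
base = nn.Linear(16, 16)
lora_a = nn.Linear(16, 4, bias=False)
lora_b = nn.Linear(4, 16, bias=False)
out = plora_linear_forward(hidden, mask, base, lora_a, lora_b)
```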

irexyc · May 14 '24

If xcomposer2's quantization turns out to be complicated, I suggest handling it in a separate PR. Otherwise it may slow down review and may also conflict with other PRs.

lvhan028 · May 17 '24

> If xcomposer2's quantization turns out to be complicated, I suggest handling it in a separate PR. Otherwise it may slow down review and may also conflict with other PRs.

It works now, but I ran into something strange; see the comment above.

AllentDan · May 17 '24

Please resolve the conflicts.

lvhan028 · May 18 '24

Below are the MMBench results for the TurboMind VL models and their AWQ-quantized TurboMind counterparts (without search scale); for each model, the adjacent "awq" column is the 4-bit variant.

| Category                                 | llava-v1.6-vicuna-7b |  awq | InternVL-Chat-V1-5 |  awq | xcomposer2-vl-7b |  awq |
|------------------------------------------|---------------------:|-----:|-------------------:|-----:|-----------------:|-----:|
| Average                                  |                 55.8 | 55.3 |               78.8 | 79.2 |             77.3 | 74.7 |
| attribute_reasoning                      |                 57.8 | 60.3 |               79.9 | 81.4 |             80.4 | 77.4 |
| coarse_perception                        |                 69.9 | 69.6 |               85.1 | 86.5 |             86.5 | 85.8 |
| finegrained_perception (cross-instance)  |                 45.5 | 45.5 |               68.5 | 62.9 |             64.3 | 64.3 |
| finegrained_perception (instance-level)  |                 57.7 | 54.3 |               85.0 | 86.3 |             80.9 | 76.5 |
| logic_reasoning                          |                 31.4 | 28.8 |               54.2 | 56.8 |             55.1 | 45.8 |
| relation_reasoning                       |                 48.7 | 52.2 |               82.6 | 81.7 |             78.3 | 79.1 |
| action_recognition                       |                 88.9 | 79.6 |               92.6 | 92.6 |             85.2 | 87.0 |
| attribute_comparison                     |                 25.0 | 31.8 |               72.7 | 65.9 |             72.7 | 68.2 |
| attribute_recognition                    |                 77.0 | 68.9 |               97.3 | 97.3 |             94.6 | 89.2 |
| celebrity_recognition                    |                 71.7 | 69.7 |               91.9 | 93.9 |             85.9 | 82.8 |
| function_reasoning                       |                 64.6 | 64.6 |               86.1 | 86.1 |             83.5 | 78.5 |
| future_prediction                        |                 40.0 | 35.0 |               47.5 | 50.0 |             50.0 | 52.5 |
| identity_reasoning                       |                 84.4 | 88.9 |               97.8 | 97.8 |             97.8 | 97.8 |
| image_emotion                            |                 76.0 | 76.0 |               78.0 | 82.0 |             82.0 | 80.0 |
| image_quality                            |                 39.6 | 41.5 |               56.6 | 58.5 |             71.7 | 73.6 |
| image_scene                              |                 77.9 | 75.0 |               97.1 | 97.1 |             95.2 | 96.2 |
| image_style                              |                 81.1 | 79.2 |               92.5 | 94.3 |             84.9 | 83.0 |
| image_topic                              |                 66.7 | 72.2 |               91.7 | 91.7 |             91.7 | 86.1 |
| nature_relation                          |                 39.6 | 41.7 |               77.1 | 77.1 |             62.5 | 64.6 |
| object_localization                      |                 23.5 | 19.8 |               70.4 | 70.4 |             70.4 | 63.0 |
| ocr                                      |                 56.4 | 59.0 |               74.4 | 79.5 |             64.1 | 64.1 |
| physical_property_reasoning              |                 34.7 | 38.7 |               62.7 | 66.7 |             66.7 | 64.0 |
| physical_relation                        |                  4.2 |  8.3 |               75.0 | 70.8 |             79.2 | 79.2 |
| social_relation                          |                 83.7 | 88.4 |               93.0 | 93.0 |             95.3 | 95.3 |
| spatial_relationship                     |                 13.3 | 17.8 |               35.6 | 24.4 |             31.1 | 33.3 |
| structuralized_imagetext_understanding   |                 26.9 | 25.6 |               57.7 | 60.3 |             57.7 | 42.3 |

AllentDan · May 20 '24

Added 4-bit quantization for the following models; the normal-chat and VL image-recognition test cases all pass (an example quantization command is sketched after the list):

  • Qwen/Qwen-VL-Chat
  • liuhaotian/llava-v1.5-7b
  • liuhaotian/llava-v1.5-13b
  • liuhaotian/llava-v1.6-vicuna-7b
  • 01-ai/Yi-VL-6B
  • deepseek-ai/deepseek-vl-1.3b-chat
  • OpenGVLab/InternVL-Chat-V1-5
  • internlm/internlm-xcomposer2-vl-7b
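
For reference, each of the models above can be quantized with the same CLI invocation that appears later in this thread; the model name and work-dir below are just one example:

```shell
lmdeploy lite auto_awq liuhaotian/llava-v1.5-7b --work-dir ./llava-v1.5-7b-4bit
```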

zhulinJulia24 · May 21 '24

> Added 4-bit quantization for the following models; the normal-chat and VL image-recognition test cases all pass:
>
>   • Qwen/Qwen-VL-Chat
>   • liuhaotian/llava-v1.5-7b
>   • liuhaotian/llava-v1.5-13b
>   • liuhaotian/llava-v1.6-vicuna-7b
>   • 01-ai/Yi-VL-6B
>   • deepseek-ai/deepseek-vl-1.3b-chat
>   • OpenGVLab/InternVL-Chat-V1-5
>   • internlm/internlm-xcomposer2-vl-7b

May I ask where the quantized 4-bit models will be available for download?

rTrQqgH74lc2PT5k · May 21 '24

Performance tested OK.

AllentDan · May 24 '24

@AllentDan Using the latest code to quantize internlm/internlm-xcomposer2-4khd-7b with the following command, I got this error:

  • command: lmdeploy lite auto_awq internlm/internlm-xcomposer2-4khd-7b --work-dir /data/quant/internlm-xcomposer2-4khd-7b-4bit
can't find model from local_path internlm/internlm-xcomposer2-4khd-7b, try to download from remote                                                                                                           
Fetching 22 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [00:00<00:00, 102641.48it/s]
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm2. This is not supported for all configurations of models and can yield errors.                                      
Set max length to 16384                                                                                                                                                                                      
Dummy Resized                                                                                                                                                                                                
Move model.tok_embeddings to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
(... model.layers.2 through model.layers.30 moved to CPU likewise ...)
Move model.layers.31 to CPU.
Move model.norm to GPU.
Move output to CPU.
Move vit to GPU.
Move vision_proj to GPU.
Loading calibrate dataset ...
Traceback (most recent call last):
  File "/opt/py38/bin/lmdeploy", line 11, in <module>
    load_entry_point('lmdeploy', 'console_scripts', 'lmdeploy')()
  File "/opt/lmdeploy/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/lmdeploy/lmdeploy/cli/lite.py", line 137, in auto_awq
    auto_awq(**kwargs)
  File "/opt/lmdeploy/lmdeploy/lite/apis/auto_awq.py", line 96, in auto_awq
    vl_model, model, tokenizer, work_dir = calibrate(model,
  File "/opt/lmdeploy/lmdeploy/lite/apis/calibrate.py", line 235, in calibrate
    calib_ctx.calibrate(all_data)
  File "/opt/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 315, in calibrate
    _ = model(data.to(self.device))
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/a2c222ebd3a723c3dff00232e4f5cc6429f472d1/modeling_internlm2.py", line 958, in forward
    layer_outputs = decoder_layer(
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 195, in _forward
    out = self._ori_forwards[mod](*batch_args[i],
  File "/root/.cache/huggingface/modules/transformers_modules/a2c222ebd3a723c3dff00232e4f5cc6429f472d1/modeling_internlm2.py", line 659, in forward
    hidden_states, self_attn_weights, present_key_value = self.attention(
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/a2c222ebd3a723c3dff00232e4f5cc6429f472d1/modeling_internlm2.py", line 361, in forward
    qkv_states = self.wqkv(hidden_states, im_mask)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/a2c222ebd3a723c3dff00232e4f5cc6429f472d1/build_mlp.py", line 204, in forward
    res[:1] += self.Plora_B(self.Plora_A(
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1557, in _call_impl
    args_result = hook(self, args)
  File "/opt/lmdeploy/lmdeploy/lite/quantization/calibration.py", line 125, in _input_hook
    obs.observe(inp[0])
  File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/lmdeploy/lmdeploy/lite/quantization/activation/observer.py", line 104, in observe
    assert len(x.shape) == 3
AssertionError
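
For context, the assertion at the end fires in the calibration observer, which expects every hooked activation to be 3-D (batch, seq_len, hidden); the PLoRA branch in build_mlp.py appears to feed the hooked Plora_A linear a sliced or reshaped tensor of a different rank. Below is a minimal, hypothetical reproduction of that guard; the function and the example shapes are assumptions for illustration, not LMDeploy's code:

```python
import torch

def observe(x: torch.Tensor) -> None:
    # Mirrors the guard at lmdeploy/lite/quantization/activation/observer.py:104:
    # calibration statistics assume activations shaped (batch, seq_len, hidden_dim).
    assert len(x.shape) == 3, f"expected 3-D activations, got {tuple(x.shape)}"

observe(torch.randn(1, 16, 4096))   # (batch, seq, hidden): passes

try:
    # A 2-D tensor, e.g. hidden states flattened before the PLoRA adapter.
    observe(torch.randn(16, 4096))
except AssertionError as err:
    print("observer rejects non-3-D input:", err)
```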

sshuair · May 27 '24

It seems we have not tested this model yet. I will add support for it later. @sshuair

AllentDan · May 27 '24

@sshuair I opened PR #1666 to support it. You may give it a try.

AllentDan · May 28 '24