
"Clarification on Multimodal Model Quantization and Default Calibration Dataset"

Open donghong1 opened this issue 11 months ago • 1 comment

Hello,

I have a few questions regarding the quantization of multimodal models:

1. Does the current version of AutoAWQ quantize only the language model, or does it also quantize the vision component?
2. What is the default calibration dataset used for quantization?
3. I noticed that the example code for Qwen2-VL uses a custom multimodal dataset. Is such a dataset required for all multimodal model quantizations, or can we use the default dataset?

Thank you for your clarification!

donghong1 avatar Feb 17 '25 03:02 donghong1

Hi, @donghong1

  1. As far as I know, AWQ itself quantizes only the language model, not the vision encoder. I have seen a paper that proposes quantizing vision encoders, but I'm not sure which paper it was.
  2. It seems the authors use the pile-val-backup dataset in their paper (see the sketch after this list).
  3. I am not 100% sure, but I guess we should use a multimodal dataset for VLMs.
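
A minimal sketch of a text-only quantization run, assuming the standard `AutoAWQForCausalLM.quantize` API; the model and output paths are placeholders, and `calib_data="pileval"` (which AutoAWQ resolves to mit-han-lab/pile-val-backup) is believed to be the default, passed explicitly here only to make it visible:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths for illustration only.
model_path = "Qwen/Qwen2-7B-Instruct"
quant_path = "Qwen2-7B-Instruct-awq"

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# calib_data="pileval" is assumed to be the default calibration set
# (mit-han-lab/pile-val-backup); a custom list of texts can be passed instead.
model.quantize(tokenizer, quant_config=quant_config, calib_data="pileval")

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

For VLMs such as Qwen2-VL, the repository's example instead builds a multimodal calibration set, which is why the default text-only data may not be a drop-in replacement there.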

seungwoos avatar Feb 20 '25 04:02 seungwoos