
About quantized W8A8 model

Open zyf-gh opened this issue 9 months ago • 5 comments

I found that the quantization algorithm only supports a fixed set of models. If I want to perform int8 quantization on my own custom model, how can I do it?

zyf-gh avatar Mar 28 '25 05:03 zyf-gh

You can follow the README under tools/convertor/profiling_activation. First, run get_act_distribution.py to profile the activation distribution over a set of inputs. Then modify tools/convertor/profiling_activation/utils/quantization_simulation.py to implement the quantization method for your custom model, and use simulate_inference.py to check the result.

oreomaker avatar Mar 28 '25 05:03 oreomaker
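For readers unfamiliar with what "profiling the activation distribution" involves, here is a minimal conceptual sketch of the idea using PyTorch forward hooks. This is not the actual get_act_distribution.py script; the function name `profile_activations` and the per-Linear min/max statistics are assumptions used purely for illustration.

```python
# Conceptual sketch (not the repo's get_act_distribution.py):
# record the input range of every Linear layer over calibration data.
import torch
from collections import defaultdict

def profile_activations(model, calibration_batches):
    """Return {layer_name: {"min": ..., "max": ...}} for each nn.Linear input."""
    stats = defaultdict(lambda: {"min": float("inf"), "max": float("-inf")})
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach().float()
            stats[name]["min"] = min(stats[name]["min"], x.min().item())
            stats[name]["max"] = max(stats[name]["max"], x.max().item())
        return hook

    # Hook every Linear layer; adjust the filter for your custom model.
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch in calibration_batches:
            model(**batch)

    for h in handles:
        h.remove()
    return dict(stats)
```

The recorded ranges are what the simulation step later needs to pick activation quantization scales for each layer.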

Does this method also work for Mixture-of-Experts (MoE) models?

zyf-gh avatar Mar 28 '25 05:03 zyf-gh

Though we haven't implemented MoE models for QNN, the quantization method will work for them, since it simply replaces each Linear layer with a W8A8Linear.

oreomaker avatar Mar 28 '25 05:03 oreomaker
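To make the "just replace Linear with W8A8Linear" point concrete, below is a minimal simulation sketch. It is not the repo's W8A8Linear implementation; `SimulatedW8A8Linear`, `fake_quant_int8`, and the per-tensor symmetric scaling are assumptions chosen for brevity. Because the replacement walks every nn.Linear, it also covers the Linear layers inside MoE expert MLPs.

```python
# Conceptual sketch of W8A8 fake quantization (not mllm's W8A8Linear):
# weights and activations are rounded to int8 and dequantized back to
# float so that accuracy can be simulated without int8 kernels.
import torch
import torch.nn as nn

def fake_quant_int8(x, scale):
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale  # dequantize back to float for simulation

class SimulatedW8A8Linear(nn.Module):
    def __init__(self, linear: nn.Linear, act_absmax: float):
        super().__init__()
        self.bias = linear.bias
        w = linear.weight.data
        self.w_scale = w.abs().max() / 127.0          # per-tensor weight scale
        self.weight = fake_quant_int8(w, self.w_scale)
        self.a_scale = act_absmax / 127.0              # from the profiling step

    def forward(self, x):
        x_q = fake_quant_int8(x, self.a_scale)
        return nn.functional.linear(x_q, self.weight, self.bias)

def replace_linears(module, act_stats, prefix=""):
    """Recursively swap nn.Linear layers (including MoE expert MLPs)."""
    for name, child in module.named_children():
        full = f"{prefix}.{name}" if prefix else name
        if isinstance(child, nn.Linear) and full in act_stats:
            absmax = max(abs(act_stats[full]["min"]), abs(act_stats[full]["max"]))
            setattr(module, name, SimulatedW8A8Linear(child, absmax))
        else:
            replace_linears(child, act_stats, full)
```

A per-channel weight scale (one scale per output row) usually recovers more accuracy than the per-tensor scale shown here; the structure of the replacement is the same either way.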

If I quantize an MoE model and implement its definition, configuration, and tokenizer, I still cannot use the NPU backend of mllm to accelerate the MoE model, right?

zyf-gh avatar Mar 28 '25 05:03 zyf-gh

To run an MoE model in mllm with QNN offload, you need to implement the modeling file, which splits the model into parts for the different backends. For MoE models, which have multiple MLPs, you need to implement separate sub-modules for the experts. The biggest challenge at the moment is initializing the QNN module (building the QNN graphs) before execution.

oreomaker avatar Mar 28 '25 05:03 oreomaker
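To illustrate why MoE makes backend splitting harder, here is a conceptual sketch only. mllm modeling files are written against its own C++ module API, not PyTorch; the Python pseudostructure below is merely meant to show the shape of the problem: the router involves data-dependent dispatch (a natural CPU part), while each expert MLP is a fixed-shape sub-module that a QNN graph would have to be built for before execution. The class names `MoEBlock` and `ExpertMLP` are illustrative, not mllm identifiers.

```python
# Conceptual MoE block (PyTorch-style pseudostructure, NOT the mllm API):
# router = dynamic control flow (CPU side), experts = fixed-shape sub-modules
# that would each need a pre-built graph if offloaded to QNN.
import torch
import torch.nn as nn

class ExpertMLP(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(nn.functional.silu(self.up(x)))

class MoEBlock(nn.Module):
    def __init__(self, dim, hidden, num_experts, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)                  # CPU part
        self.experts = nn.ModuleList(                              # candidate QNN parts
            ExpertMLP(dim, hidden) for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        logits = self.router(x)
        weights, idx = torch.topk(logits.softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Which experts actually run depends on the input, so if the experts
        # are offloaded, their graphs must already exist before execution.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

In an actual mllm modeling file, each expert MLP would be declared as its own sub-module so the per-backend split and the ahead-of-time QNN graph construction can be expressed explicitly, as described in the comment above.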