About quantized W8A8 models
I found that the quantization algorithm only supports a fixed set of models. If I want to perform int8 quantization on my own custom model, how can I do it?
You can follow the README under tools/convertor/profiling_activation: first run get_act_distribution.py to profile the activation distribution over a set of inputs. Then modify tools/convertor/profiling_activation/utils/quantization_simulation.py to implement the quantization method for your custom model. Finally, use simulate_inference.py to check the result.
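For reference, here is a minimal sketch of what the activation-profiling step conceptually does: hook every Linear layer and record the activation range seen over calibration inputs. The hook names, the statistic collected, and the output format of the real get_act_distribution.py may differ, so treat this only as an illustration.

```python
# Conceptual sketch of activation profiling (not the exact logic of
# get_act_distribution.py): record the per-layer max |activation|
# observed over a few calibration batches.
import torch
import torch.nn as nn

act_stats = {}  # module name -> running max of |input activation|

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach()
        act_stats[name] = max(act_stats.get(name, 0.0), x.abs().amax().item())
    return hook

def profile_activations(model, calib_batches):
    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if isinstance(m, nn.Linear)]
    model.eval()
    with torch.no_grad():
        for batch in calib_batches:
            model(batch)
    for h in handles:
        h.remove()
    return act_stats
```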
Does this method also work for Mixture-of-Experts (MoE) models?
Although we haven't implemented MoE models for QNN, the quantization method will work for them, since it simply replaces each Linear layer with a W8A8Linear.
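The sketch below shows the idea of that replacement with a simple fake-quantized Linear wrapper, applied recursively so that expert MLPs inside an MoE block are covered as well. The W8A8LinearSim class and the per-tensor symmetric scaling here are assumptions for illustration; the actual W8A8Linear in utils/quantization_simulation.py may choose scales differently.

```python
# Hedged sketch of "replace Linear with a W8A8 simulated Linear".
import torch
import torch.nn as nn

def fake_quant_int8(t, scale):
    # Symmetric per-tensor int8 fake quantization.
    return torch.clamp(torch.round(t / scale), -128, 127) * scale

class W8A8LinearSim(nn.Module):
    def __init__(self, linear: nn.Linear, act_scale: float):
        super().__init__()
        self.act_scale = act_scale
        w = linear.weight.data
        self.w_scale = w.abs().max() / 127.0
        self.weight = nn.Parameter(fake_quant_int8(w, self.w_scale),
                                   requires_grad=False)
        self.bias = linear.bias

    def forward(self, x):
        x = fake_quant_int8(x, self.act_scale)
        return nn.functional.linear(x, self.weight, self.bias)

def replace_linears(module, act_stats, prefix=""):
    # Works for MoE models too: expert MLPs are just nested Linear layers.
    for name, child in module.named_children():
        full = f"{prefix}.{name}" if prefix else name
        if isinstance(child, nn.Linear):
            scale = act_stats.get(full, 1.0) / 127.0
            setattr(module, name, W8A8LinearSim(child, scale))
        else:
            replace_linears(child, act_stats, full)
```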
If I quantize an MoE model and implement its definition, configuration, and tokenizer, I still cannot use mllm's NPU support to accelerate the MoE model, right?
To run the MoE model in mllm with QNN offload, you need to implement the modeling file, which splits the model into parts for the different backends. For MoE models, which have multiple MLPs, you need to implement the experts as separate sub-modules. The biggest challenge currently is initializing the QNN module (building the QNN graphs) before execution, as illustrated conceptually below.
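The following is a conceptual sketch only: mllm's real modeling files use its own C++ Module/Backend API, and names like BackendPlan, build_plan, and graph_builder here are hypothetical. It just illustrates the two requirements above, assigning each sub-module (attention, router, each expert MLP) to a backend and building all QNN graphs up front, before the first inference.

```python
# Hypothetical illustration of "split the model across backends and
# pre-build QNN graphs"; not mllm's actual API.
from dataclasses import dataclass, field

@dataclass
class BackendPlan:
    assignments: dict = field(default_factory=dict)  # sub-module -> backend

    def assign(self, submodule_name, backend):
        self.assignments[submodule_name] = backend

def build_plan(num_experts):
    plan = BackendPlan()
    plan.assign("attention", "QNN")   # offloaded to the NPU
    plan.assign("router", "CPU")      # dynamic routing stays on CPU
    for i in range(num_experts):
        # Each expert MLP is a separate sub-module with its own graph.
        plan.assign(f"experts.{i}.mlp", "QNN")
    return plan

def init_backends(plan, graph_builder):
    # All QNN graphs must be built before execution starts.
    for name, backend in plan.assignments.items():
        if backend == "QNN":
            graph_builder(name)  # hypothetical hook that builds one QNN graph

if __name__ == "__main__":
    plan = build_plan(num_experts=8)
    init_backends(plan, graph_builder=lambda n: print(f"build QNN graph for {n}"))
```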