About quantized W8A8 models
I found that the quantization algorithm only supports a fixed set of models. If I want to perform int8 quantization on my own custom model, how can I do it?
You can follow the README under tools/convertor/profiling_activation: first run get_act_distribution.py to profile the activation distribution over a set of inputs. Then modify tools/convertor/profiling_activation/utils/quantization_simulation.py to implement the quantization method for your custom model. Finally, use simulate_inference.py to check the result.
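For reference, here is a minimal sketch of what the activation-profiling step conceptually does: hook every Linear layer and record the activation range seen over calibration inputs. The hook names, the statistic collected, and the output format of the real get_act_distribution.py may differ, so treat this only as an illustration.

```python
# Conceptual sketch of activation profiling (not the exact logic of
# get_act_distribution.py): record the per-layer max |activation|
# observed over a few calibration batches.
import torch
import torch.nn as nn

act_stats = {}  # module name -> running max of |input activation|

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach()
        act_stats[name] = max(act_stats.get(name, 0.0), x.abs().amax().item())
    return hook

def profile_activations(model, calib_batches):
    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if isinstance(m, nn.Linear)]
    model.eval()
    with torch.no_grad():
        for batch in calib_batches:
            model(batch)
    for h in handles:
        h.remove()
    return act_stats
```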
Does this method also work for Mixture-of-Experts (MoE) models?
Although we haven't implemented MoE models for QNN, the quantization method will work for them, since it simply replaces each Linear layer with a W8A8Linear.
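The sketch below shows the idea of that replacement with a simple fake-quantized Linear wrapper, applied recursively so that expert MLPs inside an MoE block are covered as well. The W8A8LinearSim class and the per-tensor symmetric scaling here are assumptions for illustration; the actual W8A8Linear in utils/quantization_simulation.py may choose scales differently.

```python
# Hedged sketch of "replace Linear with a W8A8 simulated Linear".
import torch
import torch.nn as nn

def fake_quant_int8(t, scale):
    # Symmetric per-tensor int8 fake quantization.
    return torch.clamp(torch.round(t / scale), -128, 127) * scale

class W8A8LinearSim(nn.Module):
    def __init__(self, linear: nn.Linear, act_scale: float):
        super().__init__()
        self.act_scale = act_scale
        w = linear.weight.data
        self.w_scale = w.abs().max() / 127.0
        self.weight = nn.Parameter(fake_quant_int8(w, self.w_scale),
                                   requires_grad=False)
        self.bias = linear.bias

    def forward(self, x):
        x = fake_quant_int8(x, self.act_scale)
        return nn.functional.linear(x, self.weight, self.bias)

def replace_linears(module, act_stats, prefix=""):
    # Works for MoE models too: expert MLPs are just nested Linear layers.
    for name, child in module.named_children():
        full = f"{prefix}.{name}" if prefix else name
        if isinstance(child, nn.Linear):
            scale = act_stats.get(full, 1.0) / 127.0
            setattr(module, name, W8A8LinearSim(child, scale))
        else:
            replace_linears(child, act_stats, full)
```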
If I quantize an MoE model and implement its definition, configuration, and tokenizer, I still cannot use mllm's NPU support to accelerate the MoE model, right?
To run the MoE model in mllm with QNN offload, you need to implement the modeling file, which splits the model into parts for the different backends. For MoE models, which have multiple MLPs, you need to implement the experts as separate sub-modules. The biggest challenge currently is initializing the QNN module (building the QNN graphs) before execution, as illustrated conceptually below.
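The following is a conceptual sketch only: mllm's real modeling files use its own C++ Module/Backend API, and names like BackendPlan, build_plan, and graph_builder here are hypothetical. It just illustrates the two requirements above, assigning each sub-module (attention, router, each expert MLP) to a backend and building all QNN graphs up front, before the first inference.

```python
# Hypothetical illustration of "split the model across backends and
# pre-build QNN graphs"; not mllm's actual API.
from dataclasses import dataclass, field

@dataclass
class BackendPlan:
    assignments: dict = field(default_factory=dict)  # sub-module -> backend

    def assign(self, submodule_name, backend):
        self.assignments[submodule_name] = backend

def build_plan(num_experts):
    plan = BackendPlan()
    plan.assign("attention", "QNN")   # offloaded to the NPU
    plan.assign("router", "CPU")      # dynamic routing stays on CPU
    for i in range(num_experts):
        # Each expert MLP is a separate sub-module with its own graph.
        plan.assign(f"experts.{i}.mlp", "QNN")
    return plan

def init_backends(plan, graph_builder):
    # All QNN graphs must be built before execution starts.
    for name, backend in plan.assignments.items():
        if backend == "QNN":
            graph_builder(name)  # hypothetical hook that builds one QNN graph

if __name__ == "__main__":
    plan = build_plan(num_experts=8)
    init_backends(plan, graph_builder=lambda n: print(f"build QNN graph for {n}"))
```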