
[Question/Improvement] Specify the quantization method applied by llmexport.py and MNNConvert

Open Anderhar opened this issue 6 months ago • 5 comments

Having a clear understanding of the quantization method to be applied can be crucial when it comes to choosing the correct Quantization-Aware Trained (QAT) model. Unfortunately, the current docs for llmexport.py and MNNConvert are not explicit about which method is used.

Personally, as a novice, I'm lost choosing between gemma-3-12b-qat-int4-unquantized and gemma-3-12b-qat-q4_0-unquantized, since I can't determine which one is preferable for conversion to MNN. Any tips are welcome.

Anderhar avatar Jun 03 '25 15:06 Anderhar

Well, regardless of the Gemma 3 version chosen (including the non-QAT one), llmexport.py fails with AttributeError: 'NoneType' object has no attribute 'weight'.

Anderhar avatar Jun 04 '25 23:06 Anderhar

Well, regardless of the Gemma 3 version chosen (including the non-QAT one), llmexport.py fails with AttributeError: 'NoneType' object has no attribute 'weight'.

Which Gemma 3 model are you converting? We have tested Gemma 3 4B and 1B and they work. Did you update MNN to 3.2.0?

jxt1234 avatar Jun 10 '25 05:06 jxt1234

See https://mnn-docs.readthedocs.io/en/latest/transformers/llm.html, or run python3 llmexport -h to see the quantization options:

  --awq                 Whether or not to use awq quant.
  --sym                 Whether or not to use symmetric quant (without zeropoint), default is False.
  --quant_bit QUANT_BIT
                        mnn quant bit, 4 or 8, default is 4.
  --quant_block QUANT_BLOCK
                        mnn quant block, 0 means channel-wise, default is 128.
  --lm_quant_bit LM_QUANT_BIT
                        mnn lm_head quant bit, 4 or 8, default is `quant_bit`.
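
For reference, a typical export command combining these options might look like the following. This is a sketch only: the model directory is a placeholder, and the --path/--export flags are taken from the linked documentation, so check python3 llmexport -h for the exact set supported by your version.

    # Hypothetical invocation; adjust paths and flags to your setup.
    python3 llmexport.py \
        --path ./gemma-3-12b-it \
        --export mnn \
        --quant_bit 4 \
        --quant_block 64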

jxt1234 avatar Jun 10 '25 05:06 jxt1234

MNN's quantization has many more degrees of freedom than llama.cpp's. Normally, weight quantization with block=64 has about the same precision as Q4_1, while MNN's block=32 has higher precision than both Q4_1 and Q4_0.

[Image: precision comparison table]
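
To make the block-size tradeoff concrete, below is a minimal NumPy sketch of block-wise asymmetric 4-bit quantization (one scale and minimum per block, i.e. conceptually the non --sym case). It illustrates the general technique, not MNN's actual kernel: smaller blocks mean each scale/offset pair covers fewer weights, so reconstruction error drops at the cost of storing more quantization parameters.

    import numpy as np

    def quantize_blockwise_int4(w, block=32):
        """Asymmetric 4-bit block quantization of a 1-D weight row (illustrative only)."""
        w = w.reshape(-1, block)                      # split the row into blocks
        w_min = w.min(axis=1, keepdims=True)
        w_max = w.max(axis=1, keepdims=True)
        scale = (w_max - w_min) / 15.0                # 4 bits -> 16 levels (0..15)
        scale = np.where(scale == 0, 1e-8, scale)     # guard against constant blocks
        q = np.clip(np.round((w - w_min) / scale), 0, 15)
        return q.astype(np.uint8), scale, w_min       # int codes + per-block params

    def dequantize(q, scale, w_min):
        return q * scale + w_min

    # Smaller blocks -> lower reconstruction error (but more scales/offsets stored).
    rng = np.random.default_rng(0)
    row = rng.standard_normal(4096).astype(np.float32)
    for block in (128, 64, 32):
        q, s, m = quantize_blockwise_int4(row, block)
        err = np.abs(dequantize(q, s, m).reshape(-1) - row).mean()
        print(f"block={block:4d}  mean abs error={err:.5f}")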

jxt1234 avatar Jun 10 '25 05:06 jxt1234

Thanks for the response!

See https://mnn-docs.readthedocs.io/en/latest/transformers/llm.html, or run python3 llmexport -h to see the quantization options.

I meant a definite guideline for avoiding mistakes when choosing between QAT models trained for int4 or Q4_0, or a clear statement of whether QAT checkpoints are suitable for MNN at all. At first I assumed I should choose the int4 version, but after your reply I started leaning towards the vanilla one. (Still ambiguous, as you can see.)

Which Gemma 3 model are you converting? We have tested Gemma 3 4B and 1B and they work. Did you update MNN to 3.2.0?

The 12B models linked above. I have tested the 4B, but it's still too weak; I need something stronger.

MNN's quantization has many more degrees of freedom than llama.cpp's. Normally, weight quantization with block=64 has about the same precision as Q4_1, while MNN's block=32 has higher precision than both Q4_1 and Q4_0.

Nice table. I relied on block=0 before, but I want to try 32 next time (hopefully for better quality).

Anderhar avatar Jun 10 '25 16:06 Anderhar

Well, regardless of the Gemma 3 version chosen (including the non-QAT one), llmexport.py fails with AttributeError: 'NoneType' object has no attribute 'weight'.

I got the same error on Qwen2.5-VL-7B-Instruct, even with the latest code from the repo and MNN==3.2.0, but it works on Qwen2.5-0.5B-Instruct, which is not a visual LLM.

So, are multimodal LLMs (MLLMs) not supported? @jxt1234

NNsauce avatar Jun 19 '25 03:06 NNsauce

[Image]

It works after I changed the value of the 'lm_' key in self.model_map['model'] from 'model.lm_head' to 'lm_head' (the same as the value used for Qwen2.5); after that, the rest of the conversion and quantization passed. I don't know why.
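
For anyone else hitting this, the workaround above amounts to a one-line edit of the model mapping in llmexport.py so that the lm_head lookup points at where the module actually lives in the checkpoint. A hedged sketch of the change (the exact location of this dict depends on your llmexport version):

    # In llmexport.py, where self.model_map is built (location varies by version):
    self.model_map['model']['lm_'] = 'lm_head'   # was 'model.lm_head'; matches the Qwen2.5 mapping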

NNsauce avatar Jun 19 '25 08:06 NNsauce

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Aug 18 '25 09:08 github-actions[bot]