[Question] Should we expect generation quality comparable to gguf with 4-bit quantization?
❓ General Questions
While inference speed is 2-3x faster than llama.cpp, I observe some degradation in output quality metrics.
For example, I have a simple test that restores punctuation/capitalization and corrects errors after ASR, and I measure word error rate (WER) against a reference, roughly as sketched below.
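A minimal sketch of how the scoring side of such a test could look (using jiwer as one possible WER implementation; the reference/hypothesis strings here are illustrative, not my actual test data):

```python
# Hypothetical scoring helper for the ASR post-processing test described above.
import jiwer

def average_wer(references, hypotheses):
    """Word error rate of model outputs against reference transcripts."""
    return jiwer.wer(references, hypotheses)

# Illustrative example: corrected reference vs. raw ASR-style hypothesis.
reference = ["Hello, how are you today?"]
hypothesis = ["hello how are you today"]
print(f"WER: {average_wer(reference, hypothesis):.2%}")
```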
I tried the following checkpoints from official mlc-ai repo:
Llama-3-8B-Instruct-q4f16_1-MLC
Llama-3.1-8B-Instruct-q4f16_1-MLC
and compared them to llama.cpp with the corresponding GGUF models quantized as Q4_K_M (a sketch of how I query the MLC side is below).
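For context, the MLC side is queried roughly like this (following the MLCEngine quickstart; the prompt and sampling settings here are placeholders, and the llama.cpp side uses the equivalent GGUF model with the same settings):

```python
# Rough sketch of querying the MLC checkpoint; paths and prompt are placeholders.
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

response = engine.chat.completions.create(
    model=model,
    messages=[{
        "role": "user",
        "content": "Restore punctuation and capitalization: hello how are you today",
    }],
    temperature=0.0,  # greedy decoding so runs are comparable across backends
    stream=False,
)
print(response.choices[0].message.content)
engine.terminate()
```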
For Llama-3-8B-Instruct-q4f16_1-MLC, WER degrades from 40% (llama.cpp) to 46% (MLC).
For Llama-3.1-8B-Instruct-q4f16_1-MLC, WER degrades from 47% to 60%.
So the question is: is this expected, or is something wrong?
I ran my tests on both a Tesla P100 and an AMD Instinct GPU and got the same results.