[Question] Should we expect generation quality comparable to gguf with 4-bit quantization?
❓ General Questions
While inference speed is 2-3x faster than llama.cpp, I observe some degradation in output quality metrics.
For example, I have a simple test that restores punctuation/capitalization and corrects errors after ASR, and I measure word error rate (WER) against a reference, roughly as sketched below.
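A minimal sketch of how the scoring side of such a test could look (using jiwer as one possible WER implementation; the reference/hypothesis strings here are illustrative, not my actual test data):

```python
# Hypothetical scoring helper for the ASR post-processing test described above.
import jiwer

def average_wer(references, hypotheses):
    """Word error rate of model outputs against reference transcripts."""
    return jiwer.wer(references, hypotheses)

# Illustrative example: corrected reference vs. raw ASR-style hypothesis.
reference = ["Hello, how are you today?"]
hypothesis = ["hello how are you today"]
print(f"WER: {average_wer(reference, hypothesis):.2%}")
```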
I tried the following checkpoints from official mlc-ai repo:
Llama-3-8B-Instruct-q4f16_1-MLC
Llama-3.1-8B-Instruct-q4f16_1-MLC
and compared them to llama.cpp with the corresponding GGUF models quantized as Q4_K_M (a sketch of how I query the MLC side is below).
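For context, the MLC side is queried roughly like this (following the MLCEngine quickstart; the prompt and sampling settings here are placeholders, and the llama.cpp side uses the equivalent GGUF model with the same settings):

```python
# Rough sketch of querying the MLC checkpoint; paths and prompt are placeholders.
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

response = engine.chat.completions.create(
    model=model,
    messages=[{
        "role": "user",
        "content": "Restore punctuation and capitalization: hello how are you today",
    }],
    temperature=0.0,  # greedy decoding so runs are comparable across backends
    stream=False,
)
print(response.choices[0].message.content)
engine.terminate()
```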
For Llama-3-8B-Instruct-q4f16_1-MLC, WER degrades from 40% (llama.cpp) to 46% (MLC).
For Llama-3.1-8B-Instruct-q4f16_1-MLC, WER degrades from 47% to 60%.
So the question is: is this expected, or is something wrong?
I ran my tests on both a Tesla P100 and an AMD Instinct GPU and got the same results.