mlc-llm
[Question] semantic description of different quantization methods
❓ General Questions
Hi,
I'd love to know the trends across the different quantization methods supported by MLC. For example (I made these orderings up):
slowest to fastest: q0f32, q3f16_0, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq
least to most memory consumption: q0f32, q3f16_0, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq
response quality (highest to lowest): q0f32, q3f16_0, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq
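For what it's worth, the codes above follow the `qAfB(_id)` pattern described in the MLC docs, where `A` is the weight bit-width, `fB` is the activation float width, and the optional suffix selects a packing/kernel variant. Below is a minimal sketch (my own helper, not part of MLC) that parses such a code and gives a back-of-envelope weight-memory estimate; it ignores quantization scales/zero-points, the KV cache, and runtime buffers, so treat the numbers as lower bounds:

```python
import re

def parse_quant_name(name: str):
    """Parse an MLC-style quantization code such as 'q4f16_1'.

    Assumed convention (per MLC docs): qAfB_id, where A is the weight
    bit-width, B the activation float width, and the optional suffix
    an implementation variant.
    """
    m = re.fullmatch(r"q(\d+)f(\d+)(?:_(\w+))?", name)
    if m is None:
        raise ValueError(f"unrecognized quantization code: {name}")
    weight_bits = int(m.group(1))
    act_bits = int(m.group(2))
    variant = m.group(3)
    # q0fB means "no weight quantization": weights stay in the fB float type.
    if weight_bits == 0:
        weight_bits = act_bits
    return weight_bits, act_bits, variant

def approx_weight_gib(n_params: float, weight_bits: int) -> float:
    """Rough weight-only memory estimate in GiB (ignores scales,
    zero-points, KV cache, and activations)."""
    return n_params * weight_bits / 8 / 2**30

# Example: an 8B-parameter model at 4-bit weights.
bits, act, variant = parse_quant_name("q4f16_1")
print(bits, act, variant)                      # 4 16 1
print(round(approx_weight_gib(8e9, bits), 1))  # 3.7 (GiB, weights only)
```

By this estimate, an 8B model needs roughly 3.7 GiB for 4-bit weights versus about 14.9 GiB unquantized at f16, which is the main reason the q4 variants dominate the list.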
It'd also be really nice if there are existing benchmark results (for any model, on any platform).
I'm particularly interested in the Llama 3.1-8B model on Jetson AGX Orin hardware.