mlc-llm
[Question] semantic description of different quantization methods
❓ General Questions
Hi,
I'd love to know the trends across the different quantization methods supported by MLC. For example (I made these orderings up):
slowest to fastest: q0f32, q3f16_0, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq
least to most memory consumption: q0f32, q3f16_0, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq
response quality (highest to lowest): q0f32, q3f16_0, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq
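For what it's worth, the codes above follow the `qAfB(_id)` pattern described in the MLC docs, where `A` is the weight bit-width, `fB` is the activation float width, and the optional suffix selects a packing/kernel variant. Below is a minimal sketch (my own helper, not part of MLC) that parses such a code and gives a back-of-envelope weight-memory estimate; it ignores quantization scales/zero-points, the KV cache, and runtime buffers, so treat the numbers as lower bounds:

```python
import re

def parse_quant_name(name: str):
    """Parse an MLC-style quantization code such as 'q4f16_1'.

    Assumed convention (per MLC docs): qAfB_id, where A is the weight
    bit-width, B the activation float width, and the optional suffix
    an implementation variant.
    """
    m = re.fullmatch(r"q(\d+)f(\d+)(?:_(\w+))?", name)
    if m is None:
        raise ValueError(f"unrecognized quantization code: {name}")
    weight_bits = int(m.group(1))
    act_bits = int(m.group(2))
    variant = m.group(3)
    # q0fB means "no weight quantization": weights stay in the fB float type.
    if weight_bits == 0:
        weight_bits = act_bits
    return weight_bits, act_bits, variant

def approx_weight_gib(n_params: float, weight_bits: int) -> float:
    """Rough weight-only memory estimate in GiB (ignores scales,
    zero-points, KV cache, and activations)."""
    return n_params * weight_bits / 8 / 2**30

# Example: an 8B-parameter model at 4-bit weights.
bits, act, variant = parse_quant_name("q4f16_1")
print(bits, act, variant)                      # 4 16 1
print(round(approx_weight_gib(8e9, bits), 1))  # 3.7 (GiB, weights only)
```

By this estimate, an 8B model needs roughly 3.7 GiB for 4-bit weights versus about 14.9 GiB unquantized at f16, which is the main reason the q4 variants dominate the list.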
It'd also be really nice if there are existing benchmark results (for any model, on any platform).
I'm particularly interested in the Llama 3.1-8B model on Jetson AGX Orin hardware.