GPTQ Quantization needs `use_marlin`
Feature request
Refer to the AutoGPTQ README (https://github.com/AutoGPTQ/AutoGPTQ/blob/main/README.md):
> 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with [Marlin](https://github.com/IST-DASLab/marlin) int4*fp16 matrix multiplication kernel support, enabled with the argument `use_marlin=True` when loading models.
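For reference, here is a minimal sketch of loading an already-quantized checkpoint through AutoGPTQ 0.7.0 with the Marlin kernel enabled (the model id is only a placeholder):

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder: any 4-bit, symmetric, group_size 128 GPTQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
# use_marlin=True selects the Marlin int4*fp16 kernel (Ampere+ GPUs, 4-bit symmetric quantization)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0", use_marlin=True)

inputs = tokenizer("Marlin kernel test:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```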
The GPTQ quantizer in optimum (https://github.com/huggingface/optimum/blob/main/optimum/gptq/quantizer.py) needs a kernel-choice option so that the Marlin kernel can be selected.
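A hypothetical sketch of what the requested option could look like on optimum's `GPTQQuantizer`; the `use_marlin` argument shown here does not exist yet and only illustrates the proposal:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# `use_marlin` is hypothetical: today the quantizer only exposes the exllama kernel options
quantizer = GPTQQuantizer(bits=4, group_size=128, sym=True, dataset="c4", use_marlin=True)
quantized_model = quantizer.quantize_model(model, tokenizer)
```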
Motivation
See the benchmarks comparing the different AutoGPTQ kernels: https://github.com/huggingface/optimum/blob/main/tests/benchmark/README.md
Your contribution
I can open a PR if needed.
@wanghaichen1 Try GPTQModel, where we monkey-patch the HF integration so that it replaces AutoGPTQ.
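A minimal sketch, assuming GPTQModel keeps AutoGPTQ's `from_quantized`-style loading API (check the GPTQModel README for the exact interface and kernel/backend selection):

```python
# Sketch only: assumes GPTQModel exposes an AutoGPTQ-compatible from_quantized loader
from gptqmodel import GPTQModel

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder 4-bit GPTQ checkpoint
model = GPTQModel.from_quantized(model_id, device="cuda:0")
```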