
Quantization Support for FastGen?

Open aliozts opened this issue 2 years ago • 4 comments

Hello, does the newly released FastGen support AWQ or GPTQ quantization for the models it serves?

aliozts avatar Nov 04 '23 12:11 aliozts
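
For reference, FastGen's pipeline API at the time (per the MII README) loads standard FP16 checkpoints; there is no quantization option. The commented-out `quantization=` kwarg and the AWQ checkpoint name below are hypothetical, illustrating what this issue is asking for rather than an existing MII parameter:

```python
import mii

# Real FastGen API: loads the Hugging Face checkpoint in FP16.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is"], max_new_tokens=128)
print(response)

# Hypothetical: no such kwarg exists in MII at this time; it only
# sketches the AWQ/GPTQ support being requested in this issue.
# pipe = mii.pipeline("TheBloke/Mistral-7B-v0.1-AWQ", quantization="awq")
```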

Adding quantization support is a high-priority item on our roadmap! We are working to add support for this soon, and we will share more information as the timeline becomes more concrete.

cmikeh2 avatar Nov 08 '23 17:11 cmikeh2

Are there any plans to support 8-bit quantization for Mistral soon?

x66ccff avatar Jan 06 '24 04:01 x66ccff

Hi @cmikeh2, is there any update on AWQ support?

aniketmaurya avatar Jan 17 '24 16:01 aniketmaurya

Of the recent techniques, SmoothQuant from MIT seems extremely promising for serving. It is W8A8 quantization (both weights and activations in INT8), so weights never need to be dequantized to FP16 during inference and the matmuls run directly in INT8. This means inference with SmoothQuant can deliver better latency and throughput than FP16.

Implementation: https://github.com/AniZpZ/AutoSmoothQuant
PR for vLLM: https://github.com/vllm-project/vllm/pull/1508

DreamGenX avatar Jan 27 '24 08:01 DreamGenX
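
To make the W8A8 point concrete, here is a minimal NumPy sketch of SmoothQuant's core idea: a per-input-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1−α) migrates activation outliers into the weights, after which both operands quantize to INT8 and the matmul runs in integer arithmetic, dequantizing only the output. Function names and shapes are illustrative, not taken from AutoSmoothQuant:

```python
import numpy as np

def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    # SmoothQuant per-channel factor: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
    return act_absmax ** alpha / weight_absmax ** (1 - alpha)

def quantize_int8(t):
    # Symmetric per-tensor INT8: returns the quantized tensor and its scale.
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))        # activations: (tokens, in_features)
X[:, 3] *= 30.0                    # simulate an outlier activation channel
W = rng.normal(size=(8, 16))       # weights: (in_features, out_features)

# Migrate the outlier from activations into weights: (X/s) @ (s*W) == X @ W.
s = smooth_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1))
Xq, sx = quantize_int8(X / s)
Wq, sw = quantize_int8(W * s[:, None])

# W8A8: the matmul itself runs in INT8 (accumulated in INT32);
# we dequantize only once, at the output.
Y = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
print("max abs error vs FP:", np.abs(Y - X @ W).max())
```

In the actual method the division by s is folded offline into the preceding layer, so the smoothing adds no runtime cost; only the INT8 matmul remains at inference time.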