DeepSpeed-MII
Quantization Support for FastGen?
Hello, does the newly released FastGen support AWQ or GPTQ quantization for any of the models it serves?
Adding quantization support is a high-priority item on our roadmap! We are working to add it soon and will share more information as the timeline becomes more concrete.
Is there any plan to support 8-bit quantization for Mistral in the near future?
hi @cmikeh2, is there any update on AWQ support?
Of the recent techniques, SmoothQuant from MIT seems extremely promising for serving. It is W8A8 quantization, so the weights do not need to be dequantized during inference; the matmuls run directly in INT8. This means inference with SmoothQuant can achieve better latency and throughput than FP16.
Implementation: https://github.com/AniZpZ/AutoSmoothQuant
PR for vLLM: https://github.com/vllm-project/vllm/pull/1508
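For context, here is a minimal sketch of the W8A8 idea behind SmoothQuant: a per-channel smoothing factor migrates activation outliers into the weights so that both can be quantized to INT8, and the matmul runs in INT8 with a single dequantization of the accumulator at the end. The helper names (`smooth_scales`, `quantize_int8`) and the `alpha=0.5` setting are illustrative only, not the AutoSmoothQuant or DeepSpeed-MII API.

```python
# Minimal sketch of the SmoothQuant (W8A8) idea, assuming PyTorch.
import torch

def smooth_scales(act_absmax, weight, alpha=0.5):
    # Per-input-channel smoothing factor that shifts quantization
    # difficulty from activations (outlier channels) to weights.
    w_absmax = weight.abs().amax(dim=0)                      # [in_features]
    return (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

def quantize_int8(t):
    # Symmetric per-tensor INT8 quantization; returns int8 tensor and scale.
    scale = t.abs().max() / 127.0
    q = torch.clamp(torch.round(t / scale), -128, 127).to(torch.int8)
    return q, scale

# Toy example: one linear layer with calibration statistics.
torch.manual_seed(0)
x = torch.randn(4, 16) * torch.linspace(0.1, 8.0, 16)       # outlier channels
w = torch.randn(32, 16)                                      # [out, in]

s = smooth_scales(x.abs().amax(dim=0), w)                    # [in_features]
x_smooth, w_smooth = x / s, w * s                            # balanced ranges

xq, sx = quantize_int8(x_smooth)
wq, sw = quantize_int8(w_smooth)

# INT8 GEMM (emulated in float here), dequantized once at the end,
# i.e. no per-element dequantization inside the matmul.
y_w8a8 = (xq.float() @ wq.float().T) * (sx * sw)
y_ref = x @ w.T
print("max abs error:", (y_w8a8 - y_ref).abs().max().item())
```

The smoothing step is mathematically exact (`(x / s) @ (w * s).T == x @ w.T`), so the only error comes from the INT8 rounding, which the balanced ranges keep small even with activation outliers.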