
Quantization

Open · KnutJaegersberg opened this issue 2 years ago · 2 comments

🚀 Feature

HF transformers implements 8-bit and 4-bit quantization. It would be nice if that feature could be leveraged for the xlm-r-xxl machine translation evaluation model.
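A minimal sketch of what this could look like, assuming the underlying encoder were loaded directly through transformers (the model id and settings below are illustrative, not COMET's actual loading path):

```python
# Hypothetical sketch: loading the XLM-R XXL encoder in 8-bit via
# bitsandbytes. This quantizes the raw encoder only; COMET's regression
# head and checkpoint loading are not wired up here.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/xlm-roberta-xxl"  # assumed encoder behind the xxl eval models

quant_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",          # place layers across available devices
    torch_dtype=torch.float16,  # keep non-quantized modules in fp16
)
```

A 4-bit variant would swap `load_in_8bit=True` for `load_in_4bit=True` plus the `bnb_4bit_*` options in `BitsAndBytesConfig`.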

Motivation

The large xlm-r-xxl model is too big for most commodity GPUs. To broaden access to top-performing translation evaluation, please provide a quantized version.

Alternatives

I have seen a few libraries outside the HF ecosystem that quantize BERT-family models.

Additional context

I tried to load the big model in 8-bit with HF. Even without device_map="auto" I could load it, and it then used 14 GB of VRAM, but I don't know how to actually use it from there.
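For reference, a self-contained sketch of exercising such an 8-bit-loaded encoder with a plain forward pass (COMET's actual scoring head would still need to run on top of these hidden states; model id is illustrative):

```python
# Hypothetical sketch: run a forward pass through an 8-bit XLM-R encoder
# and mean-pool the hidden states. This only shows the quantized encoder
# working; it is not COMET's scoring pipeline.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/xlm-roberta-xxl"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

batch = tokenizer(["Hello world", "Hallo Welt"], padding=True,
                  return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**batch)

# Rough sentence embeddings via attention-masked mean pooling.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
```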

KnutJaegersberg · Sep 29, 2023

Loading in 8-bit and using FlashAttention would be great enhancements. There is a good example of RoBERTa with flash-attention.

ricardorei · Oct 2, 2023
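A sketch of how the two could combine, assuming a transformers version in which `attn_implementation="flash_attention_2"` is supported for this architecture (it is not for all encoder models; transformers raises a clear error when unsupported):

```python
# Hypothetical sketch: 8-bit quantization plus FlashAttention-2.
# Requires the flash-attn package and architecture support in the
# installed transformers version; model id is illustrative.
import torch
from transformers import AutoModel, BitsAndBytesConfig

model = AutoModel.from_pretrained(
    "facebook/xlm-roberta-xxl",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",
)
```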

This also connects to @BramVanroy's suggestion to use BetterTransformer (#117).

ricardorei · Oct 2, 2023
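For completeness, the BetterTransformer route from #117 is a one-line transform in optimum (a sketch; the model id is illustrative and this again targets the bare encoder, not the COMET checkpoint):

```python
# Hypothetical sketch: optimum's BetterTransformer swaps supported encoder
# layers for PyTorch's fused scaled-dot-product attention kernels.
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/xlm-roberta-xxl")
model = BetterTransformer.transform(model)
# ...run inference...
model = BetterTransformer.reverse(model)  # restore the original modules
```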