Reduce LM memory usage
If CUDA is available, load the language model in 8-bit quantized form using bitsandbytes; otherwise, load the LM in torch.float16.
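
A minimal sketch of that branching load, assuming the LM is loaded through a Hugging Face transformers-style API (the model name and helper function are illustrative, not ml-mdm's actual loading code):

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig


def load_language_model(model_name: str = "t5-base"):  # model name is illustrative
    if torch.cuda.is_available():
        # bitsandbytes 8-bit quantization (CUDA-only)
        return AutoModel.from_pretrained(
            model_name,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
    # No GPU: fall back to half precision, halving memory vs. float32
    return AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)
```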
One could also look into using CTranslate2 for quantization, which would work on CPU.
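
A rough sketch of the CTranslate2 route, assuming the LM is a seq2seq model supported by the CTranslate2 converter (model name and output path are illustrative):

```python
# One-time conversion to an int8 CTranslate2 model (shell):
#   ct2-transformers-converter --model t5-base \
#       --quantization int8 --output_dir t5-base-ct2

import ctranslate2

# int8 compute runs on CPU, unlike bitsandbytes 8-bit, which requires CUDA
translator = ctranslate2.Translator(
    "t5-base-ct2", device="cpu", compute_type="int8"
)
```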
https://github.com/apple/ml-mdm/issues/47