Reduce LM memory usage
If CUDA is available, load the language model in 8-bit quantized form using bitsandbytes; otherwise, load the LM in torch.float16.
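
A minimal sketch of that branching load, assuming the LM is loaded through a Hugging Face transformers-style API (the model name and helper function are illustrative, not ml-mdm's actual loading code):

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig


def load_language_model(model_name: str = "t5-base"):  # model name is illustrative
    if torch.cuda.is_available():
        # bitsandbytes 8-bit quantization (CUDA-only)
        return AutoModel.from_pretrained(
            model_name,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
    # No GPU: fall back to half precision, halving memory vs. float32
    return AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)
```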
One could also look into using CTranslate2 for quantization, which would work on CPU.
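
A rough sketch of the CTranslate2 route, assuming the LM is a seq2seq model supported by the CTranslate2 converter (model name and output path are illustrative):

```python
# One-time conversion to an int8 CTranslate2 model (shell):
#   ct2-transformers-converter --model t5-base \
#       --quantization int8 --output_dir t5-base-ct2

import ctranslate2

# int8 compute runs on CPU, unlike bitsandbytes 8-bit, which requires CUDA
translator = ctranslate2.Translator(
    "t5-base-ct2", device="cpu", compute_type="int8"
)
```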
https://github.com/apple/ml-mdm/issues/47