[Feature request] Implement CPU dynamic quantization

Open pablogranolabar opened this issue 1 year ago • 5 comments

e.g. https://github.com/MiscellaneousStuff/openai-whisper-cpu
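
For context, this is roughly what the linked repo does: it passes the PyTorch Whisper model through `torch.quantization.quantize_dynamic`. A minimal sketch, not whisper.cpp code; the model name and audio path are placeholders:

```python
import torch
import whisper  # the openai-whisper package

# Load the reference PyTorch model.
model = whisper.load_model("base")

# Dynamic quantization: weights of the listed module types are converted
# to int8 ahead of time; activations are quantized on the fly at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# fp16=False because this runs on CPU.
result = quantized.transcribe("audio.wav", fp16=False)
print(result["text"])
```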

pablogranolabar avatar Oct 27 '22 18:10 pablogranolabar

@pablogranolabar would this also cut the model size? By a factor of 4?

jafri avatar Nov 02 '22 07:11 jafri

Yah that's the hope. Digging into it today to do some memory profiling.

pablogranolabar avatar Nov 07 '22 19:11 pablogranolabar

Can you provide some more details on how the "dynamic quantization" works in PyTorch? If it is just converting the weights to 8-bit integers, then the memory reduction factor will be at most x2.

ggerganov avatar Nov 07 '22 20:11 ggerganov

@ggerganov for FP16, yup. For FP32, 4x. As far as I understand, your implementation switches between the two, so the benefit might be slightly more than 2x? The up-to-date PyTorch documentation is here.

The speedup does not necessarily track the memory reduction, but it might be similar. I imagine the gain would be more noticeable for non-ARM users without FP16 vector arithmetic. The danger here is loss of accuracy: I am not sure how robust the Whisper models are to this, or whether the above repo does anything to remedy it (here are PyTorch recommendations on this). Overall, small models should not be subjected to quantization AFAIK, but the relatively large models might benefit immensely.
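
To make the size factors concrete, here is a small self-contained sketch (a toy Linear-heavy model, not Whisper) that serializes the state dict before and after dynamic quantization and compares sizes; expect roughly 4x from FP32, and roughly 2x if the starting point were FP16:

```python
import io
import torch

def state_dict_size_mb(model: torch.nn.Module) -> float:
    """Serialize the state dict to an in-memory buffer and report MB."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

# Toy stand-in for the Linear-heavy parts of a transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Roughly a 4x reduction (FP32 -> int8), minus a small overhead for the
# per-tensor scales and zero-points.
print(f"fp32: {state_dict_size_mb(model):.2f} MB")
print(f"int8: {state_dict_size_mb(quantized):.2f} MB")
```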

Great work BTW, loving this repo :)

meakbiyik avatar Nov 09 '22 22:11 meakbiyik

Yes - there are some tensors from the model that are currently FP32 instead of FP16, because it was easier to first implement the operations in FP32 mode. See this comment for more information: https://github.com/ggerganov/whisper.cpp/issues/132#issuecomment-1311891779

At some point we should convert all tensors of the model to FP16 - this is what the original model uses, so it should be stable. But I am not really worried about this for now, because I don't expect a big performance benefit - it's mostly 1-dimensional bias tensors that are left to convert, and those are very small.
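
For reference, the conversion rule described here can be sketched as follows (a hypothetical helper, not the actual conversion script):

```python
import torch

def convert_tensor(tensor: torch.Tensor) -> torch.Tensor:
    # Current state described above: 1-dimensional tensors (mostly
    # biases) remain FP32; everything else is stored as FP16.
    if tensor.ndim == 1:
        return tensor.to(torch.float32)
    return tensor.to(torch.float16)
```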

ggerganov avatar Nov 11 '22 16:11 ggerganov