[Feature request] Implement CPU dynamic quantization

Open pablogranolabar opened this issue 1 year ago • 5 comments

e.g. https://github.com/MiscellaneousStuff/openai-whisper-cpu
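
For context, this is roughly what the linked repo does: it passes the PyTorch Whisper model through `torch.quantization.quantize_dynamic`. A minimal sketch, not whisper.cpp code; the model name and audio path are placeholders:

```python
import torch
import whisper  # the openai-whisper package

# Load the reference PyTorch model.
model = whisper.load_model("base")

# Dynamic quantization: weights of the listed module types are converted
# to int8 ahead of time; activations are quantized on the fly at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# fp16=False because this runs on CPU.
result = quantized.transcribe("audio.wav", fp16=False)
print(result["text"])
```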

pablogranolabar avatar Oct 27 '22 18:10 pablogranolabar

@pablogranolabar would this also cut the model size? By a factor of 4?

jafri avatar Nov 02 '22 07:11 jafri

Yah that's the hope. Digging into it today to do some memory profiling.

pablogranolabar avatar Nov 07 '22 19:11 pablogranolabar

Can you provide some more details on how the "dynamic quantization" works in PyTorch? If it is just converting the weights to 8-bit integers, then the memory reduction factor will be at most x2.

ggerganov avatar Nov 07 '22 20:11 ggerganov

@ggerganov for FP16, yup. For FP32, 4x. As far as I understand, your implementation switches between the two, so the benefit might be slightly more than 2x? The up-to-date PyTorch documentation is here.

The speedup does not necessarily track the memory reduction, but it might be similar. I imagine the gain would be more noticeable for non-ARM users without FP16 vector arithmetic. The danger here is loss of accuracy: I am not sure how robust the Whisper models are to this, or whether the above repo does anything to remedy it (here are PyTorch recommendations on this). Overall, small models should not be subjected to quantization AFAIK, but the relatively large models might benefit immensely.
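
To make the size factors concrete, here is a small self-contained sketch (a toy Linear-heavy model, not Whisper) that serializes the state dict before and after dynamic quantization and compares sizes; expect roughly 4x from FP32, and roughly 2x if the starting point were FP16:

```python
import io
import torch

def state_dict_size_mb(model: torch.nn.Module) -> float:
    """Serialize the state dict to an in-memory buffer and report MB."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

# Toy stand-in for the Linear-heavy parts of a transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Roughly a 4x reduction (FP32 -> int8), minus a small overhead for the
# per-tensor scales and zero-points.
print(f"fp32: {state_dict_size_mb(model):.2f} MB")
print(f"int8: {state_dict_size_mb(quantized):.2f} MB")
```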

Great work BTW, loving this repo :)

meakbiyik avatar Nov 09 '22 22:11 meakbiyik

Yes - there are some tensors from the model that are currently FP32 instead of FP16, because it was easier to first implement the operations in FP32 mode. See this comment for more information: https://github.com/ggerganov/whisper.cpp/issues/132#issuecomment-1311891779

At some point we should convert all tensors of the model to FP16 - this is what the original model uses, so it should be stable. But I am not really worried about this for now, because I don't expect a big performance benefit - it's mostly 1-dimensional bias tensors that are left to convert, and those are very small.
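
For reference, the conversion rule described here can be sketched as follows (a hypothetical helper, not the actual conversion script):

```python
import torch

def convert_tensor(tensor: torch.Tensor) -> torch.Tensor:
    # Current state described above: 1-dimensional tensors (mostly
    # biases) remain FP32; everything else is stored as FP16.
    if tensor.ndim == 1:
        return tensor.to(torch.float32)
    return tensor.to(torch.float16)
```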

ggerganov avatar Nov 11 '22 16:11 ggerganov