Quantization
Hi, have you tried quantizing Mamba? Do you plan on releasing quantized versions? Can you share your thoughts on quantizing Mamba, given the sensitivity of the model's recurrent dynamics? Thanks
We have not tried quantization; it's an open question. It would be very interesting to understand how sensitive the model is to the SSM params. E.g. I could imagine quantizing the nn.Linear weights but keeping the SSM params and states in high precision.
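As a concrete (untested) illustration of that idea: PyTorch's dynamic quantization only touches the module types you list, so pointing it at nn.Linear gives int8 linear weights while every other parameter, including the SSM params, stays in full precision. The model class, checkpoint name, and the assumption that the rest of the network tolerates this are mine, not something verified in this thread.

```python
# Minimal sketch: int8-quantize only the nn.Linear weights of a Mamba model,
# leaving SSM parameters (A_log, D, dt, conv weights, ...) and states in fp32.
# Assumes the mamba_ssm package and a Hugging Face checkpoint; untested.
import torch
import torch.nn as nn
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m").eval()

# quantize_dynamic replaces only the listed module types (here nn.Linear)
# with int8-weight versions; all other parameters are left untouched.
# Note: PyTorch dynamic quantization runs on the CPU backend.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Sanity check: the in_proj/out_proj layers should now print as dynamic
# quantized Linear modules, while A_log, D, conv1d, etc. remain fp32.
print(qmodel)
```

Dynamic quantization is just the least invasive way to probe sensitivity; the SSM state itself lives inside the fused scan kernel, so quantizing it would need custom handling.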
I would love an update on this
Hello, we have some initial results to share, but the paper is still under review. Please see the preview version at https://hychiang.info/projects/quamba/
Here's a paper being presented at the Next-Generation Sequence Modeling Workshop at ICML next week: https://arxiv.org/abs/2406.09477
The takeaway is that, for quantization-aware training and inference on LRA, most parameters can be quantized below 8 bits (uint8), but the recurrent matrix A/lambda is the most sensitive: performance changes dramatically under 8 bits.
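To make that takeaway concrete, here is a toy sketch (my own, not the paper's code) of per-tensor symmetric fake quantization with a configurable bit width, which is the kind of knob such experiments sweep; the bit-width choices below follow the stated finding, the code itself is only illustrative.

```python
# Toy sketch of per-tensor symmetric fake quantization at a chosen bit width.
# Illustrative only; not the code from the linked paper.
import torch

def fake_quantize(x: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Round x onto a symmetric uniform grid with 2**num_bits levels."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

W = torch.randn(512, 512)   # stand-in for a projection weight: tolerates low bits
A = -torch.rand(512, 16)    # stand-in for the recurrent matrix A/lambda: sensitive

W4 = fake_quantize(W, num_bits=4)   # fine for most params per the paper
A8 = fake_quantize(A, num_bits=8)   # A degrades sharply below 8 bits

print("W 4-bit rel. error:", ((W - W4).norm() / W.norm()).item())
print("A 8-bit rel. error:", ((A - A8).norm() / A.norm()).item())
```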
This recent preprint might also be of interest: https://arxiv.org/abs/2407.12397