
Quantization

arman-kazemi opened this issue 1 year ago • 6 comments

Hi, have you tried quantizing Mamba? Do you plan on releasing quantized versions? Can you share your thoughts on quantizing Mamba, given the sensitivity of the model's recurrent dynamics? Thanks!

arman-kazemi avatar Jan 26 '24 23:01 arman-kazemi

We have not tried quantization; it's an open question. It would be very interesting to understand how sensitive the model is to the SSM params. E.g. I could imagine quantizing the nn.Linear weights but keeping the SSM params and states in high precision.

tridao avatar Jan 26 '24 23:01 tridao
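To make the suggestion above concrete, here is a minimal sketch (not from this thread, and not an official recipe) of weight-only int8 quantization that touches only nn.Linear layers, leaving everything else, including SSM parameters such as A_log and D (names taken from the mamba_ssm implementation; adjust if they differ), in their original precision:

```python
import torch
import torch.nn as nn


class Int8Linear(nn.Module):
    """Weight-only int8 linear: weights stored as int8 plus a per-row scale,
    dequantized to the activation dtype at matmul time."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        self.register_buffer("w_int8", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias  # keep the bias (if any) in original precision

    def forward(self, x):
        w = self.w_int8.to(x.dtype) * self.scale.to(x.dtype)
        return nn.functional.linear(x, w, self.bias)


def quantize_linears_only(model: nn.Module) -> nn.Module:
    """Swap every nn.Linear for Int8Linear. SSM parameters (e.g. A_log, D)
    are plain nn.Parameters, not nn.Linear modules, so this pass leaves
    them, and the recurrent states, untouched."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, Int8Linear(child))
        else:
            quantize_linears_only(child)
    return model
```

This is a post-training, weight-only scheme chosen purely for illustration; how much accuracy it costs on a real Mamba checkpoint is exactly the open question raised above.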

I would love an update on this

radna0 avatar Jun 18 '24 19:06 radna0

Hello, we have some initial results to share, but the paper is still under review. Please see the preview at https://hychiang.info/projects/quamba/

hychiang-git avatar Jul 13 '24 16:07 hychiang-git

Here's a paper being presented at the Next-Generation Sequence Modeling Workshop at ICML next week: https://arxiv.org/abs/2406.09477

The takeaway is that, for quantization-aware training and inference on LRA, most parameters can be quantized to below uint8, but the recurrent matrix A/lambda is the most sensitive and performance degrades dramatically below 8 bits.

This recent preprint might also be of interest: https://arxiv.org/abs/2407.12397

kmheckel avatar Jul 20 '24 16:07 kmheckel
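For anyone who wants to experiment with that mixed-precision finding, here is a hedged sketch of the generic straight-through-estimator fake quantization commonly used for quantization-aware training. The bit width, the A_log parameter name, and where to apply it are assumptions for illustration, not the recipe from either paper:

```python
import torch


def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric uniform fake quantization with a straight-through estimator:
    the forward pass sees quantized values, gradients flow through unchanged."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().amax().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # STE: quantized forward, identity backward


# Illustrative usage inside a block's forward pass (hypothetical attribute names):
#   w_proj = fake_quant(self.in_proj.weight, bits=4)  # most weights: low bit
#   A = -torch.exp(self.A_log)  # keep the recurrent matrix A at >= 8 bits / fp
```

The point of the sketch is only to show where a bit-width split between A and the rest of the parameters would go; the cited papers should be consulted for the actual quantization schemes and results.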