Quantization
Hi, have you tried quantizing Mamba? Do you plan on releasing quantized versions? Can you share your thoughts on quantizing Mamba, given the sensitivity of the model's recurrent dynamics? Thanks
We have not tried quantization; it's an open question. It would be very interesting to understand how sensitive the model is to the SSM params. E.g. I could imagine quantizing the nn.Linear weights but keeping the SSM params and states in high precision.
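As a concrete (untested) illustration of that idea: PyTorch's dynamic quantization only touches the module types you list, so pointing it at nn.Linear gives int8 linear weights while every other parameter, including the SSM params, stays in full precision. The model class, checkpoint name, and the assumption that the rest of the network tolerates this are mine, not something verified in this thread.

```python
# Minimal sketch: int8-quantize only the nn.Linear weights of a Mamba model,
# leaving SSM parameters (A_log, D, dt, conv weights, ...) and states in fp32.
# Assumes the mamba_ssm package and a Hugging Face checkpoint; untested.
import torch
import torch.nn as nn
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m").eval()

# quantize_dynamic replaces only the listed module types (here nn.Linear)
# with int8-weight versions; all other parameters are left untouched.
# Note: PyTorch dynamic quantization runs on the CPU backend.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Sanity check: the in_proj/out_proj layers should now print as dynamic
# quantized Linear modules, while A_log, D, conv1d, etc. remain fp32.
print(qmodel)
```

Dynamic quantization is just the least invasive way to probe sensitivity; the SSM state itself lives inside the fused scan kernel, so quantizing it would need custom handling.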
I would love an update on this
Hello, we have some initial results to share, but the paper is still under review. Please see the preview version at https://hychiang.info/projects/quamba/
Here's a paper being presented at the Next-Generation Sequence Modeling Workshop at ICML next week: https://arxiv.org/abs/2406.09477
The takeaway is that, for quantization-aware training and inference on LRA, most parameters can be quantized below 8 bits (uint8), but the recurrent matrix A/lambda is the most sensitive: performance changes dramatically under 8 bits.
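To make that takeaway concrete, here is a toy sketch (my own, not the paper's code) of per-tensor symmetric fake quantization with a configurable bit width, which is the kind of knob such experiments sweep; the bit-width choices below follow the stated finding, the code itself is only illustrative.

```python
# Toy sketch of per-tensor symmetric fake quantization at a chosen bit width.
# Illustrative only; not the code from the linked paper.
import torch

def fake_quantize(x: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Round x onto a symmetric uniform grid with 2**num_bits levels."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

W = torch.randn(512, 512)   # stand-in for a projection weight: tolerates low bits
A = -torch.rand(512, 16)    # stand-in for the recurrent matrix A/lambda: sensitive

W4 = fake_quantize(W, num_bits=4)   # fine for most params per the paper
A8 = fake_quantize(A, num_bits=8)   # A degrades sharply below 8 bits

print("W 4-bit rel. error:", ((W - W4).norm() / W.norm()).item())
print("A 8-bit rel. error:", ((A - A8).norm() / A.norm()).item())
```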
This recent preprint might also be of interest: https://arxiv.org/abs/2407.12397