[Proposal] Add support for Mamba
Proposal
Mamba is reported to be "best-in-class on every single evaluation result, and generally matches baselines at twice the model size." It won't be long before we see more language models in the wild built on the Mamba architecture.
Paper: https://arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf
Code: https://github.com/state-spaces/mamba
If there is support for the proposal, I would like to work on the implementation.
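For reference, the core computation an implementation would need to cover is the selective SSM scan described in the paper. Below is a minimal sequential PyTorch sketch of that recurrence, just to make the discussion concrete: the variable names are mine, and the official repo replaces this loop with a fused parallel-scan CUDA kernel rather than anything this slow.

```python
import torch

def selective_scan_reference(x, delta, A, B, C, D):
    """Sequential reference of the selective SSM recurrence from the Mamba paper.

    Shapes (names are illustrative, not the official repo's):
      x:     (batch, length, d_inner)   input sequence
      delta: (batch, length, d_inner)   input-dependent step sizes
      A:     (d_inner, d_state)         state matrix
      B:     (batch, length, d_state)   input-dependent input projection
      C:     (batch, length, d_state)   input-dependent output projection
      D:     (d_inner,)                 skip connection
    """
    batch, length, d_inner = x.shape

    # Zero-order-hold discretisation: A_bar = exp(delta * A), B_bar * x ~= delta * B * x
    deltaA = torch.exp(delta.unsqueeze(-1) * A)                       # (b, l, d_inner, d_state)
    deltaBx = delta.unsqueeze(-1) * B.unsqueeze(2) * x.unsqueeze(-1)  # (b, l, d_inner, d_state)

    h = torch.zeros(batch, d_inner, A.shape[1], device=x.device, dtype=x.dtype)
    ys = []
    for t in range(length):
        h = deltaA[:, t] * h + deltaBx[:, t]           # h_t = A_bar_t h_{t-1} + B_bar_t x_t
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))  # y_t = C_t h_t
    y = torch.stack(ys, dim=1)                         # (b, l, d_inner)
    return y + x * D                                   # residual skip through D
```

The explicit loop is O(length) and slow, but it keeps the per-token state update easy to inspect, which is probably what we want for interpretability anyway.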
I'm excited for people to work on adding new architectures to TransformerLens! :)
However, your figure is not the most important figure in that paper. None of the models there use the "Transformer++" recipe (SwiGLU + parallel attention + grouped-query attention + overtraining) that Llama and Mistral use; when compared against Transformer++, Mamba is not a clear winner. But it may be better!
Ahh, good catch. Thanks for pointing that out. As adoption picks up, I'd be interested to see the evaluation metrics compared against Transformer++-based architectures.
In the meantime I'll get started on adding Mamba and should have a PR out soon.
I could also help, would love to do some cool mech interp things on state space models!
That would be awesome! I started some work here. Feel free to take a look and let me know what you think.
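For anyone who wants to poke at this in the meantime, here is a rough sketch of how hook points could be attached to a Mamba-style block in the TransformerLens style. To be clear, this is not what the work-in-progress actually does: the block internals are toy stand-ins and the hook names are made up; it only assumes `HookPoint` / `HookedRootModule` from `transformer_lens.hook_points`, the same machinery `HookedTransformer` uses.

```python
import torch
import torch.nn as nn
from transformer_lens.hook_points import HookedRootModule, HookPoint

class MambaBlockWithHooks(HookedRootModule):
    """Toy block showing where hook points could sit; not the real Mamba mixer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)         # stand-in; Mamba uses RMSNorm
        self.mixer = nn.Linear(d_model, d_model)  # stand-in for the selective SSM mixer
        self.hook_normalized = HookPoint()        # activations after the norm
        self.hook_mixer_out = HookPoint()         # output of the mixer
        self.hook_resid_post = HookPoint()        # residual stream after the block
        self.setup()                              # registers the hook-point names

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        normed = self.hook_normalized(self.norm(resid))
        mixed = self.hook_mixer_out(self.mixer(normed))
        return self.hook_resid_post(resid + mixed)

# Caching an activation works the same way as with HookedTransformer:
block = MambaBlockWithHooks(d_model=16)
cache = {}

def store_hook(act, hook):
    cache[hook.name] = act.detach()

block.run_with_hooks(
    torch.randn(2, 5, 16),
    fwd_hooks=[("hook_mixer_out", store_hook)],
)
print(cache["hook_mixer_out"].shape)  # torch.Size([2, 5, 16])
```

The interesting interpretability-specific question is probably which internal quantities (e.g. the per-token SSM state) deserve their own hook points, since there is no attention pattern to cache.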