Would you post a minimal example of training this?
Amazing work, and I'm inspired by the connections to dynamical systems.
Would you mind showing us a minimal example of training or finetuning this?
same problem
We released just the core model because it can be dropped in as a replacement for the model in any training or fine-tuning pipeline, of which there are many. Is there an example application you have in mind?
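For concreteness, a minimal sketch of what that drop-in usage can look like, assuming the MambaLMHeadModel class and its from_pretrained helper in mamba_ssm.models.mixer_seq_simple (the CUDA kernels require a GPU):

```python
# Minimal sketch: the released model is a plain nn.Module mapping token ids to logits,
# so it can replace the model in an existing causal-LM pipeline.
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel  # path as in the repo

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m", device="cuda")
input_ids = torch.randint(0, 50277, (1, 128), device="cuda")  # any tokenized batch
logits = model(input_ids).logits  # (batch, seq_len, vocab), like any causal LM
```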
Thanks for the reply, and whoa, just any PyTorch training setup will do? I'm mainly interested in next-token prediction.
Does it get along with, say, the Accelerate ecosystem for multi-node/multi-GPU training? I saw transformers in setup.py; how does that work? I thought this architecture wasn't related to it.
I assume optimizations like FlashAttention are no longer relevant?
When you release larger models (fingers crossed!), bitsandbytes will likely become relevant, as well as PEFT, QLoRA, and DeepSpeed.
I'm also curious about some training hyperparameters: learning rate? AdamW? weight decay?
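A minimal next-token-prediction loop along those lines might look as follows. The optimizer settings (AdamW with the learning rate, betas, and weight decay shown) are common language-model defaults, not values confirmed for Mamba, and the class path is assumed from the mamba_ssm package:

```python
# Sketch of plain-PyTorch next-token prediction with MambaLMHeadModel.
# Hyperparameters are illustrative, not the ones used for the released checkpoints.
import torch
import torch.nn.functional as F
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

device = "cuda"
model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m", device=device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

def train_step(input_ids):
    """One optimization step on a batch of token ids of shape (batch, seq_len)."""
    logits = model(input_ids).logits                 # (batch, seq_len, vocab)
    loss = F.cross_entropy(                          # shift so position t predicts token t+1
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Replace with batches from a real tokenized dataset (the released models use the GPT-NeoX tokenizer).
dummy_batch = torch.randint(0, 50277, (2, 512), device=device)
print(train_step(dummy_batch))
```

Because the model only consumes input_ids and returns logits, wrappers like Accelerate or DeepSpeed should be able to treat it like any other nn.Module.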
Agreed, even an example with the Hugging Face Trainer would be lovely. I am running into issues using the model with the Hugging Face Trainer, and even with plain causal language modeling in Transformers without the Trainer. Thank you for the incredible work as well; this is amazing.
See https://github.com/state-spaces/mamba/issues/6: I tried DeepSpeed ZeRO-3 with the HF Trainer API, and it looks good.
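For anyone trying to reproduce that, here is a hedged sketch of one way to wire the HF Trainer to the bare model with a DeepSpeed ZeRO-3 config. The wrapper class name is made up for illustration, and the DeepSpeed config is a minimal example rather than the one used in that issue:

```python
# Sketch: wrap MambaLMHeadModel so forward() returns a loss when labels are passed,
# then hand a ZeRO-3 config to TrainingArguments. Launch with the deepspeed launcher.
import torch.nn.functional as F
from torch import nn
from transformers import Trainer, TrainingArguments
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

class MambaForCausalLMWrapper(nn.Module):  # illustrative name, not part of either library
    def __init__(self, name="state-spaces/mamba-130m"):
        super().__init__()
        self.backbone = MambaLMHeadModel.from_pretrained(name)

    def forward(self, input_ids, labels=None, **kwargs):
        logits = self.backbone(input_ids).logits
        loss = None
        if labels is not None:  # shifted cross-entropy, since the Trainer expects a loss
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
            )
        return {"loss": loss, "logits": logits}

ds_config = {  # minimal ZeRO-3 config; "auto" values are filled in from TrainingArguments
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

args = TrainingArguments(
    output_dir="mamba-out",
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    num_train_epochs=1,
    deepspeed=ds_config,
)

# trainer = Trainer(model=MambaForCausalLMWrapper(), args=args, train_dataset=my_dataset)
# trainer.train()
```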
I added:
- cross-entropy loss,
- a Transformers config interface,
- a Transformers PreTrainedModel interface.
The results:
- saving with safetensors is tested,
- existing checkpoints can be loaded to continue pretraining,
- with 80 GB of VRAM, the maximum batch size is 8 at 4k context length, and one step took ~300 ms.
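For readers who want to replicate that setup, here is a rough sketch of the kind of wrapper described above. The class names are illustrative, and the exact MambaLMHeadModel constructor arguments depend on the mamba_ssm version:

```python
# Sketch: a PretrainedConfig/PreTrainedModel pair around the Mamba backbone so that
# save_pretrained/from_pretrained (safetensors included) and a built-in cross-entropy
# loss work with the Transformers tooling. Class names are hypothetical.
import torch.nn.functional as F
from transformers import PretrainedConfig, PreTrainedModel
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

class MambaHFConfig(PretrainedConfig):
    model_type = "mamba"

    def __init__(self, d_model=768, n_layer=24, vocab_size=50280, **kwargs):
        self.d_model = d_model
        self.n_layer = n_layer
        self.vocab_size = vocab_size
        super().__init__(**kwargs)

class MambaHFForCausalLM(PreTrainedModel):
    config_class = MambaHFConfig

    def __init__(self, config):
        super().__init__(config)
        # Constructor arguments assumed from the current mamba_ssm release.
        self.backbone = MambaLMHeadModel(
            d_model=config.d_model, n_layer=config.n_layer, vocab_size=config.vocab_size
        )

    def forward(self, input_ids, labels=None, **kwargs):
        logits = self.backbone(input_ids).logits
        loss = None
        if labels is not None:  # standard shifted cross-entropy
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
            )
        return {"loss": loss, "logits": logits}

# model.save_pretrained("ckpt") writes config.json plus model.safetensors;
# MambaHFForCausalLM.from_pretrained("ckpt") reloads it to continue pretraining.
```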
Just saw your post. Great work; I tested it on my end with similar success.
Geez, open source is fast. Here's a chattified version with a simple training example: https://github.com/havenhq/mamba-chat/blob/main/train_mamba.py