
Any suggestions for regularization?

Open drscotthawley opened this issue 1 year ago • 8 comments

Dear Mamba-SSM team, congratulations on your success! Obviously many of us are excited about exploring the applications of your work.

Since there's no dropout in your model, what do you suggest for imposing regularization? I applied the drop-in Mamba replacement that was used with Karpathy's GPT example to my MIDI Transformer Colab (which is paired with a pre-Mamba blog post).

Compared to the original multi-head attention version, the Mamba-powered version runs faster but also overfits a lot more: validation losses bottom out sooner, and at a higher value, with Mamba than with vanilla multi-head attention.

So far, I've tried

  • increasing weight decay
  • clipping gradient values

...using the values reported in your paper, but nothing I've tried seems to have any regularizing effect. (A sketch of both knobs follows this comment.)

What Mamba parameters would you suggest tweaking to improve generalization?

Thanks!

drscotthawley avatar Dec 31 '23 20:12 drscotthawley
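
For reference, here is a minimal sketch (not from this repo) of the two knobs mentioned above, weight decay via AdamW and gradient-norm clipping, in a plain PyTorch loop. The stand-in model, learning rate, and clipping value are illustrative assumptions, not the authors' training recipe.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

# Stand-in model; in practice this would be the Mamba-based language model.
model = nn.Linear(64, 64)
# weight_decay is the main "regularization" knob on the optimizer side.
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

x = torch.randn(8, 64)       # dummy batch
target = torch.randn(8, 64)  # dummy targets

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
# Clip the global gradient norm rather than individual values; 1.0 is a common default.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```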

I’m very interested to see the response here from the authors, but I’ll say something potentially wrong.

I'm not familiar with the architecture used to generate MIDI with the Transformer mechanism, but in the original Mamba paper the authors were pretty clear that continuous signals do not benefit from Mamba, and that something like S4 seemed to outperform it there. So the behavior you're seeing may have less to do with regularization and more to do with applying Mamba to this particular problem.

chazzmoney avatar Jan 05 '24 19:01 chazzmoney

You can use dropout, just like Transformers. It's not implemented here but you can add it.

tridao avatar Jan 05 '24 20:01 tridao
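
Below is a minimal sketch of one way to add dropout around the Mamba mixer, assuming the `mamba_ssm` package's `Mamba` module. The placement (pre-norm, dropout on the mixer output before the residual add) mirrors where residual dropout usually sits in a Transformer block; it is an illustration, not an official recipe from this repo.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumes mamba_ssm (and its CUDA kernels) is installed

class MambaBlockWithDropout(nn.Module):
    """Pre-norm residual Mamba block with dropout on the mixer output (illustrative)."""

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model)  # default d_state / d_conv / expand
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Dropout before the residual add, analogous to residual dropout in Transformers.
        return x + self.drop(self.mixer(self.norm(x)))

# Usage sketch (Mamba's fused kernels expect CUDA tensors):
block = MambaBlockWithDropout(d_model=256, dropout=0.1).cuda()
y = block(torch.randn(2, 128, 256, device="cuda"))  # (batch, seq_len, d_model)
```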

You can use dropout, just like Transformers. It's not implemented here but you can add it.

Perhaps this thread would be more useful if tagged as a feature request, then? (Just for traceability, so it doesn't get lost.)

ElliottDyson avatar Feb 02 '24 23:02 ElliottDyson

This is still an issue; I trained the model for 10 epochs on a specific use case and the training loss drops drastically while the model overfits quickly. (Screenshot: training curves, 2024-07-06.)

Anri-Lombard avatar Jul 06 '24 14:07 Anri-Lombard

Has anyone implemented this with some success?

Anri-Lombard avatar Jul 06 '24 14:07 Anri-Lombard

I find that adding dropout decreases performance for state space models. Does anyone else also observe this phenomenon?

windsornguyen avatar Jul 12 '24 04:07 windsornguyen

I added dropout and it actually improved performance for me. (Screenshot: training convergence plot.)

Anri-Lombard avatar Jul 12 '24 05:07 Anri-Lombard

Although Mamba still performs quite badly in my use case, it did learn better with dropout.

Anri-Lombard avatar Jul 12 '24 05:07 Anri-Lombard