mamba
Implementation sensitive to token numbering
I have tried training the model with different token orderings, and I have noticed that whenever the token order is random, it generates NaNs in the embedding layer. I'm not sure how to interpret this, or whether it is even a correct observation.
The embedding layer is independent of anything related to Mamba, so I'm not sure what could cause this. Did you try removing the model and seeing if this phenomenon still happens?
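A minimal way to test that, as a sketch (the vocabulary size, dimensions, and token tensor below are placeholders, not the actual setup): train only an `nn.Embedding` plus a linear head on the same token stream and check whether NaNs still show up in the embedding weights without any Mamba blocks.

```python
import torch
import torch.nn as nn

vocab_size = 16000                                   # placeholder: your real vocabulary size
tokens = torch.randint(0, vocab_size, (8, 128))      # stand-in for your own batches

# Embedding + linear head only, no Mamba blocks.
model = nn.Sequential(
    nn.Embedding(vocab_size, 256),
    nn.Linear(256, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    logits = model(tokens[:, :-1])                   # predict the next token
    loss = loss_fn(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if torch.isnan(model[0].weight).any():
        print(f"NaN in embedding weights at step {step}")
        break
```

If this isolated loop stays finite with the same token IDs, the problem is more likely in the rest of the model or the training setup than in the embedding itself.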
That is my understanding! But even after gradient clipping, the problem persisted. For example, when I try to predict the next tokens as 12179 4675 11374 1807, ..., the network starts to generate NaNs after one optimizer step. When I changed the token ordering to 1 2 3 4 5, ..., the problem was resolved. The embedding layer becoming NaN could point to a very large gradient, but I am still not sure why it persisted after clipping.
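One thing worth noting about the clipping: `clip_grad_norm_` rescales finite gradients, but a gradient that is already NaN/Inf stays NaN/Inf after clipping, so the weights still go NaN on the next step. A debugging sketch (not the actual training loop; `model` and `max_norm` are placeholders) that can be called between `loss.backward()` and `optimizer.step()` to see where the non-finite values first appear:

```python
import torch

def check_grads(model, loss, max_norm=1.0):
    # Call after loss.backward() and before optimizer.step().
    # Clipping rescales finite gradients; a gradient that is already
    # NaN/Inf stays NaN/Inf, so clipping cannot fix it.
    if not torch.isfinite(loss):
        print(f"non-finite loss: {loss.item()}")
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(total_norm):
        print(f"non-finite gradient norm: {total_norm.item()}")
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite gradient in {name}")
```

If the loss or the embedding gradient is already non-finite before clipping, the issue is upstream of the optimizer (bad inputs, overflow in the forward pass, or out-of-range token IDs), not insufficient clipping.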
I once set the wrong vocabulary size and got NaNs as well; it might be helpful to double-check that.
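If that turns out to be the cause, a quick sanity check along these lines might help (a sketch; `dataset` and `vocab_size` below are stand-ins for your own batches and config), since a token ID at or above the embedding size can surface on GPU as cryptic errors or corrupted values rather than a clear exception:

```python
import torch

vocab_size = 16000  # placeholder: the size passed to nn.Embedding / the model config
dataset = [torch.randint(0, vocab_size, (8, 128)) for _ in range(4)]  # stand-in for real batches

# Every token ID must satisfy 0 <= id < vocab_size before it reaches the embedding.
max_id = max(int(batch.max()) for batch in dataset)
assert max_id < vocab_size, f"token id {max_id} is out of range for vocab_size={vocab_size}"
```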