mamba
Implementation sensitive to token numbering
I have tried training the model with different token orderings, and I have noticed that whenever the token order is random, it generates NaNs in the embedding layer. I'm not sure how to interpret this, or whether it is even a correct observation.
The embedding layer is independent of anything related to Mamba, so I'm not sure what could cause this. Did you try removing the model and seeing if this phenomenon still happens?
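A minimal way to test that, as a sketch (the vocabulary size, dimensions, and token tensor below are placeholders, not the actual setup): train only an `nn.Embedding` plus a linear head on the same token stream and check whether NaNs still show up in the embedding weights without any Mamba blocks.

```python
import torch
import torch.nn as nn

vocab_size = 16000                                   # placeholder: your real vocabulary size
tokens = torch.randint(0, vocab_size, (8, 128))      # stand-in for your own batches

# Embedding + linear head only, no Mamba blocks.
model = nn.Sequential(
    nn.Embedding(vocab_size, 256),
    nn.Linear(256, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    logits = model(tokens[:, :-1])                   # predict the next token
    loss = loss_fn(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if torch.isnan(model[0].weight).any():
        print(f"NaN in embedding weights at step {step}")
        break
```

If this isolated loop stays finite with the same token IDs, the problem is more likely in the rest of the model or the training setup than in the embedding itself.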
That is my understanding! But even after gradient clipping, the problem persisted. For example, when I try to predict the next tokens as 12179 4675 11374 1807, ..., the network starts to generate NaNs after one optimizer step. When I changed the token ordering to 1 2 3 4 5, ..., the problem was resolved. The embedding layer becoming NaN could point to a very large gradient, but I am still not sure why it persisted after clipping.
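One thing worth noting about the clipping: `clip_grad_norm_` rescales finite gradients, but a gradient that is already NaN/Inf stays NaN/Inf after clipping, so the weights still go NaN on the next step. A debugging sketch (not the actual training loop; `model` and `max_norm` are placeholders) that can be called between `loss.backward()` and `optimizer.step()` to see where the non-finite values first appear:

```python
import torch

def check_grads(model, loss, max_norm=1.0):
    # Call after loss.backward() and before optimizer.step().
    # Clipping rescales finite gradients; a gradient that is already
    # NaN/Inf stays NaN/Inf, so clipping cannot fix it.
    if not torch.isfinite(loss):
        print(f"non-finite loss: {loss.item()}")
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(total_norm):
        print(f"non-finite gradient norm: {total_norm.item()}")
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite gradient in {name}")
```

If the loss or the embedding gradient is already non-finite before clipping, the issue is upstream of the optimizer (bad inputs, overflow in the forward pass, or out-of-range token IDs), not insufficient clipping.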
I once set the wrong vocabulary size and got NaNs as well; it might be helpful to double-check that.
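If that turns out to be the cause, a quick sanity check along these lines might help (a sketch; `dataset` and `vocab_size` below are stand-ins for your own batches and config), since a token ID at or above the embedding size can surface on GPU as cryptic errors or corrupted values rather than a clear exception:

```python
import torch

vocab_size = 16000  # placeholder: the size passed to nn.Embedding / the model config
dataset = [torch.randint(0, vocab_size, (8, 128)) for _ in range(4)]  # stand-in for real batches

# Every token ID must satisfy 0 <= id < vocab_size before it reaches the embedding.
max_id = max(int(batch.max()) for batch in dataset)
assert max_id < vocab_size, f"token id {max_id} is out of range for vocab_size={vocab_size}"
```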