Vanishing gradient problem with more layers
Dear author, I stacked multiple Mamba layers to form a model and trained it from scratch. With only 4 layers, the performance was very good, so I decided to increase the number of layers.
However, when I stacked 8 layers, I ran into a vanishing gradient problem: the model stayed at low performance and did not improve with training. I increased the training data to 20 times more than before, but the problem is still there. I have also tried several remedies, such as different learning rates and residual connections, but I still cannot solve the problem.
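For reference, a minimal sketch of the kind of stacking I mean, with a pre-norm residual connection around each layer (it assumes the mamba_ssm package's Mamba module; the MambaBlock and StackedMamba class names are just illustrative, not my exact code):

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # from the state-spaces/mamba package


class MambaBlock(nn.Module):
    """Pre-norm residual wrapper around a single Mamba layer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model)

    def forward(self, x):
        # The residual connection keeps a direct gradient path through the stack.
        return x + self.mixer(self.norm(x))


class StackedMamba(nn.Module):
    """Stack of n_layers Mamba blocks (e.g. 4 or 8)."""

    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(MambaBlock(d_model) for _ in range(n_layers))
        self.norm_f = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); Mamba's fused kernels require CUDA.
        for layer in self.layers:
            x = layer(x)
        return self.norm_f(x)
```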
Have you encountered this problem before, or do you have any suggestions?
Best regards