Vanishing gradient problem with more layers
Dear author, I stacked multiple Mamba layers to form a model and trained it from scratch. With only 4 layers, the performance was very good, so I decided to increase the number of layers.
However, when I stacked 8 layers, I ran into a vanishing gradient problem: the model stayed at low performance and did not improve with training. I increased the training data to 20 times more than before, but the problem is still there. I have also tried several remedies, such as different learning rates and residual connections, but I still cannot solve the problem.
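For reference, a minimal sketch of the kind of stacking I mean, with a pre-norm residual connection around each layer (it assumes the mamba_ssm package's Mamba module; the MambaBlock and StackedMamba class names are just illustrative, not my exact code):

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # from the state-spaces/mamba package


class MambaBlock(nn.Module):
    """Pre-norm residual wrapper around a single Mamba layer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model)

    def forward(self, x):
        # The residual connection keeps a direct gradient path through the stack.
        return x + self.mixer(self.norm(x))


class StackedMamba(nn.Module):
    """Stack of n_layers Mamba blocks (e.g. 4 or 8)."""

    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(MambaBlock(d_model) for _ in range(n_layers))
        self.norm_f = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); Mamba's fused kernels require CUDA.
        for layer in self.layers:
            x = layer(x)
        return self.norm_f(x)
```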
Have you encountered this problem before, or do you have any suggestions?
Best regards