Megatron-DeepSpeed

Strategies for recovering from loss spikes

Open stas00 opened this issue 2 years ago • 0 comments

After a 3->8->3 spike in the loss value a few days ago, which luckily recovered after a few hours of training, we want to discuss ready-to-use strategies that we can deploy quickly should a future spike not recover.
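To make "the spike did not recover" concrete, here is a minimal sketch of a check over a logged loss curve. The function name, the window size, and the 2x threshold are all assumptions chosen for illustration; nothing like this is confirmed to exist in the training code.

```python
from statistics import median


def find_unrecovered_spike(losses, window=5, ratio=2.0):
    """Return the step where a still-ongoing spike started, or None.

    The first `window` losses are assumed healthy and seed the baseline.
    After that, a loss is "spiking" unless it is at most `ratio` times the
    median of the last `window` healthy losses. The spike counts as
    recovered as soon as one later loss is back under that threshold.
    """
    baseline = []       # losses considered healthy so far
    spike_start = None  # step at which the current spike began
    for step, loss in enumerate(losses):
        if len(baseline) < window:
            baseline.append(loss)
            continue
        threshold = ratio * median(baseline[-window:])
        if loss <= threshold:
            # Healthy step: any spike in progress has recovered.
            spike_start = None
            baseline.append(loss)
        elif spike_start is None:
            # Comparisons with NaN are False, so a NaN loss also lands here.
            spike_start = step
    return spike_start
```

A 3->8->3 curve like the one described above returns `None` because the loss came back down, while a curve that stays at 8 returns the step where the spike began.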

Notes from the slack so far:

Iz Beltagy:

In case the model gets stuck in one of the spikes and doesn’t come back, we can restart it from an earlier checkpoint but shuffle the data, reset the optimizer state, switch to fp32, lower lr, change optimizer params …
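One way to have these options "ready to use" is to express each one as a single-knob override on top of a rollback to an earlier checkpoint. The sketch below is only an illustration: every config key, every default value, and the `recovery_candidates` helper are hypothetical and are not Megatron-DeepSpeed arguments.

```python
import copy

# Hypothetical run config; keys and values are placeholders for illustration.
BASE = {
    "checkpoint_step": 10000,
    "seed": 42,
    "load_optimizer_state": True,
    "precision": "fp16",
    "lr": 6e-5,
    "adam_eps": 1e-8,
}


def recovery_candidates(base, rollback_step, lr_scale=0.5):
    """Build one restart config per strategy, each changing a single knob."""
    overrides = {
        "new_seed": {"seed": base["seed"] + 1},              # changes data order
        "reset_optimizer": {"load_optimizer_state": False},  # fresh optimizer state
        "fp32": {"precision": "fp32"},
        "lower_lr": {"lr": base["lr"] * lr_scale},
        "optimizer_params": {"adam_eps": base["adam_eps"] * 10},
    }
    candidates = {}
    for name, change in overrides.items():
        cfg = copy.deepcopy(base)
        cfg["checkpoint_step"] = rollback_step  # every strategy restarts earlier
        cfg.update(change)
        candidates[name] = cfg
    return candidates
```

Calling `recovery_candidates(BASE, rollback_step=9000)` yields five configs that differ from the rolled-back baseline in exactly one setting each, which keeps the comparison between strategies clean. The dictionary order is not a ranking.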

Stas:

Do you think we should be prepared and have a few of these options documented from the best choice to least, or deal with it if and when it happens? e.g. I don't think we can shuffle the data, other than perhaps changing the seed?
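As a toy illustration of the seed idea: if the sample order is derived from a seeded permutation (an assumption here, not something this thread confirms about the actual data loader), then changing the seed gives a different but still reproducible order over the same samples. `sample_order` is a hypothetical helper.

```python
import random


def sample_order(num_samples, seed):
    """Return a deterministic permutation of sample indices for `seed`."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)  # private RNG, global state untouched
    return order
```

The same seed always reproduces the same order, so a restart with a new seed stays reproducible while still presenting the data in a different sequence.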

Ryan Teehan:

I think that would be a good idea, both as a way to inform people about decisions and as a way to develop justifications and reasons for best practices.

Iz Beltagy:

It would be great if we had these implemented and ready to use. As for knowing which choices are more effective, that is something we would figure out empirically, and it would be one of the contributions of the project.

@ibeltagy

stas00 · Aug 10 '21 02:08