autoclip
Interaction with learning rate schedule
Has there been any research on how this strategy interacts with a learning rate schedule, especially an extreme one like the one-cycle policy (super-convergence)? It seems like the history of gradient scales would be dominated by changes in the learning rate. I found this paper that touches on the subject, but it doesn't propose any theory behind the interaction between the two, or a solution to it.
As expected, AutoClip doesn't interact well with cosine annealing.
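
For anyone who wants to poke at this, here is a minimal sketch of the mechanism being discussed. It assumes AutoClip clips each step's gradient norm to a fixed percentile of all norms seen so far. `AutoClipper`, `simulate`, and the two-phase synthetic gradients are hypothetical stand-ins, not code from this repo, and the jump in gradient scale is simply assumed rather than produced by a real schedule or model.

```python
import math


class AutoClipper:
    """Hypothetical sketch: clip to the p-th percentile of the norm history."""

    def __init__(self, percentile=10.0):
        self.percentile = percentile
        self.history = []  # every gradient norm observed so far

    def threshold(self):
        # Nearest-rank percentile over the full, unwindowed history.
        ordered = sorted(self.history)
        rank = max(1, math.ceil(self.percentile / 100.0 * len(ordered)))
        return ordered[rank - 1]

    def clip(self, grad):
        # Record the raw norm first, then clip against the updated history.
        norm = math.sqrt(sum(g * g for g in grad))
        self.history.append(norm)
        limit = self.threshold()
        if norm > limit:
            scale = limit / norm
            grad = [g * scale for g in grad]
        return grad, limit


def simulate(phase_scales=(1.0, 10.0), steps_per_phase=50, percentile=10.0):
    """Feed synthetic gradients whose scale changes per phase (an assumption
    standing in for whatever a schedule does to gradient magnitudes)."""
    clipper = AutoClipper(percentile)
    limits = []
    for scale in phase_scales:
        for _ in range(steps_per_phase):
            _, limit = clipper.clip([scale, 0.0])
            limits.append(limit)
    return limits
```

With these made-up inputs, the threshold is set entirely by the first phase and every gradient in the second phase is scaled back down to it, which is the kind of history dominance the question is worried about. Whether real training behaves this way under one-cycle or cosine annealing would need an actual experiment.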