
Allow for using other Learning Rate Schedulers and Optimizers

Open PonteIneptique opened this issue 4 years ago • 5 comments

Hey! I started reading about some other optimizers as things went through my news feed (stuff like this or that).

I ended up implementing them in pie, but wanted to see first what the results would be. The tests were done as follows: same training set (~500k words), same learning rate, same test set (~63k tokens), CUDA, 10 runs per configuration. No hyperparameter optimization was done.

For optimizers, Ranger and Adam were tested; I did not try anything else. For learning rate schedulers, ReduceLROnPlateau, CosineAnnealing, and Delayed(CosineAnnealing) were tested. Overall patience is 15 steps without improvement. CosineAnnealing T_max is 40, the delay is 10.
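To clarify what I mean by Delayed(CosineAnnealing), here is a minimal sketch (not pie's actual code; the class and argument names are made up): the LR stays flat for `delay` epochs, then follows a plain PyTorch `CosineAnnealingLR` curve.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

class DelayedCosineAnnealing:
    """Keep the LR flat for `delay` epochs, then cosine-anneal over `t_max` epochs."""

    def __init__(self, optimizer, delay=10, t_max=40):
        self.delay = delay
        self.epoch = 0
        self.cosine = CosineAnnealingLR(optimizer, T_max=t_max)

    def step(self):
        self.epoch += 1
        if self.epoch > self.delay:
            # only start annealing once the delay has passed
            self.cosine.step()

# usage sketch
model = torch.nn.Linear(10, 2)  # stand-in for the tagger model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = DelayedCosineAnnealing(optimizer, delay=10, t_max=40)
for epoch in range(60):
    # ... one training epoch would go here ...
    optimizer.step()
    scheduler.step()
```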

Basically, Ranger does not outperform Adam (maybe with other parameters, who knows, since its betas differ from Adam's), but Delayed(CosineAnnealing) reaches the same results in 40% less time.

If you are okay with it, a PR will be under way.

Results:

*(result screenshots)*

PonteIneptique avatar Dec 06 '20 09:12 PonteIneptique

We could include an option to select the lr scheduler. That's easy since it's just swapping the pytorch lr scheduler and adapting the step call. If you have the code around feel free to push a PR and we can see how to include it!
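Something along these lines (untested sketch; the function names are just placeholders, not the actual pie API):

```python
from torch.optim import lr_scheduler

def build_scheduler(name, optimizer, **kwargs):
    # pick the pytorch scheduler from a config value
    if name == "ReduceLROnPlateau":
        return lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", **kwargs)
    if name == "CosineAnnealing":
        # expects T_max in kwargs
        return lr_scheduler.CosineAnnealingLR(optimizer, **kwargs)
    raise ValueError("Unknown lr scheduler: {}".format(name))

def scheduler_step(scheduler, dev_score=None):
    # adapt the step call: ReduceLROnPlateau steps on the dev metric,
    # the others just step once per evaluation
    if isinstance(scheduler, lr_scheduler.ReduceLROnPlateau):
        scheduler.step(dev_score)
    else:
        scheduler.step()
```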

emanjavacas avatar Dec 06 '20 16:12 emanjavacas

So, a small update with my old branch, regarding Flat(Cosine) (Delay=10, CosineTmax=40, patience=11): I can definitely recommend it. On a corpus of 1.5M tokens (3 times the previous one), it is not only faster, it also scores higher with less deviation:

*(result screenshots)*

PonteIneptique avatar Dec 09 '20 06:12 PonteIneptique

Hey @emanjavacas :) I was quite puzzled by the Ranger results in the first batch of experiments, because I remembered running small trainings and getting better results than with Adam. Then I remembered reading that Ranger needs a higher learning rate to start with, and that I did use a higher one in my preliminary tests. So I did the same with the LASLA corpus, and I got better results (note that my Adam LR is fine-tuned, after close to 100 runs to find the best hyperparameters), with a 10x higher LR than my Adam one:
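For reference, the change boils down to something like this (sketch only, assuming the Ranger implementation from the third-party torch-optimizer package; the base LR here is a placeholder, not my tuned value):

```python
import torch
import torch_optimizer  # third-party package that ships a Ranger implementation

model = torch.nn.Linear(10, 2)  # stand-in for the tagger model
adam_lr = 1e-3  # placeholder for the fine-tuned Adam learning rate

adam = torch.optim.Adam(model.parameters(), lr=adam_lr)
# Ranger gets a 10x higher starting LR than the tuned Adam one
ranger = torch_optimizer.Ranger(model.parameters(), lr=10 * adam_lr)
```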

*(result screenshots)*

PonteIneptique avatar Dec 11 '20 08:12 PonteIneptique

I also found out I have been using CosineAnnealing the wrong way, but it still performs better than Adam: instead of using T_max as the cycle over which the LR follows a cosine curve, I have been using it as a slope (the LR curve is also badly offset; it should be shifted 10 epochs to the right):

*(LR curve screenshot)*
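To illustrate with plain PyTorch (not pie's code): T_max is the half-period of the cosine, so after T_max epochs the LR climbs back up instead of continuing to decrease.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=40)

for epoch in range(80):
    optimizer.step()  # stands in for one training epoch
    scheduler.step()
    if (epoch + 1) % 20 == 0:
        # the LR bottoms out around epoch 40, then rises back towards 1e-3
        print(epoch + 1, scheduler.get_last_lr()[0])
```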

PonteIneptique avatar Dec 11 '20 08:12 PonteIneptique

Coming back with new experiments regarding Ranger vs Adam.

I have been playing with single-task models (which do indeed improve when fine-tuned correctly), and Ranger clearly yields more stable results:

*(results screenshot)*

The second-to-last and the second rows share the same config; only the optimizer changes (without fine-tuning the optimizer hyperparameters).

PonteIneptique avatar Mar 01 '21 16:03 PonteIneptique