Implement five (5) different modes for setting the learning rate
I believe the user should have the following options for the learning rate.
- [ ] Manual: Should be possible to set the learning rate manually.
- [ ] Auto-1dim: Automatic setting of a one-dimensional rate (for speed).
- [ ] Auto-pxdim: Automatic setting of a diagonal p-dimensional rate (for efficiency).
- [ ] Auto-full: Online estimation of the full learning-rate matrix.
- [ ] Auto-QN: Use a quasi-Newton scheme.
- [ ] Averaging: Use iterate averaging.
I suggest we work on 2, 3, and 4 for now; we can add the rest as we go. Any thoughts? A rough sketch of what those automatic modes could look like is below.
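To make the proposal concrete, here is a minimal, hypothetical sketch of modes 2-4 in Python. The AdaGrad-style accumulators, the function name `sgd_fit`, and the `grad(theta, x)` interface are my assumptions for illustration, not the package's API or the wiki's method.

```python
import numpy as np

def sgd_fit(grad, theta0, data, mode="1dim", eps=1e-8):
    """Toy SGD loop illustrating the three automatic learning-rate modes.

    grad(theta, x) should return the gradient of the log-likelihood
    contribution of observation x at theta (so we take ascent steps).
    """
    theta = np.asarray(theta0, dtype=float)
    p = theta.size
    G1 = 0.0               # Auto-1dim: accumulated squared gradient norms
    Gd = np.full(p, eps)   # Auto-pxdim: per-coordinate sums of squares
    Gf = np.eye(p) * eps   # Auto-full: running estimate of n * I(theta)

    for x in data:
        g = grad(theta, x)
        if mode == "1dim":       # one scalar rate shared by all coordinates
            G1 += float(g @ g)
            theta = theta + g / (G1 + eps)
        elif mode == "pxdim":    # one rate per coordinate (diagonal matrix)
            Gd += g ** 2
            theta = theta + g / Gd
        elif mode == "full":     # full matrix: solve Gf d = g each step
            Gf += np.outer(g, g)
            theta = theta + np.linalg.solve(Gf, g)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return theta

# Example: mean of N(2, 1); the score is (x - theta) and I(theta) = 1.
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=5000)
print(sgd_fit(lambda th, x: x - th, np.zeros(1), data, mode="pxdim"))
# should print something close to [2.]
```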
Ye and I just looked at the wiki. Thanks for the new method. We have a few questions about it.
- You mentioned that "α_n and D_n need to approximate the inverse of nI(θ)". We were wondering why the inverse of nI(θ) would be the optimal learning rate.
- I think it would be helpful if you could point me to the literature on the method for approximating the inverse of nI(θ). (BTW, there might be a typo in "Take the inverse-square of all components Gi <- Gi^2": do we take the inverse of Gi before squaring it?)
- Is the iterative method for calculating the learning rate also applicable to the 1-dim learning rate?
Thanks!
Re: the questions.
- It is a theoretical result that if one uses the inverse of nI(θ*), then SGD is optimal (it attains the same asymptotic variance as the MLE); a sketch of the statement is after this list. I just added two papers about this to the "literature" Dropbox folder.
- SGD-QN is the method in http://jmlr.org/papers/volume10/bordes09a/bordes09a.pdf; it approximates the matrix in a BFGS style. Yes, there was a typo: it should be just the square, not the "inverse-square".
- It is, but the method with multiple learning rates will be more efficient. In the experiments we can try simply using the norms of the gradients of the log-likelihood as a 1-dim learning rate; a minimal version of that is sketched below as well.
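For reference, here is my reading of the optimality claim in the first bullet, written out as a standard asymptotics statement (my paraphrase, not a quote from the papers):

```latex
% SGD with the log-likelihood score and the "optimal" rate matrix
\theta_n = \theta_{n-1} + D_n \,\nabla_\theta \log f(x_n;\, \theta_{n-1}),
\qquad D_n = \bigl(n\, I(\theta^*)\bigr)^{-1},
% has the same limit distribution as the MLE:
\sqrt{n}\,\bigl(\theta_n - \theta^*\bigr) \;\xrightarrow{d}\;
\mathcal{N}\bigl(0,\; I(\theta^*)^{-1}\bigr).
% Any other limiting matrix gives an asymptotic variance that is at
% least as large in the positive semidefinite order, hence "optimal".
```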
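And a minimal sketch of the norm-based 1-dim rate for the experiment (the helper name and interface are hypothetical; g is a NumPy gradient vector):

```python
def one_dim_rate_step(theta, g, G):
    """One SGD step with the scalar rate a_n = 1 / sum_k ||g_k||^2,
    where g is the current log-likelihood gradient and G carries the
    accumulated squared norms (initialize G with a small constant,
    e.g. 1e-8, to avoid dividing by zero on the first step)."""
    G += float(g @ g)        # accumulate ||g_n||^2
    return theta + g / G, G  # step with a_n = 1 / G

# usage inside a data loop:
#   theta, G = one_dim_rate_step(theta, grad(theta, x), G)
```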