Implement five (5) different modes for setting the learning rate
I believe the user should have the following options for the learning rate.
- [ ] Manual: Should be possible to set the learning rate manually.
- [ ] Auto-1dim: Automatic setting of a one-dimensional rate (for speed).
- [ ] Auto-pxdim: Automatic setting of a diagonal p-dimensional rate (for efficiency).
- [ ] Auto-full: Online estimation of the full learning-rate matrix.
- [ ] Auto-QN: Use a quasi-Newton scheme.
- [ ] Averaging: Use iterate averaging.
I suggest we work on 2, 3, and 4 for now; we can add the rest as we go. Any thoughts? A rough sketch of what those automatic modes could look like is below.
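To make the proposal concrete, here is a minimal, hypothetical sketch of modes 2-4 in Python. The AdaGrad-style accumulators, the function name `sgd_fit`, and the `grad(theta, x)` interface are my assumptions for illustration, not the package's API or the wiki's method.

```python
import numpy as np

def sgd_fit(grad, theta0, data, mode="1dim", eps=1e-8):
    """Toy SGD loop illustrating the three automatic learning-rate modes.

    grad(theta, x) should return the gradient of the log-likelihood
    contribution of observation x at theta (so we take ascent steps).
    """
    theta = np.asarray(theta0, dtype=float)
    p = theta.size
    G1 = 0.0               # Auto-1dim: accumulated squared gradient norms
    Gd = np.full(p, eps)   # Auto-pxdim: per-coordinate sums of squares
    Gf = np.eye(p) * eps   # Auto-full: running estimate of n * I(theta)

    for x in data:
        g = grad(theta, x)
        if mode == "1dim":       # one scalar rate shared by all coordinates
            G1 += float(g @ g)
            theta = theta + g / (G1 + eps)
        elif mode == "pxdim":    # one rate per coordinate (diagonal matrix)
            Gd += g ** 2
            theta = theta + g / Gd
        elif mode == "full":     # full matrix: solve Gf d = g each step
            Gf += np.outer(g, g)
            theta = theta + np.linalg.solve(Gf, g)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return theta

# Example: mean of N(2, 1); the score is (x - theta) and I(theta) = 1.
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=5000)
print(sgd_fit(lambda th, x: x - th, np.zeros(1), data, mode="pxdim"))
# should print something close to [2.]
```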
Ye and I just looked at the wiki. Thanks for the new method. We have a few questions about it.
- You mentioned that "α_n and D_n need to approximate the inverse of nI(θ)". We were wondering why the inverse of nI(θ) would be the optimal learning rate.
- I think it would be helpful if you could point me to the literature on the method for approximating the inverse of nI(θ). (BTW, there might be a typo in "Take the inverse-square of all components Gi <- Gi^2": do we take the inverse of Gi before squaring it?)
- Is the iterative method for calculating the learning rate also applicable to the 1-dim learning rate?
Thanks!
Re: the questions.
- It is a theoretical result that if one uses the inverse of nI(θ*), then SGD is optimal (it attains the same asymptotic variance as the MLE); a sketch of the statement is after this list. I just added two papers about this to the "literature" Dropbox folder.
- SGD-QN is the method in http://jmlr.org/papers/volume10/bordes09a/bordes09a.pdf; it approximates the matrix in a BFGS style. Yes, there was a typo: it should be just the square, not the "inverse-square".
- It is, but the method with multiple learning rates will be more efficient. In the experiments we can try simply using the norms of the gradients of the log-likelihood as a 1-dim learning rate; a minimal version of that is sketched below as well.
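For reference, here is my reading of the optimality claim in the first bullet, written out as a standard asymptotics statement (my paraphrase, not a quote from the papers):

```latex
% SGD with the log-likelihood score and the "optimal" rate matrix
\theta_n = \theta_{n-1} + D_n \,\nabla_\theta \log f(x_n;\, \theta_{n-1}),
\qquad D_n = \bigl(n\, I(\theta^*)\bigr)^{-1},
% has the same limit distribution as the MLE:
\sqrt{n}\,\bigl(\theta_n - \theta^*\bigr) \;\xrightarrow{d}\;
\mathcal{N}\bigl(0,\; I(\theta^*)^{-1}\bigr).
% Any other limiting matrix gives an asymptotic variance that is at
% least as large in the positive semidefinite order, hence "optimal".
```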
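And a minimal sketch of the norm-based 1-dim rate for the experiment (the helper name and interface are hypothetical; g is a NumPy gradient vector):

```python
def one_dim_rate_step(theta, g, G):
    """One SGD step with the scalar rate a_n = 1 / sum_k ||g_k||^2,
    where g is the current log-likelihood gradient and G carries the
    accumulated squared norms (initialize G with a small constant,
    e.g. 1e-8, to avoid dividing by zero on the first step)."""
    G += float(g @ g)        # accumulate ||g_n||^2
    return theta + g / G, G  # step with a_n = 1 / G

# usage inside a data loop:
#   theta, G = one_dim_rate_step(theta, grad(theta, x), G)
```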