pSGLD
Why is the noise scaled by Ntrain in RMSprop?
In SGLD_RMSprop.m the noise is scaled by opts.N, which is set to Ntrain in the DNN experiments:
https://github.com/ChunyuanLI/pSGLD/blob/master/pSGLD_DNN/algorithms/SGLD_RMSprop.m#L51
Why is this the case? In the paper (https://arxiv.org/pdf/1512.07666v1.pdf) there is no such scaling.
I also checked SGLD_Adagrad.m, and there the noise is not scaled by Ntrain.
The scaling of the noise is for faster convergence in practice; otherwise, we would need to train the model for a long time, according to the theory.
Is the choice of Ntrain for the scaling arbitrary? Or do you think it will work in general for almost any dataset?
Ntrain is the number of data points in the training dataset.
Yes, but you could use other values for the scaling, such as a constant (e.g., 100) or the batch size. So my question is whether you expect Ntrain to be a good choice in practice that will work well for almost any dataset, or whether we should try several values for the scaling and choose the best one.
I expect that Ntrain is a good choice in practice.
The "grad" is mean of the gradients computed in the mini-batch. We should use opts.N*grad to approximate the true gradient of the full dataset.
Instead, we consider the scaling issue in the stepsize "lr", and come to the update as following:
grad = lr*grad./pcder + sqrt(2*lr./pcder/opts.N).*randn(size(grad));
However, this would take a long time to converge. In practice, I recommend:
grad = lr*grad./pcder + sqrt(2*lr./pcder).*randn(size(grad))/opts.N;
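To make the discussion concrete, here is a minimal, self-contained sketch of one such preconditioned update loop on a toy 1-D Gaussian model. It is not the repository code; the variable names (theta, V, batchSize, the toy data x) and hyperparameter values are illustrative assumptions only.
rng(0);
Ntrain    = 1000;                 % number of training points (opts.N in the repo)
x         = randn(Ntrain, 1) + 2; % toy 1-D data
theta     = 0;                    % parameter being sampled
lr        = 1e-3;                 % stepsize ("lr")
lambda    = 1e-5;                 % damping added to the preconditioner
alpha     = 0.99;                 % RMSprop decay
V         = 0;                    % running average of squared gradients
batchSize = 100;
for t = 1:500
    idx  = randi(Ntrain, batchSize, 1);
    % mean per-example gradient of the negative log-likelihood on the mini-batch;
    % opts.N*grad would approximate the gradient over the full training set
    grad = mean(theta - x(idx));
    % RMSprop preconditioner
    V     = alpha*V + (1 - alpha)*grad.^2;
    pcder = lambda + sqrt(V);
    % Theoretically scaled noise (slow to converge in practice):
    % delta = lr*grad./pcder + sqrt(2*lr./pcder/Ntrain).*randn(size(grad));
    % Practical update from the thread: injected noise divided by Ntrain
    delta = lr*grad./pcder + sqrt(2*lr./pcder).*randn(size(grad))/Ntrain;
    theta = theta - delta;
end
disp(theta)   % drifts toward the data mean (about 2)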
Thank you for the explanation. I saw that SGLD.m also uses the same scaling by opts.N. So, in your experience, does the same slow convergence hold true for the SGLD method too?
Yes, the same slow convergence holds for SGLD as well.
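For completeness, here is a sketch of the corresponding plain SGLD step with the same practical 1/Ntrain scaling of the injected noise, reusing the toy Gaussian setup above. This is an illustrative assumption, not the repository's SGLD.m.
Ntrain    = 1000;
x         = randn(Ntrain, 1) + 2;   % same toy 1-D data
theta     = 0;
lr        = 1e-3;
batchSize = 100;
for t = 1:500
    idx   = randi(Ntrain, batchSize, 1);
    grad  = mean(theta - x(idx));                        % mini-batch mean gradient
    theta = theta - (lr*grad + sqrt(2*lr)*randn/Ntrain); % injected noise divided by Ntrain
end
disp(theta)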