
Why is the noise scaled by Ntrain in RMSProp?

Open jaak-s opened this issue 7 years ago • 7 comments

In SGLD_RMSprop.m the noise is scaled by opts.N, which is set to Ntrain in the DNN experiments: https://github.com/ChunyuanLI/pSGLD/blob/master/pSGLD_DNN/algorithms/SGLD_RMSprop.m#L51

Why is this the case? In the paper (https://arxiv.org/pdf/1512.07666v1.pdf) there is no such scaling.

I also checked SGLD_Adagrad.m and there is no scaling by Ntrain for the noise.

jaak-s avatar Mar 05 '17 00:03 jaak-s

The noise scaling is there for faster convergence in practice; otherwise, the model would need to be trained for a very long time to match the theory.

ChunyuanLI avatar Mar 07 '17 19:03 ChunyuanLI

Is the choice of Ntrain for the scaling arbitrary? Or do you think it will work in general for almost any dataset?

jaak-s avatar Mar 07 '17 22:03 jaak-s

Ntrain is the number of data points in the training dataset.

ChunyuanLI avatar Mar 07 '17 23:03 ChunyuanLI

Yes, but other values could be used for the scaling, such as a constant (e.g. 100) or the batch size. So my question is whether you expect Ntrain to be a good choice in practice that will work well for almost any dataset, or whether we should try several scaling values and choose the best one.

jaak-s avatar Mar 07 '17 23:03 jaak-s

I expect that Ntrain is a good choice in practice.

The "grad" is mean of the gradients computed in the mini-batch. We should use opts.N*grad to approximate the true gradient of the full dataset.

Instead, we absorb this scaling into the stepsize "lr", which leads to the following update:

grad = lr* grad ./ pcder + sqrt(2*lr./pcder/opts.N).*randn(size(grad)) ;

However, this would take a long time to converge. In practice, I recommend:

grad = lr* grad ./ pcder + sqrt(2*lr./pcder).*randn(size(grad))/opts.N ;
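
For concreteness, here is a minimal self-contained sketch of this recommended update on a toy 1-D Gaussian example. This is not the code from the repo; apart from lr, pcder and opts.N, all names and constants below are made up for illustration, and the prior is ignored.

opts.N    = 1000;                     % training-set size (Ntrain)
batchsize = 10;
lr        = 1e-3;                     % stepsize
alpha     = 0.99;  eps0 = 1e-5;       % assumed RMSprop constants
data      = randn(opts.N, 1);         % toy data ~ N(0, 1)
theta     = 0;  V = 0;                % parameter and RMSprop accumulator
for t = 1:5000
    idx   = randi(opts.N, batchsize, 1);
    grad  = mean(theta - data(idx));           % mean mini-batch gradient of the loss, the "grad" above
    V     = alpha*V + (1 - alpha)*grad.^2;     % RMSprop accumulator
    pcder = sqrt(V) + eps0;                    % preconditioner, the "pcder" above
    % theory-faithful noise level (the first update above):
    %   step = lr*grad./pcder + sqrt(2*lr./pcder/opts.N).*randn(size(grad));
    % practical variant with down-scaled noise (the update recommended above):
    step  = lr*grad./pcder + sqrt(2*lr./pcder).*randn(size(grad))/opts.N;
    theta = theta - step;                      % move downhill on the loss
end

Note that the injected noise in the practical variant is smaller than in the theory-faithful one by a factor of sqrt(opts.N), so the update behaves closer to preconditioned SGD and converges faster in practice.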

ChunyuanLI avatar Mar 08 '17 01:03 ChunyuanLI

Thank you for the explanation. I saw that SGLD.m also uses the same scaling by opts.N. So in your experience, does the same slow convergence hold for the SGLD method too?

jaak-s avatar Mar 08 '17 09:03 jaak-s

Yes, the same convergence behavior holds for SGLD as well.
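
Presumably the corresponding line in SGLD.m (no preconditioner) looks roughly like the following; this is an inference from the discussion above, not a quote of that file:

grad = lr*grad + sqrt(2*lr).*randn(size(grad))/opts.N ;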

ChunyuanLI avatar Mar 08 '17 14:03 ChunyuanLI