Proximal operator for the entropy of location-scale families
This PR adds the proximal operator for the entropy of location-scale families, ProximalLocationScaleEntropy, which was proposed by J. Domke[^D2020] and later analyzed theoretically and empirically by J. Domke and myself[^DGG2023][^KJWMG2023].
The point of using a proximal operator is to guarantee that the scale matrix never becomes singular, which addresses the limitations of the projection operator (ClipScale). Mainly, ClipScale requires an explicit lower bound on the posterior variance, and any choice of that bound is arbitrary. Even then, if the lower bound is too loose, the algorithm may be unstable depending on the initialization and the stepsize. In fact, when I experimented with the parameter-free optimization algorithms currently provided by AdvancedVI, DoG and DoWG tended to be very aggressive in terms of stepsize, and ClipScale showed instabilities.
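For reference, here is a short derivation of why the proximal step keeps the scale non-singular. It is a sketch under assumptions that are not spelled out above: the scale matrix $C$ is a lower-triangular factor, $h$ denotes the negative entropy up to an additive constant, and $\gamma > 0$ is a scalar stepsize. The symbols $C$, $X$, $h$, $c$, $x$, and $\gamma$ are notation chosen here for illustration.

```latex
\begin{align}
h(C) &= -\sum_{i} \log C_{ii}, \\
\operatorname{prox}_{\gamma h}(C)
  &= \operatorname*{arg\,min}_{X}
     \left\{ h(X) + \frac{1}{2\gamma} \lVert X - C \rVert_F^2 \right\}.
\end{align}
% The problem separates over entries. Off-diagonal entries are left unchanged.
% For a diagonal entry c = C_{ii}, the stationarity condition in x = X_{ii} is
\begin{align}
-\frac{\gamma}{x} + (x - c) = 0
\;\Longleftrightarrow\;
x^2 - c\,x - \gamma = 0
\;\Longrightarrow\;
x = \frac{c + \sqrt{c^2 + 4\gamma}}{2} > 0 .
\end{align}
```

Since $\sqrt{c^2 + 4\gamma} > |c|$ whenever $\gamma > 0$, the updated diagonal entry is strictly positive for any real $c$, without having to pick a lower bound.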
In the context of Turing, the combination of ProximalLocationScaleEntropy and DoWG or DoG should provide a robust, tuning-free default setting for variational inference. (This is why I am working on this before the Turing integration.)
Proximal operators depend on the internals of the optimization algorithm in use. This is fairly straightforward for algorithms that reduce everything to a scalar stepsize, like DoG and DoWG. For algorithms that use a vector-valued stepsize, things are less straightforward.
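To make the scalar-stepsize case concrete, here is a minimal sketch in plain Python (not the AdvancedVI implementation). It assumes the closed-form proximal step for a diagonal scale entry `c` with scalar stepsize `gamma` is `(c + sqrt(c^2 + 4*gamma)) / 2`, which follows from minimizing `-gamma*log(x) + (x - c)^2 / 2`. The function names, the gradient values, and the clipping bound are all hypothetical and chosen only for illustration.

```python
import math


def gradient_step(diag, grad, stepsize):
    """Plain gradient step on the diagonal scale entries (energy term only)."""
    return [c - stepsize * g for c, g in zip(diag, grad)]


def prox_entropy_diagonal(diag, stepsize):
    """Closed-form proximal step for -sum(log(c)); always returns positive entries."""
    return [(c + math.sqrt(c * c + 4.0 * stepsize)) / 2.0 for c in diag]


def clip_diagonal(diag, lower_bound):
    """Projection-style alternative: clamp entries to a user-chosen lower bound."""
    return [max(c, lower_bound) for c in diag]


# Hypothetical numbers: an aggressive stepsize pushes one entry below zero.
diag = [0.5, 1.0]
grad = [2.0, -1.0]
stepsize = 1.0

after_step = gradient_step(diag, grad, stepsize)        # first entry is negative
proximal = prox_entropy_diagonal(after_step, stepsize)  # strictly positive again
clipped = clip_diagonal(after_step, 1e-5)               # pinned to the chosen bound

print(after_step, proximal, clipped)
```

The same scalar `stepsize` is used for both the gradient step and the proximal step, which is why optimizers that expose a single scalar stepsize fit this scheme directly. With a per-coordinate stepsize, the proximal step would have to be redefined with respect to that scaling, which this sketch does not attempt.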
[^D2020]: Domke, Justin. "Provable smoothness guarantees for black-box variational inference." International Conference on Machine Learning. PMLR, 2020.
[^DGG2023]: Domke, Justin, Robert Gower, and Guillaume Garrigos. "Provable convergence guarantees for black-box variational inference." Advances in Neural Information Processing Systems 36 (2023): 66289-66327.
[^KJWMG2023]: Kim, Kyurae, et al. "On the convergence of black-box variational inference." Advances in Neural Information Processing Systems 36 (2023): 44615-44657.
Ah, sorry for the slow response on this! I'll take a look as soon as I get some free time (probably Wednesday).
@sunxd3 @mhauru @yebai Could we move this forward?
Oops, sorry for forgetting about this. I'll take a look tomorrow morning.
Small technical question: am I reading it correctly that AdvancedVI right now uses the linear parametrization?
> Small technical question: am I reading it correctly that AdvancedVI right now uses the linear parametrization?
Yes, the default settings do, hence the involvement of ClipScale or ProximalLocationScaleEntropy. Users could still implement their own nonlinearly parameterized location-scale families if they wish.
Hmmm... seems like mapreduce with Zygote is broken again.