Add Laplace approximation as algorithm
This is a proposal for a Laplace approximation algorithm to Stan.
Rendered output is here
Edit: I just totally changed the text here. A lot has changed, since the algorithm I initially put up was just wrong.
@jgabry can you look through this?
Now this just mirrors the Rstan way of doing this. The prototype was relatively easy to implement.
If someone can approve that this is vaguely the right way to go (@jgabry, @avehtari or anyone), then I'll talk to people about interfaces and how this should go in.
Prototype code is there and can be run.
I now reviewed only the design. After it's fixed, I can review code, too.
It sounds the same. The code is basically from
https://github.com/stan-dev/rstan/blob/develop/rstan/rstan/R/stanmodel-class.R#L446
onwards. The main thing is that it is better to finite-diff the autodiffed gradient than to finite-diff the log_prob function twice.
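For what it's worth, the two approaches are easy to compare in a toy numpy sketch (the `log_prob` and `grad` here are hypothetical stand-ins for Stan's log density and its autodiffed gradient, not the stan-dev/math code):

```python
import numpy as np

def log_prob(x):
    # Toy target: standard-normal log density; its exact Hessian is -I.
    return -0.5 * np.dot(x, x)

def grad(x):
    # Analytic gradient, standing in for the autodiffed gradient.
    return -x

def hessian_from_grad(g, x, eps=1e-5):
    # Central finite differences of the gradient: one level of
    # truncation error, O(n) gradient evaluations.
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        H[:, i] = (g(x + e) - g(x - e)) / (2 * eps)
    return 0.5 * (H + H.T)  # symmetrize

def hessian_double_fd(f, x, eps=1e-4):
    # Finite-differencing the log density twice: O(n^2) evaluations
    # and an extra level of truncation/roundoff error in general.
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

x0 = np.array([0.3, -1.2])
H_grad = hessian_from_grad(grad, x0)   # ~ -I
H_dbl = hessian_double_fd(log_prob, x0)  # ~ -I here too (quadratic toy)
```

On this quadratic toy both come out accurate; the difference shows up on real models, where the double-FD version is the one that goes indeterminate first.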
On Mon, Mar 16, 2020 at 4:01 PM Ben Bales [email protected] wrote:
@bbbales2 commented on this pull request.
In designs/0016-hessian_optimize_cmdstan.md https://github.com/stan-dev/design-docs/pull/16#discussion_r393279345:
+Providing draws instead of the Laplace approximation itself is rather inefficient, but it is the easiest thing to code.
+
+We also have to deal with possibly singular Hessians. This is why I also added the laplace_diag_shift to overcome these. They'll probably be quite common, especially with Hessians computed with finite differences.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+Another design would be to print the Hessian on the unconstrained space and let users handle the sampling and the parameter transformation. The issue here is that there is no good way for users to do these parameter transformations outside of certain interfaces (at least RStan, maybe PyStan).
+
+Another design would be to print a Hessian on the constrained space and let users handle the sampling. In this case users would also be expected to handle the constraints, and I don't know how that would work practically (rejection sampling, maybe?).
+
+# Prior art
+[prior-art]: #prior-art
+
+RStan does a version of this already.
I should check that it's actually the same. @bgoodri https://github.com/bgoodri is this actually the same?
In terms of the right way to go, autodiff is better (for both speed and precision) where it's supported, but it doesn't exist for some of our higher-order solvers like ODEs.
@avehtari this should be available to test here:
git clone --recursive --branch=feature/design-doc-16-optimize-hessian https://github.com/stan-dev/cmdstan.git cmdstan-hessian
Currently I only implemented stuff for the lbfgs optimizer. Could you give this a go and check me for statistical correctness in terms of what I'm spitting out?
There are some software design issues:
- Need to share calculations between all the different optimizers (lbfgs, bfgs, and Newton)
- If we get errors on log_p evaluation, what to do
- If we get errors on generated quantities evaluation, what to do
- Do we print the Laplace draws in the output file? If we do, we have a csv with two different formats of output. The optimization output has fewer columns than the sampling output, and they're also different things.
I'd prefer to print a warning and skip the sample for 2 and 3 instead of blowing up. In that case, if you request N posterior draws you might only get M < N actual draws. Blowing up is bad because then you have to redo the whole optimization (and things might still blow up). I think if we do rejection sampling to make sure we get N posterior draws, we'd be in danger of people writing models that never finish.
Edit: I guess options for 4 are:
a. Leave the different outputs in the same file
b. Get rid of the actual MAP estimate; just print samples from the Laplace approximation
c. Make a laplace_file argument and put the draws in there
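The warn-and-skip behavior for 2 and 3 would look roughly like this numpy sketch (all names hypothetical, not the actual cmdstan code):

```python
import numpy as np
import warnings

def constrain_and_eval(draw):
    # Hypothetical stand-in for constraining a draw and evaluating
    # log_prob / generated quantities; raises when evaluation fails.
    if draw[0] > 2.0:  # pretend evaluation errors out here
        raise ValueError("log_prob evaluation failed")
    return draw

def laplace_draws(mode, chol_cov, n_requested, seed=0):
    # Draw from N(mode, Sigma); warn and skip draws whose evaluation
    # errors, so we may return M < n_requested draws instead of aborting
    # the whole run.
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_requested):
        z = rng.standard_normal(mode.size)
        candidate = mode + chol_cov @ z
        try:
            draws.append(constrain_and_eval(candidate))
        except ValueError as e:
            warnings.warn(f"Skipping draw: {e}")
    return np.array(draws)

draws = laplace_draws(np.zeros(2), np.eye(2), 1000)
print(len(draws))  # M <= 1000
```

No rejection loop, so a model whose evaluations keep failing still terminates; you just get fewer draws back.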
@bgoodri I'm currently just too lazy to write and test the hessian_auto equivalent that uses vars. If someone wants to yell at me I guess I'd get unlazy and go do it.
@bob-carpenter I think he was just saying finite diff reverse mode to get the Hessian. That makes a lot of sense, but I just wanted to lean on what was available in stan-dev/math instead of writing my own.
You are going to have more indeterminate Hessians if you use finite differences twice.
Sigh, I guess I'll eat my vegetables. stan/math/rev/functor/finite_diff_hessian_auto.hpp or something incoming, but we should still be able to talk through the other issues before I get this done.
Do we print the laplace draws in the output file? If we do we have a csv with two different formats of outputs. The optimization output has less columns than the sampling output, and they also are different things.
Change the output for optimization also when there are no approximate draws? If I remember correctly, ADVI has the modal value as the first row and, if the number of draws is > 0, the rest of the rows are draws. optimizing could follow that.
I believe that the discussion is missing a critical point -- the current optimizer behavior is not appropriate for computing posterior Laplace approximations. The RStan reference implementation has never been correct.
Remember that the current optimizer doesn't include Jacobian corrections, so instead of computing a maximum of the posterior density it's actually computing the maximum of a penalized likelihood function. The Hessian at this penalized maximum likelihood estimate could be used for defining confidence intervals (although because of the penalty functions they wouldn't have clear coverage properties), but the samples wouldn't have any meaning in this context.
In order to compute a Laplace approximation the optimizer would have to run with the Jacobian corrections turned on so that the maximum posterior density and the Hessian of the right posterior density is computed. This is sufficiently different behavior that it would at the very least warrant its own method rather than trying to force it into an inappropriate penalized maximum likelihood context.
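The effect of the Jacobian term on the optimum is easy to see in a toy numpy sketch (an unnormalized Gamma density standing in for a posterior over a positive parameter; not Stan code):

```python
import numpy as np

# Toy target on the constrained space sigma > 0: an unnormalized
# Gamma(a, b) density (hypothetical stand-in for a posterior).
a, b = 3.0, 2.0
def log_p(sigma):
    return (a - 1) * np.log(sigma) - b * sigma  # up to a constant

# Optimize on the unconstrained scale x = log(sigma) by grid search.
x = np.linspace(-5.0, 5.0, 200001)

# Without the Jacobian term (what the current optimizer does): the
# optimum maps back to the constrained-space mode, (a - 1) / b.
mode_no_jac = np.exp(x[np.argmax(log_p(np.exp(x)))])

# With the Jacobian term log|d sigma / d x| = x: the mode of the
# posterior density on the unconstrained space, sigma = a / b.
mode_jac = np.exp(x[np.argmax(log_p(np.exp(x)) + x)])

print(mode_no_jac)  # ~ (a - 1) / b = 1.0
print(mode_jac)     # ~ a / b = 1.5
```

The two optima differ, so a Hessian-based approximation built at the no-Jacobian optimum is centered on the wrong point for a posterior Laplace approximation.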
maximum of the posterior density it's actually computing the maximum of a penalized likelihood function.
Can you say this in terms of unconstrained and constrained space? We want the mode in the unconstrained space. Do we get that or something else (I've seen the explanation before, but keep forgetting enough to be uncertain)?
but the samples wouldn't have any meaning in this context.
When using importance sampling they have meaning of being draws from a proposal distribution. It can be that this is not the optimal proposal distribution, but they have meaning and diagnostic when that proposal distribution is bad. Of course we would like to have the best proposal distribution.
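A minimal numpy sketch of that use: draws from a deliberately offset normal proposal (standing in for an imperfect Laplace approximation), reweighted by self-normalized importance sampling:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy unnormalized target: N(1, 1). The proposal N(0, 1) is
# deliberately offset, mimicking an imperfect Laplace approximation.
def log_p(x):
    return -0.5 * (x - 1.0) ** 2

draws = rng.normal(0.0, 1.0, size=20000)
log_q = -0.5 * draws**2

# Self-normalized importance weights: even with a poor proposal the
# draws remain meaningful as proposal draws, and the weight
# distribution itself diagnoses how bad the proposal is.
log_w = log_p(draws) - log_q
w = np.exp(log_w - log_w.max())
w /= w.sum()

ess = 1.0 / np.sum(w**2)   # effective sample size, a crude diagnostic
est = np.sum(w * draws)    # importance-sampling estimate of E[x], ~1
print(est)
print(ess)                 # well below 20000 when the proposal is off
```

The PSIS khat diagnostic refines this idea by looking at the tail of the weight distribution rather than just the effective sample size.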
maximum of the posterior density it's actually computing the maximum of a penalized likelihood function.
Can you say this in terms of unconstrained and constrained space? We want the mode in the unconstrained space. Do we get that or something else (I've seen the explanation before, but keep forgetting enough to be uncertain)?
The current code computes the mode on the constrained space, not the unconstrained space.
but the samples wouldn't have any meaning in this context.
When using importance sampling they have meaning of being draws from a proposal distribution. It can be that this is not the optimal proposal distribution, but they have meaning and diagnostic when that proposal distribution is bad. Of course we would like to have the best proposal distribution.
Sure, but is that worth changing the output of optimizing and confusing users as to its meaning? I think that if samples are included then many users will use them as if they were exact samples or samples from the Laplace approximation and not weight things carefully enough.
To be clear I have no problem with adding this functionality, I just think that it would be much clearer to have a new route with this functionality that wraps the internal optimizer and adds all of this post processing. That way the current optimizer can stay as it is and the new route could turn on the Jacobian corrections as needed.
To put it another way, we do the optimization on the unconstrained space, but turn off the Jacobian adjustment for the change of variables. That's easy enough to turn on from the model class---it's just a template parameter. So maybe we need another flag on the service interface for optimization for +/- Jacobian.
I'm also with @betanalpha that we don't want to change current behavior, especially by automatically returning something much larger.
The current code computes the mode on the constrained space, not the unconstrained space.
I wasn't thinking about that. Good catch. I guess these are just different functions with and without the adjustments.
I just think that it would be much clearer to have a new route with this functionality that wraps the internal optimizer and adds all of this post processing.
Sounds like this would be better suited as a new algorithm? That way there's no point estimate/sample output confusion either -- the Laplace algorithm only samples from a Laplace approximation to the posterior.
So we'd have (1) MLE without Jacobian correction, with std errors from the Hessian on the unconstrained scale translated back to constrained, and (2) Laplace approximation using Jacobian correction, with a set of draws returned.
It'd be nice to retrieve the parameters to the normal for the Laplace, as it'd let you generate more draws in the future. But that's very much a secondary consideration compared to getting the draws out in our standard interface.
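For (1), translating standard errors back to the constrained scale is just the delta method. A sketch with made-up numbers for a positive parameter sigma = exp(x):

```python
import numpy as np

# Hypothetical unconstrained-scale results for a positive parameter
# sigma = exp(x): mode x_hat and Hessian H of the log density at x_hat.
x_hat = 0.4
H = -25.0

# Standard error on the unconstrained scale from the negative inverse
# Hessian, then translated back via the delta method:
# se(sigma) ~ |d sigma / d x| * se(x) = exp(x_hat) * se(x).
se_x = np.sqrt(-1.0 / H)
se_sigma = np.exp(x_hat) * se_x

print(np.exp(x_hat))  # point estimate on the constrained scale
print(se_sigma)
```

The same first-order translation applies entry-wise for a multivariate Hessian, using the Jacobian of the constraining transform.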
Laplace approximation using Jacobian correction
I think we'd only do this, however we end up doing it interface-wise. At least, this is what I meant to do so that is my goal
The current code computes the mode on the constrained space, not the unconstrained space.
Thanks Mike, now I remember we have discussed this same issue several times before! It seems I just keep forgetting it because the mode and Hessian in the unconstrained space are so much more natural for me. Great if we can now also have the mode and Hessian in the unconstrained space.
This popped up over here: https://github.com/stan-dev/pystan/issues/694
I still need it for my own purposes too (currently just using a version of cmdstan with this hacked in).
Would a new algorithm be appropriate? So algorithms would be sampling, optimization, variational, laplace? If so I'll rewrite the design doc with that in mind and we can iterate on it.
Makes sense to me.
@avehtari @betanalpha @bgoodri this is ready for comments again.
I rewrote it as a new algorithm since the first thing was just wrong. I think the differences @betanalpha pointed out between what we want to do here and optimization are enough that I kinda don't want to put them together, even if the gears are basically the same. I dunno. Opinions welcome.
It's kinda awkward that it'll share all the same parameters as optimization but be a different algorithm, but I guess that's just interface overhead.
I think @bgoodri is right and this should be written with finite differences of gradients instead of double finite differences of the log density. I was working with this code hacked together on another project and this was important (and not easily diagnosed). It makes me want to do full second-order autodiff, but we can't with the higher-order algorithms.
Speaking of Hessians, we should also change the testing framework to use finite diffs of gradients rather than double finite diffs for Hessians. I think it could improve the tolerances dramatically.
how to handle errors evaluating the log density
why not then output NA, -Inf, or Inf based on what the log density evaluation returns?
An alternative that seems appealing at first is printing out the mode and hessian so that it would be possible for people to make their own posterior draws.
This would be useful if people want more draws without needing to run the optimizer and Hessian computation again, but it's different behavior and would need to be considered together with the variational output.
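For reference, generating draws from a printed mode and Hessian is only a few lines (hypothetical helper; the diagonal shift mirrors the proposed laplace_diag_shift):

```python
import numpy as np

def draws_from_mode_hessian(mode, H, n, shift=0.0, seed=0):
    # Sample from the Laplace approximation N(mode, -H^{-1}) given a
    # printed mode and (unconstrained-space) Hessian H. A small
    # diagonal shift can regularize a near-singular Hessian.
    rng = np.random.default_rng(seed)
    neg_H = -(H - shift * np.eye(H.shape[0]))
    L = np.linalg.cholesky(neg_H)          # neg_H = L L^T
    z = rng.standard_normal((n, mode.size))
    # Solve L^T u = z so that Cov(u) = (L L^T)^{-1} = -H^{-1}.
    u = np.linalg.solve(L.T, z.T).T
    return mode + u

mode = np.array([1.0, -0.5])
H = np.array([[-4.0, 1.0], [1.0, -2.0]])
draws = draws_from_mode_hessian(mode, H, 50000)
print(draws.mean(axis=0))  # ~ mode
print(np.cov(draws.T))     # ~ -inv(H)
```

The catch, as noted above, is that the mode and Hessian are on the unconstrained space, so users would still need the constraining transforms to get constrained-space draws.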
how to handle errors evaluating the log density
why not then output NA, -Inf, or Inf based on what the log density evaluation returns?
Errors raise exceptions, so there's not a return value. We can also have NaN propagate through (there's no NA in Stan like in R, but presumably we're talking about not-a-number).
why not then output NA, -Inf, or Inf based on what the log density evaluation returns?
I don't think there's a way to write these to csv files. Maybe we already have a convention for it though?
This would be useful if people want more draws without need to run the optimizer and Hessian computation again
It's a drawback. Getting draws is somewhat expensive since we evaluate the model log density, so there's that.
Lookin' for some feeeedback. Otherwise I'll just do it.
The actual new code for this won't be much, but it'll be a lot of interface bits.
I just want to clarify that this design doc does not address the computational validation needed for an algorithm to be considered for inclusion in the Stan Algorithm library.
On Apr 30, 2020, at 2:44 AM, Aki Vehtari [email protected] wrote:
@avehtari approved this pull request.
@betanalpha --- what would you like to see in the way of validation for something like this?
The current policy is at https://github.com/stan-dev/stan/wiki/Proposing-Algorithms-for-Inclusion-Into-Stan.
I had thought this was more like a fix for the previous algorithm, but I also now realize the design document does not mention diagnostics. The Laplace algorithm itself is not novel, and there is extensive literature demonstrating when it works and when it fails. The PSIS paper (https://arxiv.org/abs/1507.02646) describes the diagnostics we can use, in the same way as we have diagnostics for MCMC. We can compute Monte Carlo standard errors and diagnose when the Laplace approximation fails and when MCSEs are not reliable.
It’s more complicated than that. The Laplace algorithm has never been in Stan. We started with dynamic HMC for estimating posterior expectation values and penalized maximum likelihood for computing point estimates. ADVI was shoehorned in later somewhat awkwardly as another method for (poorly) estimating posterior expectation values.
We’ve never had a MAP algorithm or Laplace or anything else that claims to do posterior expectation value estimation. The RStan developers added the Hessian hack on top of the penalized maximum likelihood algorithm but that was a unilateral decision of the RStan developer and never part of an official algorithm discussion. Plus, as previously discussed, it was fundamentally flawed as it was based on the wrong optimum.
There have been numerous MCMC algorithms developed in Stan that were never exposed because they were too fragile (random walk, ensemble, etc.).

[…] for typical user models but would work fine for the same models on which the Laplace approximation is reasonably accurate. The precedent, other than the messy political situation of ADVI, has been to not include any algorithms that aren't reasonably robust over typical user models. Diagnostics are necessary but also not sufficient; adding fragile algorithms that work for small classes of models only confuses the Stan user experience. This is discussed further in https://github.com/stan-dev/stan/wiki/Proposing-Algorithms-for-Inclusion-Into-Stan.
I agree that there is plenty of literature on the Laplace algorithm, but I disagree that there is solid theory backing up the robustness of the method in the preasymptotic, mostly non-log concave regime where typical user models live. Without that relevant theory we’ll need substantial empirical evidence for consideration of inclusion.