Add Laplace approximation as algorithm
This is a proposal for a Laplace approximation algorithm to Stan.
Rendered output is here
Edit: I just totally changed the text here. A lot has changed, since the algorithm I initially put up was just wrong.
@jgabry can you look through this?
Now this just mirrors the Rstan way of doing this. The prototype was relatively easy to implement.
If someone can approve that this is vaguely the right way to go (@jgabry, @avehtari or anyone), then I'll talk to people about interfaces and how this should go in.
Prototype code is there and can be run.
I now reviewed only the design. After it's fixed, I can review code, too.
It sounds the same. The code is basically from
https://github.com/stan-dev/rstan/blob/develop/rstan/rstan/R/stanmodel-class.R#L446
onwards. The main thing is that it is better to finite-diff the autodiffed gradient than to finite-diff the log_prob function twice.
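For what it's worth, the two approaches are easy to compare in a toy numpy sketch (the `log_prob` and `grad` here are hypothetical stand-ins for Stan's log density and its autodiffed gradient, not the stan-dev/math code):

```python
import numpy as np

def log_prob(x):
    # Toy target: standard-normal log density; its exact Hessian is -I.
    return -0.5 * np.dot(x, x)

def grad(x):
    # Analytic gradient, standing in for the autodiffed gradient.
    return -x

def hessian_from_grad(g, x, eps=1e-5):
    # Central finite differences of the gradient: one level of
    # truncation error, O(n) gradient evaluations.
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        H[:, i] = (g(x + e) - g(x - e)) / (2 * eps)
    return 0.5 * (H + H.T)  # symmetrize

def hessian_double_fd(f, x, eps=1e-4):
    # Finite-differencing the log density twice: O(n^2) evaluations
    # and an extra level of truncation/roundoff error in general.
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

x0 = np.array([0.3, -1.2])
H_grad = hessian_from_grad(grad, x0)   # ~ -I
H_dbl = hessian_double_fd(log_prob, x0)  # ~ -I here too (quadratic toy)
```

On this quadratic toy both come out accurate; the difference shows up on real models, where the double-FD version is the one that goes indeterminate first.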
On Mon, Mar 16, 2020 at 4:01 PM Ben Bales [email protected] wrote:
@bbbales2 commented on this pull request.
In designs/0016-hessian_optimize_cmdstan.md https://github.com/stan-dev/design-docs/pull/16#discussion_r393279345:
+Providing draws instead of the Laplace approximation itself is rather inefficient, but it is the easiest thing to code.
+
+We also have to deal with possibly singular Hessians. This is why I also added the laplace_diag_shift to overcome these. They'll probably be quite common, especially with Hessians computed with finite differences.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+Another design would be to print the Hessian on the unconstrained space and let users handle the sampling and the parameter transformation. The issue here is that there is no good way for users to do these parameter transformations outside of certain interfaces (at least RStan, maybe PyStan).
+
+Another design would be to print a Hessian on the constrained space and let users handle the sampling. In this case users would also be expected to handle the constraints, and I don't know how that would work practically (rejection sampling, maybe?).
+
+# Prior art
+[prior-art]: #prior-art
+
+RStan does a version of this already.
I should check that it's actually the same. @bgoodri https://github.com/bgoodri is this actually the same?
In terms of the right way to go, autodiff is better (for both speed and precision) where it's supported, but it doesn't exist for some of our higher-order solvers like ODEs.
@avehtari this should be available to test here:
git clone --recursive --branch=feature/design-doc-16-optimize-hessian https://github.com/stan-dev/cmdstan.git cmdstan-hessian
Currently I only implemented stuff for the lbfgs optimizer. Could you give this a go and check me for statistical correctness in terms of what I'm spitting out?
There are some software design issues:
- Need to share calculations between all the different optimizers (lbfgs, bfgs, and Newton)
- If we get errors on log_p evaluation, what to do
- If we get errors on generated quantities evaluation, what to do
- Do we print the Laplace draws in the output file? If we do, we have a csv with two different formats of output. The optimization output has fewer columns than the sampling output, and they're also different things.
I'd prefer to print a warning and skip the sample for 2 and 3 instead of blowing up. In that case, if you request N posterior draws you might only get M < N actual draws. Blowing up is bad because then you have to redo the whole optimization (and things might still blow up). I think if we do rejection sampling to make sure we get N posterior draws, we'd be in danger of people writing models that never finish.
Edit: I guess options for 4 are:
a. Leave the different outputs in the same file
b. Get rid of the actual MAP estimate; just print samples from the Laplace approximation
c. Make a laplace_file argument and put the draws in there
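The warn-and-skip behavior for 2 and 3 would look roughly like this numpy sketch (all names hypothetical, not the actual cmdstan code):

```python
import numpy as np
import warnings

def constrain_and_eval(draw):
    # Hypothetical stand-in for constraining a draw and evaluating
    # log_prob / generated quantities; raises when evaluation fails.
    if draw[0] > 2.0:  # pretend evaluation errors out here
        raise ValueError("log_prob evaluation failed")
    return draw

def laplace_draws(mode, chol_cov, n_requested, seed=0):
    # Draw from N(mode, Sigma); warn and skip draws whose evaluation
    # errors, so we may return M < n_requested draws instead of aborting
    # the whole run.
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_requested):
        z = rng.standard_normal(mode.size)
        candidate = mode + chol_cov @ z
        try:
            draws.append(constrain_and_eval(candidate))
        except ValueError as e:
            warnings.warn(f"Skipping draw: {e}")
    return np.array(draws)

draws = laplace_draws(np.zeros(2), np.eye(2), 1000)
print(len(draws))  # M <= 1000
```

No rejection loop, so a model whose evaluations keep failing still terminates; you just get fewer draws back.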
@bgoodri I'm currently just too lazy to write and test the hessian_auto equivalent that uses vars. If someone wants to yell at me I guess I'd get unlazy and go do it.
@bob-carpenter I think he was just saying finite diff reverse mode to get the Hessian. That makes a lot of sense, but I just wanted to lean on what was available in stan-dev/math instead of writing my own.
You are going to have more indeterminate Hessians if you use finite differences twice.
Sigh, I guess I'll eat my vegetables. stan/math/rev/functor/finite_diff_hessian_auto.hpp or something incoming, but we should still be able to talk through the other issues before I get this done.
Do we print the laplace draws in the output file? If we do we have a csv with two different formats of outputs. The optimization output has less columns than the sampling output, and they also are different things.
Change the output for optimization also when there are no approximate draws? If I remember correctly, ADVI has the modal value as the first row and, if the number of draws is > 0, the rest of the rows are draws. optimizing could follow that.
I believe that the discussion is missing a critical point -- the current optimizer behavior is not appropriate for computing posterior Laplace approximations. The RStan reference implementation has never been correct.
Remember that the current optimizer doesn't include Jacobian corrections, so instead of computing a maximum of the posterior density it's actually computing the maximum of a penalized likelihood function. The Hessian at this penalized maximum likelihood estimate could be used for defining confidence intervals (although because of the penalty functions they wouldn't have clear coverage properties), but the samples wouldn't have any meaning in this context.
In order to compute a Laplace approximation the optimizer would have to run with the Jacobian corrections turned on so that the maximum posterior density and the Hessian of the right posterior density is computed. This is sufficiently different behavior that it would at the very least warrant its own method rather than trying to force it into an inappropriate penalized maximum likelihood context.
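The effect of the Jacobian term on the optimum is easy to see in a toy numpy sketch (an unnormalized Gamma density standing in for a posterior over a positive parameter; not Stan code):

```python
import numpy as np

# Toy target on the constrained space sigma > 0: an unnormalized
# Gamma(a, b) density (hypothetical stand-in for a posterior).
a, b = 3.0, 2.0
def log_p(sigma):
    return (a - 1) * np.log(sigma) - b * sigma  # up to a constant

# Optimize on the unconstrained scale x = log(sigma) by grid search.
x = np.linspace(-5.0, 5.0, 200001)

# Without the Jacobian term (what the current optimizer does): the
# optimum maps back to the constrained-space mode, (a - 1) / b.
mode_no_jac = np.exp(x[np.argmax(log_p(np.exp(x)))])

# With the Jacobian term log|d sigma / d x| = x: the mode of the
# posterior density on the unconstrained space, sigma = a / b.
mode_jac = np.exp(x[np.argmax(log_p(np.exp(x)) + x)])

print(mode_no_jac)  # ~ (a - 1) / b = 1.0
print(mode_jac)     # ~ a / b = 1.5
```

The two optima differ, so a Hessian-based approximation built at the no-Jacobian optimum is centered on the wrong point for a posterior Laplace approximation.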
maximum of the posterior density it's actually computing the maximum of a penalized likelihood function.
Can you say this in terms of unconstrained and constrained space? We want the mode in the unconstrained space. Do we get that or something else (I've seen the explanation before, but keep forgetting enough to be uncertain)?
but the samples wouldn't have any meaning in this context.
When using importance sampling they have meaning of being draws from a proposal distribution. It can be that this is not the optimal proposal distribution, but they have meaning and diagnostic when that proposal distribution is bad. Of course we would like to have the best proposal distribution.
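A minimal numpy sketch of that use: draws from a deliberately offset normal proposal (standing in for an imperfect Laplace approximation), reweighted by self-normalized importance sampling:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy unnormalized target: N(1, 1). The proposal N(0, 1) is
# deliberately offset, mimicking an imperfect Laplace approximation.
def log_p(x):
    return -0.5 * (x - 1.0) ** 2

draws = rng.normal(0.0, 1.0, size=20000)
log_q = -0.5 * draws**2

# Self-normalized importance weights: even with a poor proposal the
# draws remain meaningful as proposal draws, and the weight
# distribution itself diagnoses how bad the proposal is.
log_w = log_p(draws) - log_q
w = np.exp(log_w - log_w.max())
w /= w.sum()

ess = 1.0 / np.sum(w**2)   # effective sample size, a crude diagnostic
est = np.sum(w * draws)    # importance-sampling estimate of E[x], ~1
print(est)
print(ess)                 # well below 20000 when the proposal is off
```

The PSIS khat diagnostic refines this idea by looking at the tail of the weight distribution rather than just the effective sample size.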
maximum of the posterior density it's actually computing the maximum of a penalized likelihood function.
Can you say this in terms of unconstrained and constrained space? We want the mode in the unconstrained space. Do we get that or something else (I've seen the explanation before, but keep forgetting enough to be uncertain)?
The current code computes the mode on the constrained space, not the unconstrained space.
but the samples wouldn't have any meaning in this context.
When using importance sampling they have meaning of being draws from a proposal distribution. It can be that this is not the optimal proposal distribution, but they have meaning and diagnostic when that proposal distribution is bad. Of course we would like to have the best proposal distribution.
Sure, but is that worth changing the output of optimizing and confusing users as to its meaning? I think that if samples are included then many users will use them as if they were exact samples or samples from the Laplace approximation and not weight things carefully enough.
To be clear I have no problem with adding this functionality, I just think that it would be much clearer to have a new route with this functionality that wraps the internal optimizer and adds all of this post processing. That way the current optimizer can stay as it is and the new route could turn on the Jacobian corrections as needed.
To put it another way, we do the optimization on the unconstrained space, but turn off the Jacobian adjustment for the change of variables. That's easy enough to turn on from the model class---it's just a template parameter. So maybe we need another flag on the service interface for optimization for +/- Jacobian.
I'm also with @betanalpha that we don't want to change current behavior, especially by automatically returning something much larger.
The current code computes the mode on the constrained space, not the unconstrained space.
I wasn't thinking about that. Good catch. I guess these are just different functions with and without the adjustments.
I just think that it would be much clearer to have a new route with this functionality that wraps the internal optimizer and adds all of this post processing.
Sounds like this would be better suited as a new algorithm? That way there's no point estimate/sample output confusion either -- the Laplace algorithm only samples from a Laplace approximation to the posterior.
So we'd have (1) MLE without Jacobian correction, with std errors from the Hessian on the unconstrained scale translated back to constrained, and (2) Laplace approximation using Jacobian correction, with a set of draws returned.
It'd be nice to retrieve the parameters to the normal for the Laplace, as it'd let you generate more draws in the future. But that's very much a secondary consideration compared to getting the draws out in our standard interface.
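For (1), translating standard errors back to the constrained scale is just the delta method. A sketch with made-up numbers for a positive parameter sigma = exp(x):

```python
import numpy as np

# Hypothetical unconstrained-scale results for a positive parameter
# sigma = exp(x): mode x_hat and Hessian H of the log density at x_hat.
x_hat = 0.4
H = -25.0

# Standard error on the unconstrained scale from the negative inverse
# Hessian, then translated back via the delta method:
# se(sigma) ~ |d sigma / d x| * se(x) = exp(x_hat) * se(x).
se_x = np.sqrt(-1.0 / H)
se_sigma = np.exp(x_hat) * se_x

print(np.exp(x_hat))  # point estimate on the constrained scale
print(se_sigma)
```

The same first-order translation applies entry-wise for a multivariate Hessian, using the Jacobian of the constraining transform.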
Laplace approximation using Jacobian correction
I think we'd only do this, however we end up doing it interface-wise. At least, this is what I meant to do so that is my goal
The current code computes the mode on the constrained space, not the unconstrained space.
Thanks Mike, now I remember we have discussed this same issue several times before! It seems I just keep forgetting it because the mode and Hessian in the unconstrained space are so much more natural for me. Great if we can now also have the mode and Hessian in the unconstrained space.
This popped up over here: https://github.com/stan-dev/pystan/issues/694
I still need it for my own purposes too (currently just using a version of cmdstan with this hacked in).
Would a new algorithm be appropriate? So algorithms would be sampling, optimization, variational, laplace? If so I'll rewrite the design doc with that in mind and we can iterate on it.
Makes sense to me.
@avehtari @betanalpha @bgoodri this is ready for comments again.
I rewrote it as a new algorithm since the first thing was just wrong. I think the differences @betanalpha pointed out between what we want to do here and optimization are enough that I kinda don't want to put them together, even if the gears are basically the same. I dunno. Opinions welcome.
It's kinda awkward that it'll share all the same parameters as optimization but be a different algorithm, but I guess that's just interface overhead.
I think @bgoodri is right and this should be written with finite differences of gradients instead of double finite differences of the log density. I was working with this code hacked together on another project and this was important (and not easily diagnosed). It makes me want to do full second-order autodiff, but we can't with the higher-order algorithms.
Speaking of Hessians, we should also change the testing framework to use finite diffs of gradients rather than double finite diffs for Hessians. I think it could improve the tolerances dramatically.
how to handle errors evaluating the log density
why not then output NA, -Inf, or Inf based on what the log density evaluation returns?
An alternative that seems appealing at first is printing out the mode and hessian so that it would be possible for people to make their own posterior draws.
This would be useful if people want more draws without needing to run the optimizer and Hessian computation again, but it's different behavior and would need to be considered together with the variational output.
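For reference, generating draws from a printed mode and Hessian is only a few lines (hypothetical helper; the diagonal shift mirrors the proposed laplace_diag_shift):

```python
import numpy as np

def draws_from_mode_hessian(mode, H, n, shift=0.0, seed=0):
    # Sample from the Laplace approximation N(mode, -H^{-1}) given a
    # printed mode and (unconstrained-space) Hessian H. A small
    # diagonal shift can regularize a near-singular Hessian.
    rng = np.random.default_rng(seed)
    neg_H = -(H - shift * np.eye(H.shape[0]))
    L = np.linalg.cholesky(neg_H)          # neg_H = L L^T
    z = rng.standard_normal((n, mode.size))
    # Solve L^T u = z so that Cov(u) = (L L^T)^{-1} = -H^{-1}.
    u = np.linalg.solve(L.T, z.T).T
    return mode + u

mode = np.array([1.0, -0.5])
H = np.array([[-4.0, 1.0], [1.0, -2.0]])
draws = draws_from_mode_hessian(mode, H, 50000)
print(draws.mean(axis=0))  # ~ mode
print(np.cov(draws.T))     # ~ -inv(H)
```

The catch, as noted above, is that the mode and Hessian are on the unconstrained space, so users would still need the constraining transforms to get constrained-space draws.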
how to handle errors evaluating the log density
why not then output NA, -Inf, or Inf based on what the log density evaluation returns?
Errors raise exceptions, so there's not a return value. We can also have NaN propagate through (there's no NA in Stan like in R, but presumably we're talking about not-a-number).
why not then output NA, -Inf, or Inf based on what the log density evaluation returns?
I don't think there's a way to write these to csv files. Maybe we already have a convention for it though?
This would be useful if people want more draws without need to run the optimizer and Hessian computation again
It's a drawback. Getting draws is somewhat expensive since we evaluate the model log density, so there's that.
Lookin' for some feeeedback. Otherwise I'll just do it.
The actual new code for this won't be much, but it'll be a lot of interface bits.
I just want to clarify that this design doc does not address the computational validation needed for an algorithm to be considered for inclusion in the Stan Algorithm library.
On Apr 30, 2020, at 2:44 AM, Aki Vehtari [email protected] wrote:
@avehtari approved this pull request.
@betanalpha --- what would you like to see in the way of validation for something like this?
The current policy is at https://github.com/stan-dev/stan/wiki/Proposing-Algorithms-for-Inclusion-Into-Stan.
I had thought this was more like a fix for the previous algorithm, but I also now realize the design document does not mention diagnostics. The Laplace algorithm itself is not novel, and there is extensive literature demonstrating when it works and when it fails. The PSIS paper (https://arxiv.org/abs/1507.02646) describes the diagnostics we can use, in the same way as we have diagnostics for MCMC. We can compute Monte Carlo standard errors and diagnose when the Laplace approximation fails and when MCSEs are not reliable.
It’s more complicated than that. The Laplace algorithm has never been in Stan. We started with dynamic HMC for estimating posterior expectation values and penalized maximum likelihood for computing point estimates. ADVI was shoehorned in later somewhat awkwardly as another method for (poorly) estimating posterior expectation values.
We’ve never had a MAP algorithm or Laplace or anything else that claims to do posterior expectation value estimation. The RStan developers added the Hessian hack on top of the penalized maximum likelihood algorithm but that was a unilateral decision of the RStan developer and never part of an official algorithm discussion. Plus, as previously discussed, it was fundamentally flawed as it was based on the wrong optimum.
There have been numerous MCMC algorithms developed in Stan that were never exposed because they were too fragile (random walk, ensemble, etc.).

[…] for typical user models but would work fine for the same models on which the Laplace approximation is reasonably accurate. The precedent, other than the messy political situation of ADVI, has been to not include any algorithms that aren't reasonably robust over typical user models. Diagnostics are necessary but also not sufficient; adding fragile algorithms that work for small classes of models only confuses the Stan user experience. This is discussed further in https://github.com/stan-dev/stan/wiki/Proposing-Algorithms-for-Inclusion-Into-Stan.
I agree that there is plenty of literature on the Laplace algorithm, but I disagree that there is solid theory backing up the robustness of the method in the preasymptotic, mostly non-log concave regime where typical user models live. Without that relevant theory we’ll need substantial empirical evidence for consideration of inclusion.