
Switch to turn Jacobian on/off for optimization

bob-carpenter opened this issue 7 years ago • 11 comments

Summary:

Add an option to the services to include or exclude the log Jacobian determinant of the unconstrained-to-constrained transform.

Description

Right now, we automatically disable the Jacobian for optimization, which produces posterior mode estimates on the constrained scale. This is the maximum a posteriori (MAP) estimate most users are going to expect.

It would be nice to be able to do max posterior mode estimates on the unconstrained scale. The primary motivation is Laplace approximations on the unconstrained scale, which can be sampled from and then mapped back to the constrained scale.

This will have to start by adding a boolean argument, apply_jacobian, to the service function lbfgs() in lbfgs.hpp to turn the Jacobian on or off. The default value should be false.
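A minimal sketch of what the services-level change might look like; everything here other than the apply_jacobian flag is a placeholder, not the actual argument list of stan::services::optimize::lbfgs():

// Sketch only: placeholder arguments; the real lbfgs() in
// stan/services/optimize/lbfgs.hpp takes many more parameters
// (tolerances, history size, writers, interrupt, logger, ...).
namespace services {
namespace optimize {

template <class Model>
int lbfgs(Model& model, unsigned int random_seed, double init_radius,
          int num_iterations, bool save_iterations,
          bool apply_jacobian = false) {  // proposed flag; false keeps current behavior
  // apply_jacobian would be threaded down through BFGSLineSearch and
  // ModelAdaptor from here.
  return 0;
}

}  // namespace optimize
}  // namespace services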

Then, the logic will have to be traced down through the optimizer, which is an instance of optimization::BFGSLineSearch<Model, optimization::LBFGSUpdate<> > (confusingly defined in bfgs.hpp, not in bfgs_linesearch.hpp).

From the line search, the flag has to be passed down to the call to log_prob_propto, which is currently hard-coded as log_prob_propto<false>. That will have to be done through the class ModelAdaptor, which should get a template parameter for applying the Jacobian.

I'm not sure how high up that template parameter should go. At some point, the boolean variable in the services call will have to be used in a conditional to determine which version of log_prob_propto to call:

if (apply_jacobian)
  log_prob_propto<true>(...);
else
  log_prob_propto<false>(...);

because we can't use a dynamic variable as a template parameter.
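Here is a self-contained sketch of that dispatch pattern. The ToyModel and its log_density/log_jacobian methods are hypothetical stand-ins, not Stan's actual Model concept; only the branch-once-then-instantiate structure is the point.

#include <cmath>
#include <iostream>

// Toy model: unconstrained parameter y with sigma = exp(y) and
// sigma ~ exponential(1), so log p(sigma) = -sigma up to a constant.
struct ToyModel {
  double log_density(double y) const { return -std::exp(y); }
  // log |d sigma / d y| = y for the exp transform.
  double log_jacobian(double y) const { return y; }
};

// The Jacobian choice is a compile-time template parameter, mirroring
// the current hard-coded log_prob_propto<false>.
template <bool ApplyJacobian, class Model>
double log_prob_propto(const Model& model, double y) {
  double lp = model.log_density(y);
  if (ApplyJacobian)
    lp += model.log_jacobian(y);
  return lp;
}

int main() {
  ToyModel model;
  double y = 0.3;
  bool apply_jacobian = true;  // runtime flag coming from the services call
  // Branch once on the runtime flag to pick the compile-time instantiation.
  double lp = apply_jacobian ? log_prob_propto<true>(model, y)
                             : log_prob_propto<false>(model, y);
  std::cout << "log density: " << lp << "\n";
  return 0;
}

Where the branch sits determines how much code is instantiated twice: branching near the top of the services call duplicates the whole optimizer path for true and false, while branching lower keeps the duplication local to the log-density call.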

Current Version:

v2.17.1

bob-carpenter avatar Feb 20 '18 21:02 bob-carpenter

Adding the Jacobian here won't give MAP, because optimization is already doing MAP without the Jacobian. The current behavior would only be maximum likelihood if there were no priors specified in the model.

What exactly is the use case for this feature? I'm hard-pressed to understand why someone would want this, since the results depend on exactly how we define the constraining transformations, which could change.

mbrubake avatar Feb 21 '18 05:02 mbrubake

@avehtari requested this feature, so maybe he can comment.

It's a bit opaque to users what's going on now, so one thing we can do is provide better doc. I tried in a case study, but I think I need to try harder.

If I'm thinking about this right, what we have is:

  • no Jacobian: MAP for the constrained model

  • Jacobian: MAP for the unconstrained model

I agree that what most users are going to expect is what we have now. I'm guessing Aki may want the latter in order to do a Laplace approximation of the unconstrained posterior, draws from which would then be translated back to the constrained space as an approximate Bayesian method.
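To make the two bullets concrete, here is a hedged single-parameter example (not from the thread): take a scale parameter $\sigma > 0$ with the usual unconstraining transform $\sigma = \exp(u)$. Then

\[
\hat{\sigma} = \arg\max_{\sigma > 0} \log p(\sigma \mid y)
\quad \text{(no Jacobian, constrained mode)},
\qquad
\hat{u} = \arg\max_{u \in \mathbb{R}} \big[\log p(e^{u} \mid y) + u\big]
\quad \text{(Jacobian, unconstrained mode)},
\]

where the extra $+u$ is the log Jacobian $\log\lvert \mathrm{d}\sigma / \mathrm{d}u \rvert = u$ of the exp transform, and in general $\exp(\hat{u}) \neq \hat{\sigma}$.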

bob-carpenter avatar Feb 21 '18 15:02 bob-carpenter

  • no Jacobian: MAP for the constrained model
  • Jacobian: MAP for the unconstrained model

Good summary, which would be good to have in the case study, too.

I agree that what most users are going to expect is what we have now.

I'm not the only one expecting the Jacobian. It seems to be what machine learning people expect.

I'm guessing Aki may want the latter in order to do a Laplace approximation of the unconstrained posterior, draws from which would then be translated back to the constrained space as an approximate Bayesian method.

Yes. The posterior is often closer to a normal distribution in the unconstrained space. This will be especially helpful when we get, for example, compound GLVM+Laplace functions. In these cases the marginal likelihood evaluations can be costly, but the marginal posterior is low-dimensional and close to normal in the unconstrained space. This is what INLA, GPstuff, etc. use.

avehtari avatar Feb 21 '18 17:02 avehtari

Right, ok, I get this. My main concern was the MLE vs MAP framing in the original issue description. Perhaps the flag could be something about "optimize posterior in constrained space" vs "optimize posterior in unconstrained space".

I can see why some ML people would expect this behaviour, but I still think it's the wrong behaviour for most "normal" (i.e., non-ML) users so we want to make sure the language/description of the option is super clear.

mbrubake avatar Feb 22 '18 00:02 mbrubake

Thanks, @mbrubake, I'll clarify in the issue---it was my sloppy language.

bob-carpenter avatar Feb 22 '18 00:02 bob-carpenter

I can see why some ML people would expect this behaviour, but I still think it's the wrong behaviour for most "normal" (i.e., non-ML) users so we want to make sure the language/description of the option is super clear.

I did mention INLA, and INLA is definitely non-ML (and full of normal :)

Perhaps the flag could be something about "optimize posterior in constrained space" vs "optimize posterior in unconstrained space".

This is a very good proposal.

avehtari avatar Feb 22 '18 16:02 avehtari

Let me chime in here as another user. Sometimes I have to settle for MAP estimates for predictive tasks because I can't afford the computational burden of full MCMC, and VB is still experimental. I'm not trying to do a Laplace approximation, I just want a better point estimate for prediction. In this case I prefer a MAP estimate on the unconstrained space, since I assume that the posterior will be less skewed, and hence the MAP estimate will be more representative of the posterior.

ksvanhorn avatar Mar 12 '18 15:03 ksvanhorn

Thanks for chiming in. It would be interesting to compare the differences from Andrew's perspective of taking MAP plus Hessian to be a kind of approximate Bayes. There are two different ways to do this (on the constrained or the unconstrained scale).

Stan doesn't give you an easy way of calculating the Hessian on the constrained space yet. That's another feature we'd like to be able to get to in order to provide traditional error approximations.

We'll definitely get to this feature.

bob-carpenter avatar Mar 12 '18 17:03 bob-carpenter

It would be interesting to compare the differences from Andrew's perspective of taking MAP plus Hessian to be a kind of approximate Bayes.

With the new distributional approximation diagnostics (https://arxiv.org/abs/1802.02538), we can also find out when the posterior is simple enough (normal enough) that MAP is good, and using PSIS we can then also correct the estimates. It's not going to be common that a normal approximation at the mode is good enough for IS, but in those specific cases users would certainly be happy to be confident that there is no need to run MCMC for better accuracy.

avehtari avatar Mar 16 '18 11:03 avehtari

One consideration worth mentioning in this context is that the results of "MAP in the unconstrained space" will depend on exactly how we parameterize the constraints. That means the results of this mode could change if the parameterization is changed under the hood. This has already happened once, when we changed how unit_vector was parameterized. It's probably not going to happen a lot, but it's not inconceivable.

mbrubake avatar Mar 16 '18 13:03 mbrubake

MAP with Jacobian on the unconstrained scale is easy to explain formally. I agree with Marcus that it's going to be a lot harder to get across conceptually. I was never really happy with the case study in which I tried to explain these things, but that's the kind of content I'd be aiming for in the manual, only done more clearly.

The unit_vector thing was a bug fix---it never worked properly until the most recent update from Ben.

bob-carpenter avatar Mar 16 '18 17:03 bob-carpenter

Closed by https://github.com/stan-dev/stan/pull/3152

WardBrian avatar Mar 20 '23 18:03 WardBrian