
Expose gradient

Open bgoodri opened this issue 11 years ago • 15 comments

One thing that we may be able to do within weeks rather than months that would allow Andrew to do more of this in Stan proper is to expose a score function (which might be somewhat useful for other purposes too), a.k.a. the gradient of the log-likelihood

http://en.wikipedia.org/wiki/Score_%28statistics%29

Thus, the estimated variance-covariance matrix of the estimates could be obtained via the Outer Product of Gradients (OPG) method; see equation (14-18) in

http://pages.stern.nyu.edu/~wgreene/NUI2011/Greene-Chapter-14.pdf
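For reference, that estimator (the OPG, a.k.a. BHHH, estimator) has, up to Greene's notation, the form

\hat{V}_{\text{OPG}} = \left[ \sum_{i=1}^{N} \hat{g}_i \hat{g}_i^\top \right]^{-1}, \qquad \hat{g}_i = \nabla_\theta \log f(y_i \mid \theta) \Big|_{\theta = \hat{\theta}},

where \hat{g}_i is the score of the i-th observation evaluated at the estimate, which is why a per-observation score() is needed.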

That way Andrew could write a generated quantities block like this:

generated quantities {
  matrix[K,K] H;
  matrix[K,K] V;
  vector[K] draws[SIMS];
  vector[SIMS] nald;  // log density of the normal approximation
  vector[SIMS] pll;   // penalized log-likelihood

  // score() is the proposed function; foo_log is a placeholder likelihood.
  H <- rep_matrix(0.0, K, K);
  for (i in 1:N) H <- H + tcrossprod(score(foo_log(y[i], ...)));
  H <- H / N;
  V <- inverse_spd(H);
  for (i in 1:SIMS) {
    draws[i] <- multi_normal_rng(theta, V);
    nald[i] <- multi_normal_log(draws[i], theta, V);
    pll[i] <- /* compute penalized log-likelihood evaluated at draws[i] */;
  }
}

bgoodri avatar Mar 14 '14 16:03 bgoodri

If I understood correctly, Ben commented on the mailing list that score() is the gradient of the log likelihood.

I have no idea how we could implement that in Stan as it stands now. Once we add functions, it should be much more practical to do this, at least for double inputs in the generated quantities block. But if we need to roll it into the model itself, it'd require higher-order autodiff, as Marcus pointed out on the list, and would be hugely complicated to implement.

From outside of Stan's modeling language, it's easy to get the gradient w.r.t. the transformed (unconstrained) parameters, but those are derivatives of the whole log probability, not just the likelihood. So it may require writing a second model just for the log likelihood.

bob-carpenter avatar Mar 14 '14 16:03 bob-carpenter

Well, Andrew was trying to use this in an ML context, so if we can only get it working for that first, it would be okay. The generated quantities block would have to compute the log-likelihood for each observation (as for WAIC) anyway, because you need to sum the outer product of the score vectors over observations.

bgoodri avatar Mar 14 '14 16:03 bgoodri

Hi, thanks for thinking of me! For WAIC and cross-validation we will want to evaluate individual terms of the log likelihood, but for MLE etc. we just need to evaluate the entire log joint density; I don't think we'll need to be doing it term by term. A

andrewgelman avatar Mar 14 '14 17:03 andrewgelman

You do need the score on an observation-by-observation basis to estimate the Hessian this way. Maybe there is some other way to expose the Hessian to the generated quantities block.

bgoodri avatar Mar 14 '14 17:03 bgoodri

This would be useful for PyStan. I'll reopen the issue. If there's a better place for this, I'd be happy to migrate it.

ariddell avatar Mar 05 '16 19:03 ariddell

@ariddell, I think this is a language issue.

@bgoodri, are you envisioning something like this only happening in generated quantities?

syclik avatar Nov 30 '16 15:11 syclik

No; the likelihood can be a function of partial derivatives.

bgoodri avatar Nov 30 '16 16:11 bgoodri

Hmm... that makes it much more of a pain, but I don't think it's impossible. We'd need to add the score() function to the math library that does the nested autodiff, right? Then expose that through the language?

syclik avatar Nov 30 '16 16:11 syclik

Something like that.

bgoodri avatar Nov 30 '16 16:11 bgoodri

I assume it is still not possible to expose the log-likelihood gradient w.r.t. the parameters in, say, the transformed parameters block? If so, I'd love to +1 this issue - it would allow a much more flexible use of Stan. In my case, for now I would like to use Stan to just evaluate log-{posterior, likelihood, prior} probabilities and their gradients for a given Stan model and use them for a custom sampler / inference method.

simeoncarstens avatar Dec 13 '22 09:12 simeoncarstens

Accessing those within the blocks of a Stan model is not likely to be available any time soon.

In my case, for now I would like to use Stan to just evaluate log-{posterior, likelihood, prior} probabilities and their gradients for a given Stan model and use them for a custom sampler / inference method.

If I'm reading this correctly, you can already accomplish this in R (via grad_log_prob in rstan or cmdstanr), or through other interfaces via BridgeStan: https://discourse.mc-stan.org/t/bridgestan-1-0-0-released/29742
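For instance, here is a minimal Python sketch using BridgeStan; the file names are placeholders, and the from_stan_file constructor and log_density_gradient method are my reading of BridgeStan's Python API, so check them against the version you install:

import numpy as np
import bridgestan as bs

# Compile and load the model; "model.stan" and "data.json" are placeholders.
model = bs.StanModel.from_stan_file("model.stan", "data.json")

# BridgeStan works on the unconstrained scale; evaluate the log density of
# the posterior and its gradient at an arbitrary point.
theta_unc = np.zeros(model.param_unc_num())
lp, grad = model.log_density_gradient(theta_unc, propto=True, jacobian=True)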

andrjohns avatar Dec 13 '22 10:12 andrjohns

Oh wow, I did not know about BridgeStan! That looks super useful for my use case - I am currently using httpstan (and not even pystan :facepalm:) to access the log-posterior probability and its gradient, which is very, very slow. BridgeStan should be much faster for this purpose. Thanks a lot for that hint!

As for my specific question above, BridgeStan or other interfaces indeed allow access to the log probability and the gradient, but, if I'm not mistaken, only for the posterior. What I would love to have is access to the values of the log-likelihood and the log-prior(s) and their gradients separately. I think I can access the log-likelihood separately by implementing it manually in the transformed parameters block and reading it in BridgeStan via param_constrain with include_tp set to true. But that does not give me the gradient :slightly_frowning_face:

simeoncarstens avatar Dec 13 '22 13:12 simeoncarstens

Sorry, @simeoncarstens, but Stan models don't separate prior and likelihood.

What you can do, in a pinch, is write two Stan models, one with the likelihood and one with the prior. You could even get clever with includes to not duplicate code, but it gets messy.

Or, you can write the log prior in a conditional,

data {
  int<lower=0, upper=1> include_prior;  // Stan has no bool type
}
...
model {
  ...
  if (include_prior) { ... prior ... }
}

Then you can use the same model, setting the data value to 1 to include the prior and 0 to exclude it. Then there's just one model, but it's a bit messier. You can do the same with the likelihood to evaluate only the prior if you'd like.
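As a sketch, and assuming the hypothetical include_prior flag above plus BridgeStan's Python API, the separate pieces (and their gradients) could then be recovered from two instances of the same program:

import numpy as np
import bridgestan as bs

# Same Stan program, two data files: include_prior = 1 gives the full
# posterior, include_prior = 0 gives the likelihood alone. File names
# are placeholders.
full = bs.StanModel.from_stan_file("model.stan", "data_prior_on.json")
lik = bs.StanModel.from_stan_file("model.stan", "data_prior_off.json")

theta_unc = np.zeros(full.param_unc_num())
lp_full, g_full = full.log_density_gradient(theta_unc)
lp_lik, g_lik = lik.log_density_gradient(theta_unc)

# log posterior = log likelihood + log prior (up to a constant), and
# gradients are linear, so the prior's piece falls out by subtraction.
lp_prior = lp_full - lp_lik
g_prior = g_full - g_lik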

bob-carpenter avatar Dec 13 '22 14:12 bob-carpenter

Thanks a lot, @bob-carpenter! Good to have a definite answer and to know that this is not directly possible. But even if in fact a bit messy, I think your proposed solution of having a "conditional" model might do the job for me. Two separate Stan models would indeed be a bit unwieldy.

For context, this is about allowing a Replica Exchange / Parallel Tempering algorithm to apply a temperature to the likelihood only instead of to the full posterior (and then sampling with HMC, hence the gradients).
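For reference, the tempered target in that scheme raises only the likelihood to the inverse temperature \beta,

\pi_\beta(\theta) \propto p(y \mid \theta)^{\beta} \, p(\theta), \qquad \nabla_\theta \log \pi_\beta(\theta) = \beta \, \nabla_\theta \log p(y \mid \theta) + \nabla_\theta \log p(\theta),

so with the conditional-prior trick above, the tempered log density and gradient are just \beta times the likelihood-only values plus the prior's values.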

simeoncarstens avatar Dec 13 '22 17:12 simeoncarstens

I've been thinking more about tempering algorithms lately and this is a missing feature in Stan. It'd be nice to be able to set data in a Stan program dynamically. Then we could temper the likelihood, for example for sequential Monte Carlo applications.

bob-carpenter avatar Dec 14 '22 15:12 bob-carpenter