Help with warmup
I hope I am not abusing this channel with a very specific question. I am trying to use HMC to perform Bayesian inference in an astrophysical simulation. Unfortunately, I cannot provide here the log posterior I am using, since it is the result of a calculation involving several thousand lines of code.
My test posterior distribution is defined on a 6-dimensional parameter space. I studied the same problem with a variety of MCMC tools, and the "good" results are summarized in this pair plot:
With HMC, I recover this result only occasionally. Since the priors for many parameters do not have the entire real axis as support, I am doing a re-parametrization of the problem using the following transformation:
y = StatsFuns.norminvcdf.(Distributions.cdf.(prior, x))
where prior is the vector of prior distributions for each component of x. With this mapping, each element of the new parameter space y has a normal prior distribution with zero mean and unit variance. The log posterior in the new variables also takes a simple form, as it can be written as
logposterior(y) = loglikelihood(x(y)) + sum(StatsFuns.normlogpdf, y)
where x(y) is the inverse of the transformation written above.
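To make this concrete, here is a minimal sketch of the setup (the priors and the likelihood below are placeholders standing in for my actual model):

using Distributions, StatsFuns

# placeholder priors, standing in for my actual 6-dimensional prior
prior = [Uniform(0, 1), LogNormal(), Normal(), Beta(2, 2), Exponential(), Normal()]

# forward map x -> y (each component of y gets a standard normal prior) and its inverse
to_y(x) = StatsFuns.norminvcdf.(Distributions.cdf.(prior, x))
to_x(y) = Distributions.quantile.(prior, StatsFuns.normcdf.(y))

# transformed log posterior; my_loglikelihood is my (expensive) model evaluation
logposterior(y) = my_loglikelihood(to_x(y)) + sum(StatsFuns.normlogpdf, y)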
My problem is that when I try to use HMC, the warmup phase seems to fail very often. As mentioned, occasionally I obtain results that are consistent with the ones I have verified with independent MCMC techniques, but more often the chain returned is completely off. I have tried to diagnose the reasons, but have failed so far. Also, all the diagnostics I have checked actually look good for HMC, in spite of the wrong results: Rhat is very close to unity and EBFMI is quite large (around 0.8-1.2). The trace plots do not reveal anything strange either, and summarize_tree_statistics returns something that looks OK to me (although perhaps the depth is too high?):
Hamiltonian Monte Carlo sample of length 10000
acceptance rate mean: 0.94, 5/25/50/75/95%: 0.77 0.92 0.98 0.99 1.0
termination: divergence => 0%, max_depth => 0%, turning => 100%
depth: 0 => 0%, 1 => 0%, 2 => 1%, 3 => 3%, 4 => 7%, 5 => 21%, 6 => 34%, 7 => 33%, 8 => 0%
In other words, it looks like the warmup phase is completely failing to converge to the typical set. I tried to change the warmup_stages, using for example default_warmup_stages(; middle_steps=50, doubling_stages=7), but with no luck; also a change to a symmetric mass matrix did not help.
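For reference, this is roughly the call I am making (∇ℓ is my gradient-enabled transformed log posterior; middle_steps and doubling_stages are the keywords from the default_warmup_stages docstring):

using DynamicHMC, Random

# ∇ℓ: my transformed log posterior, wrapped so that it supports logdensity_and_gradient
results = mcmc_with_warmup(Random.default_rng(), ∇ℓ, 10_000;
                           warmup_stages = default_warmup_stages(; middle_steps = 50,
                                                                   doubling_stages = 7))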
I feel I am missing something fundamental here, since I would expect HMC to easily sample this relatively low-dimensional, unimodal distribution. Any help would be really appreciated!
The first thing I would do is start the warmup from a local mode, if one can be found with reasonably little effort. It used to be default for this package, but I removed it because it is not always well-defined (eg in hierarchical models).
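Roughly something like the following sketch, if you want to try it (Optim is just one way to find a local mode; the initialization keyword is described in the mcmc_with_warmup docstring, so please double-check the exact field names there):

using Optim, DynamicHMC, Random

# ℓ(y): log posterior in the unconstrained parametrization, ∇ℓ: its gradient-enabled wrapper
opt = optimize(y -> -ℓ(y), zeros(6), LBFGS())   # maximize the log posterior
y_mode = Optim.minimizer(opt)

# start the warmup at the (approximate) local mode
results = mcmc_with_warmup(Random.default_rng(), ∇ℓ, 10_000;
                           initialization = (q = y_mode,))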
Also, if you are implementing the transformation manually, are you adding the log Jacobian determinant correction? Have you considered the transformations in TransformVariables.jl (which does this automatically)?
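For example, a sketch along these lines (the component transformations are placeholders; transform_logdensity adds the log Jacobian term for you):

using TransformVariables

# placeholder: one transformation per parameter, chosen to match each prior's support
trans = as((asℝ₊, as𝕀, asℝ, as(Real, 0.0, 10.0), asℝ₊, asℝ))

# logposterior_x: the log posterior in the original (constrained) variables;
# the log |det J| correction is added automatically
ℓ(y) = transform_logdensity(trans, x -> logposterior_x(collect(x)), y)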
(and asking questions here is fine)
Thank you for your kind reply. Yes, I think I am adding the log Jacobian correctly. In fact, calling logposterior_x the original log posterior and logposterior_t the one in the new transformed coordinates, I have
x = rand.(prior)
t(x) = StatsFuns.norminvcdf.(Distributions.cdf.(prior, x))
logposterior_t(t(x)) ≈ logposterior_x(x) - logdet(ForwardDiff.jacobian(t, x))
which should be the correct relation.
Starting near the mode completely solves all issues. However, in general I will not be in the very convenient situation of knowing the mode, and I was actually hoping that HMC would help me discover it! Does this mean that HMC should be run with a starting point obtained from standard non-linear optimization techniques? In other words, is HMC efficient in the exploration of the typical set, but not so good at finding the mode? That would not be a good thing for me, since in high dimensions finding the mode might be very challenging (and I am not even touching the case of multimodal distributions).
As a related question: I understood from Betancourt's papers and talks that the mode is actually not a good starting point for MCMC in high-dimensional spaces, as it is far from the typical set. I might not be seeing this issue (yet) because my problem is only in dimension 6, and in fact starting from the mode helps a lot for me now, but is this also true for, say, dim=50 or 100? Am I missing something here?
First, note that the above is not the correct transformation; you want logabsdet (note the abs). But generally, this is not calculated using AD, but using closed-form transformations. Since the Jacobian is diagonal (elementwise transformation), this should be simple. Also, norminvcdf may be really costly compared to the logistic family in TransformVariables.jl.
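For your probit-style map the closed form is straightforward: each component is y_i = norminvcdf(cdf(prior_i, x_i)), so dy_i/dx_i = pdf(prior_i, x_i) / normpdf(y_i), and the log Jacobian determinant is just a sum over components. A sketch:

using Distributions, StatsFuns

# elementwise transform, diagonal Jacobian:
# log|det J| = Σ_i logpdf(prior_i, x_i) − Σ_i normlogpdf(y_i)
function logabsdet_jacobian(prior, x)
    y = StatsFuns.norminvcdf.(Distributions.cdf.(prior, x))
    sum(Distributions.logpdf.(prior, x)) - sum(StatsFuns.normlogpdf, y)
end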
If fixing that does not help, I only have some wild guesses, since it is difficult to debug this without an MWE. I would consider these:
- using TransformVariables.as𝕀 instead,
- checking the gradient using finite differences (see logdensity_and_gradient); AD is usually correct in Julia, but it is worth a shot, especially if the log posterior is a complex calculation (see the sketch after this list),
- fixing 1-5 variables, eg at the mode, and just sampling the rest; then, if that works, free variables one at a time and see what happens to convergence,
- making sure that the posterior is proper.
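For the gradient check, a crude sketch (∇ℓ stands for whatever gradient-enabled log density object you pass to the sampler):

using LogDensityProblems: logdensity, logdensity_and_gradient

# compare the AD gradient against central differences at a point y0
function check_gradient(∇ℓ, y0; h = 1e-5)
    _, g = logdensity_and_gradient(∇ℓ, y0)
    g_fd = map(eachindex(y0)) do i
        e = zeros(length(y0)); e[i] = h
        (logdensity(∇ℓ, y0 .+ e) - logdensity(∇ℓ, y0 .- e)) / (2h)
    end
    maximum(abs.(g .- g_fd))   # should be small relative to the gradient's scale
end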
the mode is actually not a good starting point for MCMC in high dimensional spaces, as it is far from the typical set
Generally it should not matter; for well-conditioned posteriors you end up in the typical set rather quickly anyway during adaptation.
Sorry, I presume I was not very clear in my explanation. First of all, I am not using AD to compute the transformed likelihood, but rather doing it analytically, using the equation I wrote above: that is
logposterior(y) = loglikelihood(x(y)) + sum(StatsFuns.normlogpdf, y)
The use of ForwardDiff above was just to check that the transformation I implemented was correct; and since I happened to have a positive Jacobian determinant, I did not bother to use logabsdet (although of course that is what is needed in principle). But, again, I never use ForwardDiff for the transformation itself; mine was just a quick test of correctness.
Regarding the cost of norminvcdf, please note that this is not really my issue, as the likelihood computation is often quite costly for me. My microbenchmarks show that norminvcdf takes about 4.4 ns on my laptop, compared to a typical cost of a few μs for the likelihood computation in the best-case scenario. Anyway, both the logistic and logit functions have typical execution times around 3.5 ns, so the gain is really minimal for me.
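For what it is worth, this is the kind of microbenchmark I ran (the numbers above are from my laptop):

using BenchmarkTools, StatsFuns

p = rand()
@btime StatsFuns.norminvcdf($p)
@btime StatsFuns.logistic($p)
@btime StatsFuns.logit($p)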
Rather, I am asking myself if the different transformation used might play a role in the poor sampling properties I am experiencing. I naively assumed that having a normally distributed transformed prior might be a good thing, but perhaps using the logistic transformation helps instead? I would be surprised, though, since they do not differ too much (their ratio is always between 1.5 and 2).
I will check what happens if I leave just one of the parameters uninitialized. Regarding the posterior, it is definitely proper: the prior is proper, the likelihood does not have singularities, and I am able to compute the evidence using some of the alternative techniques I mentioned. For the rest, as you can see from the pair plot above, there are two pretty strong degeneracies in the likelihood which might create issues with various samplers, but that is the only concerning thing I can see.
@astrozot, did you manage to resolve this issue?
Not really. I tried various things, including
- using TransformVariables
- using the logistic / logit functions directly instead
- starting with just one parameter off the mode
- changing the warmup stages in various ways
None of these techniques works. The logistic / logit functions are actually really similar to what I was using before (normal cdf), and in fact the Wikipedia page on logit highlights this similarity. If I fix all parameters to the mode, the library works beautifully; but of course in a more complicated scenario I would not know the mode. Also, if I shift just one parameter, I can either have convergence or not, depending on the parameter changed and on the amount of the shift.
I guess the problem I am trying to solve is not trivial and the density function has multiple peaks (multimodal), although the main one is much higher than the secondary ones (I know this from other samplers). Therefore, the starting point plays a big role for me, and in fact the warmup phase (and all the adapted parameters) can give completely different results. I have seen something similar with other samplers, including of course MH, but I was hoping the HMC could perform better.
My understanding now is that HMC is great if one starts close to the mode of the distribution (the exploration of the typical set is then very efficient), but much less so in general. It seems not to be very robust at finding the typical set or the mode of the distribution; or, at least, it seems less robust than ensemble samplers or parallel-tempering ones.
Note that, assuming I am not doing something wrong, my conclusions are not general at all and depend heavily on my particular log density. I tried unimodal "complicated" distributions such as a high-dimensional Rosenbrock one, and HMC works beautifully there.
Yes, HMC can still fail with multimodal distributions, especially if the modes have almost-zero posterior mass in between. I plan to code the WALNUTS algorithm one of these days, which is supposed to be better.
WALNUTS looks quite interesting; it would be really good to have it implemented. I am not sure how much it will help my case, though, since it seems to me that it does not address the warmup phase in detail.
Quick question: is there a way for the user of DynamicHMC to specify a fixed mass matrix? I suspect my problem is mostly there.
Yes, of course, see the help string of mcmc_with_warmup (I assume that you would want to warm up the stepsize though).
That said, have you tried a Symmetric mass matrix? The default is Diagonal. Sorry, I was kind of assuming you had tried it already and it did not occur to me to suggest it, but that would be the obvious low-hanging fruit. Again, see that docstring.
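Something along these lines (a sketch, please check the mcmc_with_warmup docstring for the exact keyword names; as I recall GaussianKineticEnergy takes the inverse mass matrix):

using DynamicHMC, LinearAlgebra, Random

# option 1: adapt a dense (Symmetric) metric instead of the default Diagonal one
results = mcmc_with_warmup(Random.default_rng(), ∇ℓ, 10_000;
                           warmup_stages = default_warmup_stages(; M = Symmetric))

# option 2: supply a fixed kinetic energy via `initialization`
# (Minv is a placeholder inverse mass matrix; note that with the default warmup
# stages the metric is still re-adapted, so the stages would need to be
# restricted to stepsize adaptation only)
Minv = Diagonal(fill(0.1, 6))
results = mcmc_with_warmup(Random.default_rng(), ∇ℓ, 10_000;
                           initialization = (κ = GaussianKineticEnergy(Minv),))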