dynesty icon indicating copy to clipboard operation
dynesty copied to clipboard

Allow variable mcmc length

Open ColmTalbot opened this issue 2 years ago • 9 comments

This PR adds a new option that allows the length of the MCMC exploration for the rwalk method to vary during the analysis such that a target number of proposed steps are accepted. There is also an option to disable the scale adaptation of the proposal.

  • The default behaviour matches the existing behaviour.
  • There is a new unit test with the variable MCMC length that passes well.

cc @joshspeagle @segasai

ColmTalbot avatar Apr 07 '22 00:04 ColmTalbot

Pull Request Test Coverage Report for Build 2105900692

  • 44 of 47 (93.62%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 90.088%

Changes Missing Coverage Covered Lines Changed/Added Lines %
py/dynesty/dynesty.py 17 20 85.0%
<!-- Total: 44 47
Totals Coverage Status
Change from base Build 2089676675: 0.04%
Covered Lines: 3690
Relevant Lines: 4096

💛 - Coveralls

coveralls avatar Apr 07 '22 02:04 coveralls

Thanks for the PR @ColmTalbot

I think for this to be merged I'd like to see.

  • an example where this new behaviour performs better than the existing behaviour with similar parameters.

I can imagine the new behaviour can be better

  • always (then we want to replace what we have by this new scaling)
  • never, then we don't want to do anything
  • for some distribitutions. Then it'd be good to have some idea what are those.

but at least we need one example so we can assess.

segasai avatar Apr 07 '22 17:04 segasai

I've been thinking about a good method to quantify this. In the end, I've been looking at the Euclidean distance travelled by the MCMC at each stage (in the unit hypercube space).

For a given likelihood threshold, there is a "correct" distribution of these distances. For simple cases, this can be probed by direct sampling from the constrained prior. I think this distribution is

p(\Delta ||u||) = \int_{{\cal L} > \lambda} du_{\rm init} du_{\rm final} \delta(\Delta ||u|| - ||u_{\rm init} - u_{\rm final}||)

This distribution will likely change slowly and smoothly as the threshold increases. If we average over many iterations we can therefore attempt to measure the average expected distance.

If the MCMC chains are too short, increasing the number of steps will increase the average distance. After a certain point, increasing the chain length will no longer increase the average distance.

Note that the distribution of distances won't necessarily converge to the "correct" distribution for some proposals.

Here are a few examples with the current MCMC rwalk method (volumetric proposal).

Here is a figure for a case where changing the MCMC length seems to outperform varying the scale.

volumetric_bimodal_15

The legend entries are:

  • 30: variable proposal scale and 30 steps per chain
  • 100: variable proposal scale and 100 steps per chain
  • 300: variable proposal scale and 300 steps per chain
  • Walks + Scale: variable proposal scale and MCMC length targeting 30 accepted steps
  • Walks: variable MCMC length and fixed proposal scale targeting 30 accepted steps
  • Perfect: directly sampling from the constrained prior

With the scale not changing there are much more large distances. These correspond to iterations where the MCMC jumps between the modes.

The MCMC length increases a lot here in order to keep accepting points, but it manages to mix the different modes much more efficiently.

This is the model

def bimodal_normal(p):
    """
    Bimodal normal distribution

    sigma = 1
    mu = +/-5
    """
    return np.logaddexp(-np.sum((p - 5)**2) / 2, -np.sum((p + 5)**2) / 2)

Here's the same plot for the Rosenbrock model. The variable MCMC length isn't doing better here than the longest fixed MCMC length method, but it is much more efficient (fewer calls per iteration.) In this case, none of these versions are getting close to the expected distribution, but we have other indications that this isn't a great proposal for this model.

volumetric_rosenbrock_2

ColmTalbot avatar Apr 20 '22 16:04 ColmTalbot

Thank you for your examples, but i don't think I agree with your interpretations.

  1. I am not sure that average jumping length is the best statistic to look at. I.e. the relevant statistic is the number of function calls/per independent sample. That's what we actually care for (aside from correctness of the distribution). However even if we look at the jumping length the first plot shows that the pink curve is significantly 'better' in terms of jumping length (comparing to orange), but at the cost of many more function evaluations. (if I correctly read the plot).
  2. I am a bit wary of using 1d example for anything -- I think Rosenbrock is better since it's at least 2d.
  3. For the second example orange and pink (which have similar number of function evaluations, but seem to provide very similar jumping length.

In fact thinking more about it, I am not sure that your proposal is a good idea in the current form (or maybe I don't understand it)

Right now we scale the proposal ellipsoid so that typically (averaged across the whole posterior) the proposal ellipsoid would overlap with 50% of the L>L* area. This takes care of the situation when say the L>L* area is very curved and a single ellipsoid is a bad approximation to it. Imagine a thin spherical gaussian shel radius 1 with one single ellipsoidal bound with radius 1. In this case the proposal based on that ellipsoid will have a horrid acceptance ratio, so the existing code will scale it down till the ellipsoid is about as thick as a shell and then the sampling will proceed (yes, because the proposal will be small, you'll need a lot of walks to have proper samples in this case). But in your proposal when you only adjust the walks parameter, without adjusting the scale, that's going to be very inefficient because you'll be samplign from the large ellipsoid and therefore the number of 'walks' will grow to a very large number.

I think what one would want to do in practice is to keep adjusting the scale in the same way we do it now, but adjust the walks parameter if the samples are too correlated. I.e. if the typical correlation between proposed points and existing points is too large then we slowly adjust the walks parameter. But that needs to be decided based on correlation I think, not acceptance fraction that your current code is doing.


Maybe the first step should be is to actually compute the auto-correlation time like here #163 and then we can actually characterize how correlated our samples are, and potentially introduce the adjustments of the walks.

segasai avatar Apr 20 '22 20:04 segasai

To clarify, the first example above is a 15-dimensional bimodal Gaussian (I forgot to say that above.)

ColmTalbot avatar Apr 20 '22 20:04 ColmTalbot

To clarify, the first example above is a 15-dimensional bimodal Gaussian (I forgot to say that above.)

I see, that makes more sense. Thanks. Was it with a single ellipsoidal bound ?

segasai avatar Apr 20 '22 22:04 segasai

To address the first point. If we don't match the correct distribution, we don't have an independent sample. That's the definition of the "correct" distribution. I think the question we normally ask is "how correlated can the new sample be with the old sample and not be visibly biased", which can be answered by running a bunch of simulations for the problem at hand, but I don't know of a general rule.

I agree that calculating the ACL would be ideal, but in practice, calculating the ACL requires walking for many ACLs, which in turn makes the exploration very inefficient if we only want to advance by a single ACL^.

In the absence of that, we can use this method to assess whether the number of steps is producing an independent sample. If the distribution of distances will match the "correct" distribution, we have an independent sample. We can then compute the minimum chain length required to match that distribution. In the absence of knowing the "correct" distribution, we can assess how many likelihood evaluations per chain are needed to converge when increasing the length of the chain. Other tests, e.g., pp-tests, can be used to assess whether the sampling is unbiased, but at least we know that the issue isn't that the chain length is too short.

Here's a figure demonstrating the complementarity of the two approaches. Here I'm just MCMC sampling from a flat disc with a Gaussian proposal. I then plot the distribution of distances for a range of lags (a proxy for running many iterations of the rwalk with walks=lag) along with for samples drawn analytically from the disc.

I also calculated the ACL of the chain using the method you mentioned. The linestyle indicates whether the lag is more or less than the ACL. Pretty much perfectly, for lags > ACL the distribution matches the expected one. For lags < ACL the distances are too short, but on average they are further when the lag increases.

image


^ One alternative is to run the MCMC for many autocorrelation times to get a stable estimate and then return many points from the chain to get multiple updates (as is currently done when using multi-processing). This leads to pretty wildly varying autocorrelation lengths, which is less than ideal, but re-running on the Rosenbrock example it looks much better.

volumetric_rosenbrock_2

The version where the scale and the MCMC length are both varying is super inefficient, here's the same thing without that configuration. The purple trace (fixed scale, variable MCMC) length uses a comparable number of likelihood calls per sample (and similar total run time) as the green (fixed chain length and variable scale) that does a worse job.

volumetric_rosenbrock_2

Personally, I'd like to at least be able to disable the scale adaptation as the non-varying scale performs better for the Rosenbrock function and volumetric proposal.

Implementing the on-the-fly act estimation is only really viable if we are happy to accept multiple points from each chain.

I still think that the number of accepted points is a decent proxy for the number of autocorrelation lengths traversed, but I don't have a great argument for that at the moment.

ColmTalbot avatar Apr 21 '22 02:04 ColmTalbot

To address the first point. If we don't match the correct distribution, we don't have an independent sample. That's the definition of the "correct" distribution. I think the question we normally ask is "how correlated can the new sample be with the old sample and not be visibly biased", which can be answered by running a bunch of simulations for the problem at hand, but I don't know of a general rule.

Just a quick answer on that. What I meant by 'correct' vs independent is that if the sampling scheme is violating detailed balance and/or doesnt' have the right invariant distribution I call it incorrect and the whole thing is broken. The indepedence of samples is indeed something that is never achieved exactly and will lead to biases, but the hope that they are indepedent "enough" or the most independent we can afford for a given CPU-time budget and the biases are small enough.

segasai avatar Apr 21 '22 02:04 segasai

Regarding the ACF, I haven't used it much, so I am not too familiar with its limitations, so you may be right there. My thinking was that we don't need a very accurate measure of it, we just need to detect cases where the samples are too correlated and we need to 'walk' more. So I thought that there should be enough information on that between each update_rwalk() call. But again, maybe it is unfeasible.

Regarding the distance distribution, I can see your point now better, that difference between distribtuion of distances and the "true" distribution of distances may be an easy to compute metric to quantify the independence of samples.

Given that at each point of the sampling we have N-points that are presumably samples from L>L_* We can assume that the pairwise distances should be a fair sample of the "true" distribution, so maybe we should use that as an indication that we need to have more walks. I.e. if the median separation of the live-points is x, we may want to increase walks till the typical separation is some fraction of x ?

I've found this case of the high-dim shell quite instructive --

import dynesty
import numpy as np
import scipy.stats

nlive = 1000
s0 = 0.01

ndim = 10


def loglike(X):
    return -100 * np.log(1 + np.abs((np.sqrt((X**2).sum()) - 1) / s0)**2)


def prior_transform(x):
    return x * 10 - 5

sampler = dynesty.NestedSampler(loglike,
                                    prior_transform,
                                    ndim,
                                    nlive=nlive,
                                    bound='single',
                                    rstate=rstate,
                                    sample='rwalk',
                                    walks=1)
    sampler.run_nested(print_progress=True)

with this behaviour of the scale parameter image

which clearly leads to a very small step size. here's the x_initial vs x_after_the_walk image

and here is the walking distance as a function of iteration. image while for a 10d sphere with a radius of 0.1 the distance between two random points should be ~ 0.15.

So I think this is the perfect test case to remedy. We would want I think, keep the same 'scale' as we have now, but we would want the walks to increase in the end of the iterations. I haven't tried with your patch what actually happens.

segasai avatar Apr 21 '22 13:04 segasai