
Implemented credibility_interval()

Open Stefan-Heimersheim opened this issue 2 years ago • 37 comments

Description

Added a credibility_interval method to MCMCSamples/NestedSamples which computes the credibility / confidence interval.

Fixes #178 and replaces #179

Notes:

  • Discretization error: using np.cumsum, the value of the CDF at the first data point equals its weight (> 0), while at the last data point it equals 1. This should be negligible for weights << 1, but for significant weights, i.e. small sample sizes, it can have an effect (see the sketch after this list).
  • Should we add an alias function confidence_interval?
  • I have also thrown in upper and lower limits, which are not iso-probability intervals but rely on the same code.
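A minimal illustration of the discretization point above (hypothetical numbers, not code from the PR):

import numpy as np

weights = np.full(5, 0.2)   # five equally weighted samples
cdf = np.cumsum(weights)
print(cdf)                  # [0.2 0.4 0.6 0.8 1. ] -- starts at w_1 > 0, ends at 1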

Checklist:

  • [x] I have performed a self-review of my own code
  • [x] My code is PEP8 compliant (flake8 anesthetic tests)
  • [x] My code contains compliant docstrings (pydocstyle --convention=numpy anesthetic)
  • [x] New and existing unit tests pass locally with my changes (python -m pytest)
  • [x] I have added tests that prove my fix is effective or that my feature works

Stefan-Heimersheim avatar Mar 22 '22 16:03 Stefan-Heimersheim

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (8726a90) to head (3e6bd4b).

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #188   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           36        36           
  Lines         3058      3127   +69     
=========================================
+ Hits          3058      3127   +69     


codecov[bot] avatar Jun 08 '22 17:06 codecov[bot]

The sorting

        order = np.argsort(samples)
        samples = samples[order]

breaks with merged data sets, as in #189

Edit: This does not seem to be an issue anymore,

import numpy as np
import anesthetic
samples = anesthetic.MCMCSamples(np.random.rand(1000,2),columns=["x","y"],index=np.random.randint(0,100,1000))
anesthetic.utils.credibility_interval(samples.x)

works.

Stefan-Heimersheim avatar Jun 10 '22 21:06 Stefan-Heimersheim

Re CI/lint flake8:

anesthetic/samples.py:11:1: F401 'scipy.interpolate.interp1d' imported but unused
anesthetic/samples.py:12:1: F401 'scipy.optimize.minimize_scalar' imported but unused

Not really sure what's causing flake8 to report an error here. Otherwise this should be good to go; the only remaining point is that the quantile treatment isn't merged between credibility_interval and quantile.

Stefan-Heimersheim avatar Jul 19 '22 15:07 Stefan-Heimersheim

My weighted ECDF was slightly different from yours (defined such that the minimum sample has CDF 0, the maximum sample has CDF 1, and the intermediate points take the symmetric average of the weights on either side).

I am a bit unsure about the new ECDF function:

CDF_illustration

Looking at, say, CDF(10)-CDF(3) in this example, there are 7 out of 10 samples in the interval [3,10], but the ECDF (blue line) gives a larger value -- is that a more correct approximation than the version I initially implemented (green/red lines)?

Stefan-Heimersheim avatar Jul 28 '22 21:07 Stefan-Heimersheim

Excellent plot!

I've done some investigating, and these really are second-order effects. What we should be most concerned about is whether cdf(b)-cdf(a) approximates the true probability mass between a and b.

import numpy as np
from scipy.interpolate import interp1d
from scipy.stats import norm
import matplotlib.pyplot as plt

# Generate a dataset
np.random.seed(3)
data = np.random.randn(30)
weights = np.random.rand(30)
i = np.argsort(data)
data = data[i]
weights = weights[i]/weights.sum()

# Symmetric empirical CDF
w_ = np.concatenate([[0.],weights[1:]+weights[:-1]]).cumsum()
w_/=w_[-1]
sym = interp1d(data, w_)

# Left hand empirical CDF
lh = interp1d(data, np.cumsum(weights))

# Right hand empirical CDF
rh = interp1d(data, 1-np.cumsum(weights[::-1])[::-1])

plt.plot(data,sym(data), label='Symmetric')
plt.plot(data,lh(data), label='Left Hand')
plt.plot(data,rh(data),  label='Right Hand')
plt.plot(data,norm.cdf(data),  label='True CDF')
plt.xlabel('x')
plt.ylabel('CDF')
plt.tight_layout()
plt.savefig('CDF.png')
plt.close()

CDF

So we see that all perform pretty equivalently relative to the correct answer.

Increasing the number of samples improves performance, but the three empirical curves lie on top of one another; their difference from the true answer is of lower order than their differences from each other.

CDF

Examining the histograms of the performance relative to the true answer shows them all to perform in an unbiased way:

W, SYM, LH, RH, T = [], [], [], [], []
for _ in range(10000):
    a, b = np.sort(np.random.uniform(data.min(),data.max(),size=(2)))
    w = weights[(data < b) & (data > a)].sum()
    W.append(w)
    T.append(norm.cdf(b)-norm.cdf(a))
    SYM.append(sym(b)-sym(a))
    LH.append(lh(b)-lh(a))
    RH.append(rh(b)-rh(a))

W, SYM, LH, RH, T = np.array([W, SYM, LH, RH, T])
        
plt.hist(W-T,bins=100)
plt.hist(SYM-T,bins=100)
plt.hist(LH-T,bins=100,alpha=0.8)
plt.hist(RH-T,bins=100,alpha=0.8)
plt.tight_layout()
plt.savefig('hist.png')

hist

Increasing to 1000 points just reduces the spread, but not the difference in the methods.

hist

Unless you can find a stress test which shows any of these to be better or worse than the others, I'd vote for the symmetric one.

williamjameshandley avatar Jul 29 '22 02:07 williamjameshandley

I am confused: clearly in my CDF plot the symmetric method always gives a steeper gradient, and thus should always give a larger CDF(b)-CDF(a) difference.

previous

And here is your plot again without subtracting the True value:

plt.figure(constrained_layout=True)
plt.hist(SYM,bins=100, histtype="step", label="SYM")
plt.hist(LH,bins=100, histtype="step", label="LH")
plt.hist(RH,bins=100, histtype="step", label="RH")
plt.hist(T,bins=100, histtype="step", label="T")
plt.xlabel("CDF(b)-CDF(a)")
plt.legend()
plt.savefig('hist.png')
plt.show()

hist

(I did use 10 data points with weights = np.ones(10); I didn't get to double check whether that causes any difference in any of the codes.) Here is the same plot including T:

hist

Does the symmetric one do systematically worse? Do both methods have a systematic bias? And what would we expect in theory? Intuitively the LH/RH versions feel like implementations of the basic CDF definition, but I am not sure how to reason about the symmetric formula. I'll try to generalize this benchmark and figure out why we don't see that in your histogram.

Edit: Full code

import numpy as np
from scipy.interpolate import interp1d
from scipy.stats import norm
import matplotlib.pyplot as plt

# Generate a dataset
np.random.seed(3)
data = np.random.randn(10)
weights = np.ones(10)
i = np.argsort(data)
data = data[i]
weights = weights[i]/weights.sum()

# Symmetric empirical CDF
w_ = np.concatenate([[0.],weights[1:]+weights[:-1]]).cumsum()
w_/=w_[-1]
sym = interp1d(data, w_)

# Left hand empirical CDF
lh = interp1d(data, np.cumsum(weights))

# Right hand empirical CDF
rh = interp1d(data, 1-np.cumsum(weights[::-1])[::-1])


W, SYM, LH, RH, T = [], [], [], [], []
for _ in range(10000):
    a, b = np.sort(np.random.uniform(data.min(),data.max(),size=(2)))
    w = weights[(data < b) & (data > a)].sum()
    W.append(w)
    T.append(norm.cdf(b)-norm.cdf(a))
    SYM.append(sym(b)-sym(a))
    LH.append(lh(b)-lh(a))
    RH.append(rh(b)-rh(a))

W, SYM, LH, RH, T = np.array([W, SYM, LH, RH, T])


plt.figure(constrained_layout=True)
plt.hist(SYM,bins=100, histtype="step", label="SYM")
plt.hist(LH,bins=100, histtype="step", label="LH")
plt.hist(RH,bins=100, histtype="step", label="RH")
plt.hist(T,bins=100, histtype="step", label="T")
plt.xlabel("CDF(b)-CDF(a)")
plt.legend()
plt.savefig('hist.png')
plt.show()

Stefan-Heimersheim avatar Aug 03 '22 06:08 Stefan-Heimersheim

I knew I had thought about this before -- in #58 I suggested the use of the weighted percentile method. It turns out there's no universally agreed approach, but the recommended symmetric (C=1/2) version seems to be the best.

The python one-liner is

wpct = interp1d(data, (weights.cumsum()-weights/2)/weights.sum())
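For reference, a self-contained version of this one-liner (illustrative data and weights, not from the PR):

import numpy as np
from scipy.interpolate import interp1d

rng = np.random.default_rng(0)
data = np.sort(rng.standard_normal(30))
weights = rng.random(30)

# C=1/2 weighted-percentile ECDF: knots at (cumsum(w) - w/2) / sum(w)
wpct = interp1d(data, (weights.cumsum() - weights/2) / weights.sum())
print(wpct(np.median(data)))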

@Stefan-Heimersheim could you add this option to your plots?

williamjameshandley avatar Aug 08 '22 20:08 williamjameshandley

Sure, here is the case of equal weights -- wpct is the average of LH and RH, i.e. the same curve shifted again:

Figure_1

And thus it gives identical results to RH & LH:

hist

But in the case of non-equal weights it looks different. I am not sure yet whether all the functions handle the weights correctly or whether the differences make sense -- I will double check this as soon as possible.

Figure_2

Edit: I see the linked Wikipedia article also shows that the CDFs are not parallel, so that makes sense.

Stefan-Heimersheim avatar Aug 09 '22 02:08 Stefan-Heimersheim

I'm just working on checking which method makes sense here. First a correction: the plot in my fastCI repo only looked interpolated because I only passed the data points as x values to plt.plot; the true sample-based CDF looks more like this (this is from a new branch I am working on).

Now to the comparison:

  • I expected the symmetric sym method to over-estimate the CDF based on the figure a few posts above. sym looks identical to the current anesthetic.utils implementation.
  • I expected the discrete method (image above with the orange step-function-y line) to underestimate the inverse-CDF, as it can always find an interval that just about encloses enough points.

Reminder: These are our competitors:

intuition

Edit: As remarked before, lh, rh, and wpct are equivalent in the case of uniform weights, as are sym and the anesthetic implementation (I think sym is the anesthetic implementation), so I didn't show all the equivalent variants for the uniform-weights plots (I did check it though).

Now here come the benchmarks, always using a Gaussian (that was easy) with loc=0, scale=1 (scaling doesn't matter). Runs with 10, 100, and 1000 data points for randomized and uniform weights:

CDF_comparison10False CDF_comparison100False CDF_comparison1000False CDF_comparison10True CDF_comparison100True CDF_comparison1000True

And here as a function of the number of data points (with a few more steps):

function_of_Ndata

(Note that the only random seed is at the start of the file; I do not know why the neighboring N_data points seem correlated.)

Clearly the symmetric / anesthetic version over-estimates the CDF as expected, but I can't see a trend in the discrete case.

Edit: Code for these tests here. My next step was going to be running a benchmark of the Credibility Interval calculations with all/some of these CDFs, in particular I expect to see the underestimation of the interval in the discrete CDF case.

Stefan-Heimersheim avatar Aug 19 '22 15:08 Stefan-Heimersheim

Very impressive! The first figure makes it pretty clear that the sym version is not suitable, and it does seem that it is the interpolation introducing the bias.

Is there a discrete version of sym/the current anesthetic implementation? (I have a sneaking suspicion that this would be identical to wpct, which would be the one I'm leaning toward at the moment).

I would not read too much into the 'correlation' unless it remains upon doubling the density of N_data points.

I expected the discrete method (image above with the orange step-function-y line) to underestimate the inverse-CDF, as it can always find an interval that just about encloses enough points.

This may average out however, since it will also find intervals which just about don't enclose enough points.

williamjameshandley avatar Aug 21 '22 08:08 williamjameshandley

I would not read too much into the 'correlation' unless it remains upon doubling the density of N_data points.

That's easy, here's the same with N_data *= 2 function_of_Ndata

I don't think I can do chi-by-eye here but it doesn't look too concerning.

Stefan-Heimersheim avatar Aug 22 '22 16:08 Stefan-Heimersheim

Here's the error on a) the left CI boundary, b) the right CI boundary, and c) the credibility interval size, always method value minus true theory value. Curiously, all methods (sym, lh, wpct, discrete) seem to underestimate the interval sizes, which is unexpected. N_data=100, randomized weights. Code here. Edit: 10000 realizations with random data (Gaussian) and random credibility level in [0,1].

bias

size

Stefan-Heimersheim avatar Aug 22 '22 17:08 Stefan-Heimersheim

Here are 20 example data realizations with CIs from the different methods (color code as here): example_intervals

I suspect the performance of the discrete method depends strongly on the chosen level. With 100 data points I suspect it will do well if 100*level is an integer versus if 100*level+0.5 is an integer, or vice versa, whereas the other methods should depend less on the precise value of level. I'll test that tomorrow.

Edit: Okay, one more: here the same but with N_data=10 and the data points added as scatter plots. Surprised by the discrete interval boundaries not matching points; I will investigate tomorrow.

example_intervals

Stefan-Heimersheim avatar Aug 22 '22 17:08 Stefan-Heimersheim

Here's my intuition and reasoning about the 4 CDF cases we are considering. Excuse the anthropomorphisms; I found describing this as a race to find the shortest interval easier to imagine. The 4 cases were

  • Discrete CDF / dcdf: Green lines here, step-function-y CDF as shown here
  • LH / RH: Green and red lines here
  • WPCT: Wikipedia weighted percentile with C=1/2 here
  • SYM / anest: Symmetrical CDF as implemented in anesthetic, dashed line here

Here is my intuition for the shape of these CDFs, always assuming the weights to be normalized; where noted, I assume the simplified case of equal, normalized weights w, and sometimes consider the specific example with N_data=10 and w=0.1. All CDFs go from CDF(-inf)=0 to CDF(inf)=1.

  • DCDF: Step function jumping by w_i at the ith data point. The highest slope is (w_i + w_{i+1})/distance between neighboring points, as you can include both of the jumps in the interval. Generally the minimal interval for any credibility will always "just about" include the edge data points. In the case of equal weights w=0.1, DCDF will always have an advantage of 0.1 in probability level over other methods such as LH, RH, and WPCT, except in the edge-including cases (see below).
  • LH/RH: Stepwise-linear function. The slope in any section is determined only by the weight of the data point right/left of that section. Slope12=w2/distance (LH) or Slope12=w1/distance (RH). (Slope12 is the section in between the 1st and 2nd data point).
  • WPCT: Stepwise-linear function. The slope in any section is given by the average of the weights of the data points left and right of that section [1]. Slope12 = (w1+w2)/2/distance.

Exception: LH, RH and WPCT also get jumps, but only at the edge points. LH (RH) gets a jump of w1 (wN) at the left (right) edge, and WPCT has jumps of w1/2 and wN/2 at the left and right edges respectively.

  • SYM: Same as WPCT, but without any jumps, and with all slopes scaled by a factor of 2/(w1+2w2+2w3+...+wN), i.e. by a factor of 10/9 in the w=0.1 case [2]. So it's like the non-jump part of WPCT but stretched out in the y-direction to reach 0 and 1, respectively.

I think in the case of uniform weights, DCDF should be the same as LH/RH/WPCT for the level rounded down to the nearest sum of weights (except for edge-including cases), e.g. for w=0.1, DCDF(0.38)=WPCT(0.3).

Footnotes:

[1] To see this, consider the expression weights.cumsum()-weights/2. The first point will be w1/2, the 2nd point will be w1+w2/2, and so on; the difference between points 1 and 2 is (w1/2+w2/2).

[2] w_ = np.concatenate([[0.],weights[1:]+weights[:-1]]).cumsum(); w_/=w_[-1] --> [0, w1+w2, w1+2*w2+w3, ..., c]/c, so the difference between point i and i+1 is w_i+w_{i+1}, like for wpct, but normalized differently by c=w1+2w2+2w3+...+wN<2, which is slightly smaller than 2, the normalization for wpct.
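A quick numeric check of footnotes [1] and [2] (illustrative, not part of the PR):

import numpy as np

w = np.random.rand(10)
w /= w.sum()

# Footnote [1]: consecutive differences of the wpct values are (w_i + w_{i+1})/2
wpct_vals = w.cumsum() - w/2
assert np.allclose(np.diff(wpct_vals), (w[:-1] + w[1:]) / 2)

# Footnote [2]: the sym normalization is c = w1 + 2*w2 + ... + 2*w_{N-1} + wN < 2
sym_vals = np.concatenate([[0.], w[1:] + w[:-1]]).cumsum()
assert np.isclose(sym_vals[-1], w[0] + 2*w[1:-1].sum() + w[-1])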

Examples:

Assume uniform w=0.1. For any value between level = 0.1 and 0.2, DCDF just selects the two closest points and thus finds the smallest possible distance (except for edge-including cases); DCDF can simply achieve the largest slope of all the CDFs. LH and WPCT have parallel CDFs here (no edges) and give us a result as expected. SYM behaves similarly to WPCT (except for edge-including cases) but has the scaling-factor advantage; I think that (except for edge-including cases) the SYM result is WPCT(level*9/10).

Figure_1

Now consider a case where the interval includes the edges: Figure_2

In this level=0.85 case, the CI includes an edge point. This means LH gets to include a 0.1 jump and produces a shorter interval than DCDF (going from point 1 to halfway between points 8 and 9, whereas DCDF has to go to point 9). The WPCT interval includes the edge as well, so it gets to include the 0.05 jump, and the interval from point 1 to point 9 provides the remaining 80%. SYM looks like a level-0.765 no-edge WPCT interval, as expected (factor 10/9 advantage).

Note that WPCT happens to give the same result as DCDF just by accident, because the "remaining 80%" is a multiple of 10%. The level=0.92 example shows WPCT not matching DCDF.

Figure_3

Well, I just realized that the minimization is failing in a couple of cases for DCDF. sop.minimize_scalar(distance, bounds=(0, 1-level), method="Bounded") as well as sop.minimize(distance, (1-level)/2, bounds=[(0,1-level)], method="Nelder-Mead") seem to fail in some (different) cases. It should be trivial to solve this, as we know the shape of the function and know where it should be flat, i.e. we can in theory manually check invCDF([0.05, 0.15, 0.25, etc.]) to find the minimum.
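For illustration, a hypothetical grid-scan alternative to the optimizer calls (not the PR's implementation; the wpct-based inverse CDF and Gaussian data are placeholder assumptions): evaluate invCDF on a grid of lower CDF values t in [0, 1-level] and keep the shortest interval [invCDF(t), invCDF(t+level)].

import numpy as np
from scipy.interpolate import interp1d

data = np.sort(np.random.randn(100))
weights = np.full(100, 1/100)
level = 0.95

# Inverse of the wpct ECDF, clamped to the data range outside its knots
cdf_vals = (weights.cumsum() - weights/2) / weights.sum()
invcdf = interp1d(cdf_vals, data, bounds_error=False, fill_value=(data[0], data[-1]))

t = np.linspace(0, 1 - level, 1001)      # candidate lower CDF values
widths = invcdf(t + level) - invcdf(t)   # interval width for each candidate
i = np.argmin(widths)
a, b = invcdf(t[i]), invcdf(t[i] + level)
print(a, b)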

Stefan-Heimersheim avatar Aug 23 '22 14:08 Stefan-Heimersheim

Fixed DCDF now, last 2 figures again: Figure_2 Figure_3

Stefan-Heimersheim avatar Aug 23 '22 15:08 Stefan-Heimersheim

Okay, so it looks like all methods underestimate the credible interval size for these Gaussian examples with uniform weights and N_data=10. The behavior for DCDF is as expected.

level-dependent_size_error level-dependent_boundary_error

And the same for randomized weights:

level-dependent_size_error_v2

Edit: In hindsight this could make sense. Looking at the Gaussian CDF, the curve is somewhat convex (concave) on the left (right) edge. Imagine you want to find the minimal interval fulfilling Delta Phi = 0.95. Then replacing the curve by a piecewise-linear interpolation "dents" the lines somewhat "inward" (closer to the "y-middle"), thus allowing you to find a shorter interval fulfilling Delta Phi = 0.95. I am not sure if this reasoning is correct -- this isn't a prediction but a way to explain what I see -- but it seems plausible?

Stefan-Heimersheim avatar Aug 23 '22 15:08 Stefan-Heimersheim

We just had a discussion with @williamjameshandley and @yallup about how we can get a better estimate of the CDF. The key is that the CDFs I have drawn above all assign a fixed step in the CDF to each point, most saliently visible in the DCDF function:

img

However, it would be great to quantify the uncertainty / distribution of possible CDFs. To realize this, we considered that any sample drawn from a PDF corresponds to a uniform sample from the CDF y-axis. Before explaining this, note that we assume uniform weights here.

  1. We think integer weights could be modeled as multiple points with uniform weights and
  2. We think arbitrary weights could be arbitrarily well approximated by many points with integer weights.

(1) is how e.g. Metropolis-Hastings MCMC works, and (2) seems like it would follow. I am not 100% sure whether many samples with smaller weights are equivalent to a few samples with larger weights, since getting multiple samples at exactly the same point usually does not happen, but we can deal with this later.


Drawing (uniformly weighted) x samples from a PDF is equivalent to drawing uniform samples s in [0,1] and calculating x=invCDF(s). (You can prove this simply with the probability "chain rule" for the PDF of a function f(x): PDF(f) = d/df CDF(f) = (dx/df) d/dx CDF(x) = PDF(x) / f'(x), then insert f(x)=CDF(x).)

truth

So imagine you try to approximate the function CDF(x) (or invCDF(s)): you have drawn N unknown samples of s (uniform and independent), and get out the N numbers x=f(s). The only information you have is that CDFs are monotonic. What we are currently doing is assuming s=linspace(0,1,N) and plotting the interpolation:

theory

But there is no reason to assume this, and in fact we know what s is -- s is just a set of N samples of Uniform[0,1] -- so we can sample from this. Notably, we are not generating samples and computing (a new) DCDF, but sampling from the space of functions CDF(x) (or invCDF(s)).

samples_0

Here are 1000 different iterations of this:

samples_1

We can plot a linear interpolation of these, although note again that linear interpolation does not give an unbiased representation.

samples_2

See for example how CDF(1.6) is no longer fixed to almost 1 but rather uncertain. In fact, we can draw the probability distribution for CDF(1.6): it is simply the distribution of the largest value of N samples of Uniform[0,1].
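A short sketch of this sampling picture (my own illustration, assuming uniform weights): the CDF values at the sorted data points are themselves a sorted draw of N uniforms, so each such draw gives one CDF realization.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.sort(np.random.randn(10))   # sorted data, uniform weights

for _ in range(100):
    s = np.sort(np.random.uniform(0, 1, len(x)))   # one CDF realization at the data points
    plt.plot(x, s, color="C0", alpha=0.1)

plt.xlabel("x")
plt.ylabel("CDF(x)")
plt.show()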


We considered the distribution of the "gaps" between the s samples; they should be Dirichlet distributed (N+1 gaps that sum to 1).

samples_gaps

To calculate the distribution of the CDF at the nth point (in the weighted case, with weights w_i), consider its value as the sum of the "gaps" of size g_i. Generalizing the "just repeat the sample w times" idea for integer weights, we suspect the CDF sample should be given by CDF(x_n) = sum(i=0..n) g_i * w_i. To be continued ...

Stefan-Heimersheim avatar Aug 26 '22 15:08 Stefan-Heimersheim

Many thanks for this summary @Stefan-Heimersheim. One thing which occurred to me over the weekend as a potential gotcha with the "just repeat the sample w times" approach is that this does depend on the normalisation, i.e. you would get different results with the weights [1,1,3,5,1,1] in comparison with [100,100,300,500,100,100], the latter having less noise. I spent a bit of time this weekend thinking about how this relates to order statistics, but have yet to fully get my head around how to relate weights (which represent the portion of probability mass associated with each sample) to CDF samples.

williamjameshandley avatar Aug 30 '22 11:08 williamjameshandley

One thing which occurred to me over the weekend as a potential gotcha with the "just repeat the sample w times" approach is that this does depend on the normalisation, i.e. you would get different results with the weights [1,1,3,5,1,1] in comparison with [100,100,300,500,100,100], the latter having less noise.

Yeah, this is what I was thinking of too when I mentioned my uncertainty here!

I am not 100% sure if many samples with smaller weights are equivalent to few samples with larger weights, since getting multiple samples at the exact point usually does not happen

So just to write down my intuitions for both cases (A,B) here:

A) Multiple samples != few weighted samples: Imagine a histogram of 1,000 vs 10,000 samples -- you can pretty clearly draw the PDF, CDF, and read off intervals. But also clearly the accuracy of these will scale with having more samples, so 1,000 samples with weight=10 are not the same as 10,000 uniformly weighted samples.

B) On the other hand,

  1. anesthetic's functions with the ncompress option resample uniformly weighted samples from an arbitrarily weighted data set, don't they? Although looking at the code, it returns at most len(samples) new samples, so indeed it does not artificially increase the number of samples? I'm not entirely sure how sample_compression_1d() works if ncompress=len(x) and the weights are strongly non-uniform; does this (utils.py:385) return the correct result?
        x_ = np.random.choice(x, size=ncompress, replace=False)

Edit: Actually the right function is line 60-ish, ignore this

  2. MH MCMC returns the same parameter point multiple times in ~70% of cases (if the acceptance rate is ~30%), and we are usually happy with these? Although we would say this gives a larger autocorrelation length and a lower effective number of samples? But doesn't using chains with or without "thinning" give you the same result?

Anyway, before solving this issue I wanted to focus on the path from this to getting credibility intervals. From the figure above you can -- at least empirically -- derive the probability distribution for CDF(x_i). In between there will necessarily be interpolation (with the convexity/concavity problems discussed above), because we have random samples at fixed positions x_i. Potentially you could look at it the other way -- invCDF(y) is sampled at random positions y but only gives fixed values x_i -- but this seems more difficult to treat.

repeat

To get a credibility interval we need CDF(b) - CDF(a). In the case of a and b aligning with sampling points, conditional on a given value CDF(a) we can get the distribution of CDF(b) from order statistics arguments (such as Will's Appendix A2, or potentially your Wikipedia link). We can (potentially analytically) marginalize over CDF(a) and get a distribution for CDF(b)-CDF(a), i.e. the size of that interval. More generally we could do this for arbitrary points using interpolation and marginalizing over the up to 4 relevant sampling positions.

Generally this will always give us a distribution & uncertainty on the volume of an interval ("[3,7] is a 95%+/-3% interval") rather than an uncertainty on the size ("[3+/-0.1, 7+/-0.2] is the 95% interval of this distribution"). In theory we should be able to translate between these two measures, the same way we could translate from a distribution on f given x to a distribution on x given f in fgivenx.

However, I don't think this is necessary. The goal of this code is mainly to be used for large N_samples, and all this optimization for ~10 samples is not too useful. If a user with 1,000 samples gets the output "[3,7] is the 95%+/-0.4% CI", that's good enough for most cases, and importantly, transparent and with quantified uncertainty.

So I could think of proposing the algorithm

  1. Find a guess for the desired interval (say level=95%) [a,b] using, say, wpct.
  2. Find the distribution of CDF(b) - CDF(a) based on analytical or empirical CDF distributions.
  3. Return [a,b] is a 95%+/-0.2% confidence interval, maybe with a warning if the uncertainty is >~ 0.1*(1-level).

This should be achievable, and I expect [a,b] obtained this way to be close enough (for large N, for practical purposes) to the optimal "[a+/-sigma_a, b+/-sigma_b] = true 95% CI" result.
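A rough sketch of step 2 (my own illustration, assuming uniform weights and taking a and b to be two of the data points): sample CDF realizations as described above and read off the spread of CDF(b)-CDF(a).

import numpy as np

np.random.seed(1)
x = np.sort(np.random.randn(100))
ia, ib = 2, 97                     # indices of the candidate interval [a, b] from step 1

masses = []
for _ in range(1000):
    s = np.sort(np.random.uniform(0, 1, len(x)))   # one CDF realization
    masses.append(s[ib] - s[ia])                   # probability mass assigned to [a, b]

# step 3: report "[a, b] is a (mean +/- std) credibility interval"
print(x[ia], x[ib], np.mean(masses), np.std(masses))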

Stefan-Heimersheim avatar Aug 30 '22 13:08 Stefan-Heimersheim

Generally this will always give us a distribution & uncertainty on the volume of an interval "[3,7] is a 95%+/-3% interval" rather than a uncertainty on the size "[3+/-0.1, 7+/-0.2] is the 95% interval of this distribution".

We should be able to get either version from a sampling distribution:

  1. Assume you know the step heights of the CDF (we have been doing this all along with equal steps)
  2. Compute the credibility bounds [a, b] for these step heights as before (e.g. using the current minimisation scheme)
  3. Repeat steps 1 and 2 over samples drawn from the distribution of CDF values above for nsamples iterations (in practice ~O(12) is usually enough to get a median and spread)
  4. You now have a set of [a, b] values for each of the samples from the CDF. These are samples drawn from the distribution, so taking a mean and std of these gives you the "[3+/-0.1, 7+/-0.2]" you're looking for.

You can of course reverse this process, changing [a, b] in the above for 'probability mass between fixed [a, b]' which gives you the first of your formulations (e.g. 95%+/-3%).

The whole premise of "thinking in samples" is that if you can do a forward analysis to get an output for a fixed set of inputs, then sampling those inputs and repeating the forward analysis to get a set of outputs gives you the sampling distribution of outputs. See slide 8, or 3.1 for more detail.
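A minimal sketch of the four steps above (my own illustration, assuming uniform weights and a step CDF jumping at the data points; not the anesthetic implementation):

import numpy as np

np.random.seed(2)
x = np.sort(np.random.randn(100))
level = 0.68
nsamples = 12

bounds = []
for _ in range(nsamples):
    s = np.sort(np.random.uniform(0, 1, len(x)))   # step 1: sampled CDF step heights
    best = None
    for i in range(len(x)):                        # step 2: shortest interval with mass >= level
        j = np.searchsorted(s, s[i] + level)
        if j < len(x) and (best is None or x[j] - x[i] < best[1] - best[0]):
            best = (x[i], x[j])
    bounds.append(best)

bounds = np.array(bounds)                          # step 3: nsamples realizations of [a, b]
print(bounds.mean(axis=0), bounds.std(axis=0))     # step 4: "[a +/- sigma_a, b +/- sigma_b]"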

williamjameshandley avatar Aug 30 '22 13:08 williamjameshandley

Repeat step 1 and 2 over samples drawn from the distribution of CDF values above for nsamples iterations (in practice ~O(12) is usually enough to get a median and spread)

I see what you mean -- so let's say I do this 12 times with a low-number-of-samples dataset, using CDFs sampled like above (think points like these):

image from before

Then I might get the results

[3,7] [3,7] [3,7] [1,7] [3,7] [5,9] [2,5], [3,7], [3,7], [3,7], [3,7], [3,7]

and compute [3+-..., 7+/-...].

Notably, I would not use linear interpolation as it is systematically biased (tailed distributions always curve the same way), so the distribution would be quite discrete, which might occasionally lead to many samples being equal / make it harder to compute mean and variance -- but that's fine until we run into div-by-0 / all-numbers-identical errors, and then we deal with it.

Secondly (and this is not a bug either but a feature), it might happen (for particularly flat distributions) that we get things like [3,7], [4,8], [5,9], which are all 68%+/-0.1%, still leading to a big uncertainty (4+/-1-ish) on the boundary. Which is true: the uncertainty on the HPD-interval boundary is large. But this would be an argument in favour of using errorbars on the interval size rather than the interval boundary.


Anyway, now to implementing something in practice ...

Stefan-Heimersheim avatar Oct 11 '22 11:10 Stefan-Heimersheim

Edit: Just discussed the issues in this comment with @yallup, see here, decided it does not work and a better way is using anesthetic's compression.


Let me also note down my thoughts on

One thing which occurred to me over the weekend as a potential gotcha with the "just repeat the sample w times" approach is that this does depend on the normalisation, i.e. you would get different results with the weights [1,1,3,5,1,1] in comparison with [100,100,300,500,100,100], the latter having less noise.

We should ask ourselves, what do we mean by integer weights? In a typical MCMC it means we have drawn multiple samples at this point. And it holds that the chance of drawing a sample at any point is proportional to the PDF at that point.

Those samples are not independent, so samples do not follow things like the distance-distribution P(distance between 2 random samples). P(2 identical random samples) == 0 but it does occur in MCMC. Is that a problem for our approach above?

previous image

Thinking empirically in terms of the usual picture, imagine two samples at the same point. The y-distance between the CDF samples is a finite (non-zero) value, following some kind of distribution given by order statistics. Thus if we ask for, say, the 1% confidence interval, it might sometimes give a zero-sized interval (e.g. [3,3]). Now the question is, is that correct? Let's say 5 times we get [3,3] and 2 times we get [3,4]. In the first 5 cases the code will return "[3,3] is the 0+/-0% credibility interval", which is correct, and the user will notice they might be asking for an interval too small given their samples. Now in the other two cases the code might return "[3,4] is the 5+/-4% credibility interval". Again, I think this would be correct and the user would notice what's up.

PS: In the latter [3,4] case, is the 5+/-4% computed with the "left" or the "right" sample of 3? Hmm, this is not clear, but clearly at most one version can be correct. This problem does not occur if we go with @williamjameshandley's approach for computing [3+/-..., 7+/-...]-style intervals. So that might be the better option.


Let's briefly think of weights [100,100,300,500,100,100]. This would lead to a bunch of "lines of green dots" (as pictured above) bunched up together, and would make the jumps in the CDF much more pronounced. If we were to think of corresponding PDFs (take some interpolation), those could look something like this (artist's impression) rather than a Gaussian:

screenshot_2022-10-11_13-07-20

I'm not sure what to make of this, given that MCMC could plausibly give you [100,100,300,500,100,100] for a Gaussian.


Let's think of how to incorporate fractional weights (which should also solve the issue above). @williamjameshandley mentioned the gaps-sampling idea a few comments above. The gaps here

gaps image

are sampled from some distribution (Dirichlet?), and we should be able to generalize this for non-uniform gaps. Important considerations are

  • Gap size distribution should somehow scale with weight(s) of sampling points, maybe even be proportional.
  • A gap is between two points, see all the discussion on left/right-sided CDF, wpct etc. we had above
  • The sum of gaps must equal 1, or rather CDF(-inf) must be 0 and CDF(inf) must be 1.
  • It's probably related to Will's order statistics things linked above

I'll give this a thought over lunch -- to be continued.

Stefan-Heimersheim avatar Oct 11 '22 12:10 Stefan-Heimersheim

Edit: Just discussed the issues in this comment with @yallup, see here, decided it does not work and a better way is using anesthetic's compression.


For uniform weights, the gap distribution seems to be indeed Dirichlet (in particular, with alpha_i=1). I think we also tested that when we chatted; here's a plot checking the marginal distribution (M=10 samples). The marginal distribution of one variable of the Dirichlet is a Beta distribution, so gap_i ~ sst.beta.pdf(gap, a=alpha_i, b=ngaps-alpha_i).

import numpy as np
import scipy.stats as sst
import matplotlib.pyplot as plt
N = 100000 #empirical test size
M = 10 #variables
data = np.random.uniform(0,1,size=(M, N))
sorted = np.sort(data, axis=0)
diffs = np.diff(sorted, axis=0)
gaps = np.array([sorted[0], *diffs, 1-sorted[-1]])
ngaps = M+1
plt.hist(gaps.T, bins=1000, histtype="step", density=True)
x = np.linspace(0,1,1000)
f = sst.beta.pdf(x, a=1, b=ngaps-1)
plt.plot(x,f, color="k", lw=2)
plt.show()

Figure_1

Just for reference, the sum of two gaps (i.e. gaps like this, where we treat weight=2 as 2 samples)

Figure_3

seems to be given by a Dirichlet distribution with alpha_i=weight_i=2.

plt.hist(gaps[6]+gaps[7], bins=1000, histtype="step", density=True)
f = sst.beta.pdf(x, a=2, b=ngaps-2)

Figure_2 which suggests that maybe we can do something along the lines of alpha = weights at some point.


But the fundamental question is still: what do we do with samples (say x=3) occurring twice in the data? Earlier I wrote that this is a problem only when asking for the confidence interval [3,...] or [...,3], but actually it also affects @williamjameshandley's approach. I don't think I solved this after all.

PS: In the latter 3,4 case is 5+/-4% computed with the "left" or "right" sample of 3? Hmm, this is not clear but clearly at most one version can be correct. This problem does not occur if we go with @williamjameshandley's approach for computing [3+-..., 7+/-...]-style intervals. So that might be the better option.

In that one, do you give the "jump" at 3 to both [3,...] and [...,3]? Does it bias the confidence interval estimates even after we do all this random sampling on the y-axis (making intervals smaller / credibilities larger)? What is the confidence interval here?

Figure_4

Stefan-Heimersheim avatar Oct 11 '22 15:10 Stefan-Heimersheim

Okay, I just had a chat with @yallup and we concluded that it seems very hard or impossible to work with repeated samples, and to that extent probably also with non-uniform weights.

Fundamentally, a "weights=100,200,500,200,100" case as mentioned before behaves differently in this CDF picture from a "weights=1,2,5,2,1" case, which it should not. In particular, the former will have much more certain, less random steps in the CDF, making the jumps much stronger. Here's an example for 10 uniform samples vs the same samples with weight 10 each:

Figure_A Figure_B


So since we fundamentally cannot deal with (integer) weights, or at least, get different results based on the normalization, we just limit ourselves to uniform weights. @yallup pointed out that we can just use anesthetic's compress function to get uniform-weight samples from any data set, so we can call this in credibility_interval.

I will implement this.
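As a rough stand-in for that compression step (pure numpy, not anesthetic's actual compress API): resample an approximately uniform-weight subset proportional to the weights, then feed it to the uniform-weight machinery discussed above.

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
weights = rng.random(1000)
weights /= weights.sum()

n_eff = int(1 / (weights**2).sum())   # Kish effective sample size
resampled = rng.choice(data, size=n_eff, replace=True, p=weights)
# `resampled` is approximately uniformly weighted and can be used with the CDF-sampling approach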

Stefan-Heimersheim avatar Oct 11 '22 16:10 Stefan-Heimersheim

Summary of thread above

This code

  • Computes sample-based confidence intervals quickly
  • Does not use any kernel density estimation
  • Estimates uncertainty by sampling empirical CDFs from the data

We discussed

  • Various credibility-interval definitions and implementations
  • Handling of weights: It's hard to handle non-binary weights so we use compress_weights() with ncompress=-1
  • Dirichlet-distribution of steps in empirical cdf

The uncertainty is given as a covariance matrix (so errorbars on the intervals could be computed with sqrt(cov[i,i]), but of course the correlation is strong). Note that the quoted uncertainty comes from multiple CDF-samples of the same data, and that the

  • variance between multiple CDF-samples from the same data
  • variance between CDFs computed from multiple data-samples of the same distribution

are not the same.

Figure_1
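For concreteness, reading errorbars off the covariance matrix mentioned above (placeholder numbers, not actual output of the method):

import numpy as np

cov = np.array([[0.010, 0.008],
                [0.008, 0.020]])      # covariance of the two interval bounds
errorbars = np.sqrt(np.diag(cov))     # sqrt(cov[i, i]) as noted above
print(errorbars)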

Checklist:

  • [x] I have performed a self-review of my own code
  • [x] My code is PEP8 compliant (flake8 anesthetic tests)
  • [x] My code contains compliant docstrings (pydocstyle --convention=numpy anesthetic)
  • [x] New and existing unit tests pass locally with my changes (python -m pytest)
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have appropriately incremented the semantic version number in both README.rst and anesthetic/_version.py

Not sure what's up with the version number; I'll merge master in again first.

Stefan-Heimersheim avatar Apr 10 '23 23:04 Stefan-Heimersheim

Thank you @lukashergt for the great feedback, I plan to implement & answer those points within the next two weeks!

Stefan-Heimersheim avatar Apr 12 '23 13:04 Stefan-Heimersheim

For code coverage to pass the various cases 'll', 'ul', 'et' need to be tested specifically.

Right! Added tests & gave everything better names ('hpd'->'iso-pdf', 'll'->'lower-limit', etc.)

Stefan-Heimersheim avatar Apr 26 '23 13:04 Stefan-Heimersheim

It might be nice to also add this as a method of our anesthetic.samples.Samples class (with automatic passing of weights etc.).

Done now (though all it does is pass weights automatically), let me know what you think of it!

Stefan-Heimersheim avatar Apr 26 '23 13:04 Stefan-Heimersheim

Bayesian bootstrap in particular (see e.g. the corresponding Wikipedia article). Is that essentially what we have implemented?

I'm not sure; the Wikipedia article describes doing this with respect to the weights of the samples, while for us it's CDF samples? I don't understand the bootstrap well enough to answer this.

Stefan-Heimersheim avatar May 15 '23 11:05 Stefan-Heimersheim

Thanks for all the suggestions @lukashergt! I've implemented everything now, except for the MultiIndex. I think that would have to be implemented in weighted_pandas similar to the other functions, but I'm really not sure how that works.

Stefan-Heimersheim avatar May 15 '23 12:05 Stefan-Heimersheim