Implemented credibility_interval()
Description
Added a `credibility_interval` method to `MCMCSamples`/`NestedSamples` which computes the credibility / confidence interval. Fixes #178 and replaces #179.
Notes:
- Discretization error: Using `np.cumsum`, the value of the CDF at the first data point equals its weight (> 0), whereas at the last data point it equals 1. This should be negligible for weights << 1, but for significant weights, i.e. small sample sizes, it can have an effect (see the short illustration after this list).
- Should we add an alias function `confidence_interval`?
- I have also thrown in upper and lower limits, which are not iso-probability intervals but rely on the same code.
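A standalone illustration of the discretization point above (my addition, not the anesthetic code): with a cumsum-based ECDF the very first sample already carries its full weight, and the last one closes at exactly 1.

```python
import numpy as np

# 5 equally weighted samples: each weight is significant (0.2)
weights = np.full(5, 0.2)
cdf = np.cumsum(weights)
print(cdf[0], cdf[-1])  # 0.2 and 1.0: the first point already sits at its weight, not at 0
```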
Checklist:
- [x] I have performed a self-review of my own code
- [x] My code is PEP8 compliant (`flake8 anesthetic tests`)
- [x] My code contains compliant docstrings (`pydocstyle --convention=numpy anesthetic`)
- [x] New and existing unit tests pass locally with my changes (`python -m pytest`)
- [x] I have added tests that prove my fix is effective or that my feature works
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 100.00%. Comparing base (8726a90) to head (3e6bd4b).
Additional details and impacted files
@@ Coverage Diff @@
## master #188 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 36 36
Lines 3058 3127 +69
=========================================
+ Hits 3058 3127 +69
The sorting
order = np.argsort(samples)
samples = samples[order]
breaks with merged data sets, as in #189
Edit: This does not seem to be an issue anymore,
samples = anesthetic.MCMCSamples(np.random.rand(1000,2),columns=["x","y"],index=np.random.randint(0,100,1000))
anesthetic.utils.credibility_interval(samples.x)
works.
Re CI/lint flake8:
anesthetic/samples.py:11:1: F401 'scipy.interpolate.interp1d' imported but unused
anesthetic/samples.py:12:1: F401 'scipy.optimize.minimize_scalar' imported but unused
Not really sure what's causing flake8 to report an error here. Otherwise this should be good to go; only the quantile treatment isn't merged between `credibility_interval` and `quantile`.
My weighted ECDF was slightly different from yours (defined by min sample has CDF 0, max sample has CDF 1, and the intermediate points have a symmetric average of the weights either side).
I am a bit unsure about the new ECDF function: looking at, say, CDF(10)-CDF(3) in this example, there are 7 out of 10 samples in the interval [3,10], but the ECDF (blue line) gives a larger value -- is that a more correct approximation than the version I initially implemented (green/red lines)?
Excellent plot!
I've done some investigating, and these really are second order effects. What we should be most concerned about is whether cdf(b)-cdf(a) approximates the true probability mass between a and b.
import numpy as np
from scipy.interpolate import interp1d
from scipy.stats import norm
import matplotlib.pyplot as plt
# Generate a dataset
np.random.seed(3)
data = np.random.randn(30)
weights = np.random.rand(30)
i = np.argsort(data)
data = data[i]
weights = weights[i]/weights.sum()
# Symmetric empirical CDF
w_ = np.concatenate([[0.],weights[1:]+weights[:-1]]).cumsum()
w_/=w_[-1]
sym = interp1d(data, w_)
# Left hand empirical CDF
lh = interp1d(data, np.cumsum(weights))
# Right hand empirical CDF
rh = interp1d(data, 1-np.cumsum(weights[::-1])[::-1])
plt.plot(data, sym(data), label='Symmetric')
plt.plot(data, lh(data), label='Left Hand')
plt.plot(data, rh(data), label='Right Hand')
plt.plot(data, norm.cdf(data), label='True CDF')
plt.xlabel('x')
plt.ylabel('CDF')
plt.legend()
plt.tight_layout()
plt.savefig('CDF.png')
plt.close()
So we see that all perform pretty equivalently relative to the correct answer.
Increasing the number of samples improves performance, but the three empirical curves lie on top of one another, differing from the true answer at a lower order than they differ from each other.
Examining the histograms of the performance relative to the true answer shows them to all perform in an unbiased way
W, SYM, LH, RH, T = [], [], [], [], []
for _ in range(10000):
    a, b = np.sort(np.random.uniform(data.min(), data.max(), size=2))
    w = weights[(data < b) & (data > a)].sum()
    W.append(w)
    T.append(norm.cdf(b) - norm.cdf(a))
    SYM.append(sym(b) - sym(a))
    LH.append(lh(b) - lh(a))
    RH.append(rh(b) - rh(a))
W, SYM, LH, RH, T = np.array([W, SYM, LH, RH, T])
plt.hist(W-T,bins=100)
plt.hist(SYM-T,bins=100)
plt.hist(LH-T,bins=100,alpha=0.8)
plt.hist(RH-T,bins=100,alpha=0.8)
plt.tight_layout()
plt.savefig('hist.png')
Increasing to 1000 points just reduces the spread, but not the difference in the methods.
Unless you can find a stress test which shows any of these to be better or worse than the others, I'd vote for the symmetric one.
I am confused: clearly in my CDF plot the symmetric method always gives a steeper gradient, and thus should always give a larger CDF(b)-CDF(a) difference.
And here is your plot again without subtracting the True value:
plt.figure(constrained_layout=True)
plt.hist(SYM,bins=100, histtype="step", label="SYM")
plt.hist(LH,bins=100, histtype="step", label="LH")
plt.hist(RH,bins=100, histtype="step", label="RH")
plt.hist(T,bins=100, histtype="step", label="T")
plt.xlabel("CDF(b)-CDF(a)")
plt.legend()
plt.savefig('hist.png')
plt.show()
(I did use 10 data points with `weights = np.ones(10)`; I didn't get to double check if that is causing any difference in any of the codes.) Here is the same plot including T:
Does the symmetric one do systematically worse? Do both methods have a systematic bias? And what would we expect in theory? Intuitively the LH/RH versions feel like implementations of the basic CDF definition, but I am not sure how to reason about the symmetric formula. I'll try to generalize this benchmark and figure out why we don't see that in your histogram.
Edit: Full code
import numpy as np
from scipy.interpolate import interp1d
from scipy.stats import norm
import matplotlib.pyplot as plt
# Generate a dataset
np.random.seed(3)
data = np.random.randn(10)
weights = np.ones(10)
i = np.argsort(data)
data = data[i]
weights = weights[i]/weights.sum()
# Symmetric empirical CDF
w_ = np.concatenate([[0.],weights[1:]+weights[:-1]]).cumsum()
w_/=w_[-1]
sym = interp1d(data, w_)
# Left hand empirical CDF
lh = interp1d(data, np.cumsum(weights))
# Right hand empirical CDF
rh = interp1d(data, 1-np.cumsum(weights[::-1])[::-1])
W, SYM, LH, RH, T = [], [], [], [], []
for _ in range(10000):
    a, b = np.sort(np.random.uniform(data.min(), data.max(), size=2))
    w = weights[(data < b) & (data > a)].sum()
    W.append(w)
    T.append(norm.cdf(b) - norm.cdf(a))
    SYM.append(sym(b) - sym(a))
    LH.append(lh(b) - lh(a))
    RH.append(rh(b) - rh(a))
W, SYM, LH, RH, T = np.array([W, SYM, LH, RH, T])
plt.figure(constrained_layout=True)
plt.hist(SYM,bins=100, histtype="step", label="SYM")
plt.hist(LH,bins=100, histtype="step", label="LH")
plt.hist(RH,bins=100, histtype="step", label="RH")
plt.hist(T,bins=100, histtype="step", label="T")
plt.xlabel("CDF(b)-CDF(a)")
plt.legend()
plt.savefig('hist.png')
plt.show()
I knew I had thought about this before -- In #58, I suggest the use of the weighted percentile method. Turns out there's no universally agreed approach, but the recommended symmetric (C=1/2) version seems to be the best.
The python one-liner is
wpct = interp1d(data, (weights.cumsum()-weights/2)/weights.sum())
@Stefan-Heimersheim could you add this option to your plots?
Sure, here is the case of equal weights -- `wpct` is the average between LH and RH, i.e. the same shifted again:
And thus gives identical results to RH & LH:
But in the case of non-equal weights it looks different, I am not sure yet if all the functions handled the weights correctly or if the differences make sense -- I will double check this as soon as possible.
Edit: I see the linked Wikipedia article also shows that the CDFs are not parallel so that makes sense
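A quick self-contained check of the equal-weights relation noted above (my addition, using the same `interp1d` constructions as the earlier snippets): `wpct` is the pointwise average of the left- and right-hand ECDFs.

```python
import numpy as np
from scipy.interpolate import interp1d

data = np.sort(np.random.randn(10))
weights = np.full(10, 0.1)                                   # equal, normalised weights
lh = interp1d(data, np.cumsum(weights))                      # left-hand ECDF
rh = interp1d(data, 1 - np.cumsum(weights[::-1])[::-1])      # right-hand ECDF
wpct = interp1d(data, (weights.cumsum() - weights/2) / weights.sum())
xs = np.linspace(data.min(), data.max(), 50)
print(np.allclose(wpct(xs), 0.5 * (lh(xs) + rh(xs))))        # True
```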
I'm just working on checking which method makes sense here. First a correction: the plot in my fastCI repo only looked interpolated because I only passed the data points as x values to `plt.plot`; the true sample-based CDF looks more like this:
(This is from a new branch I am working on.)
Now to the comparison:
- I expected the symmetric `sym` method to over-estimate the CDF based on the figure a few posts above. `sym` looks identical to the current `anesthetic.utils` implementation.
- I expected the discrete method (image above with the orange step-function-y line) to underestimate the inverse-CDF, as it can always find an interval that just about encloses enough points.
Reminder: These are our competitors:
Edit: As remarked before, `lh`, `rh`, and `wpct` are equivalent in the case of uniform weights, as are `sym` and the anesthetic implementation (I think `sym` is the anesthetic implementation), so I didn't show all the equivalent variants for the uniform-weights plots (I did check it though).
Now here come the benchmarks, always using a Gaussian (was easy) with `loc=0, scale=1` (scaling doesn't matter). Runs with 10, 100, and 1000 data points for randomized and uniform weights.
And here as a function of number of data points (with a few more steps):
(Note that the only random seed is at the start of the file, I do not know why the neighboring N_data points seem correlated.)
Clearly the symmetric / anesthetic version over-estimates the CDF as expected, but I can't see a trend in the discrete case.
Edit: Code for these tests here. My next step was going to be running a benchmark of the Credibility Interval calculations with all/some of these CDFs, in particular I expect to see the underestimation of the interval in the discrete CDF case.
Very impressive! The first figure makes it pretty clear that the `sym` version is not suitable, and it does seem that it is the interpolation introducing the bias.
Is there a discrete version of `sym`/the current anesthetic implementation? (I have a sneaking suspicion that this would be identical to `wpct`, which would be the one I'm leaning toward at the moment.)
I would not read too much into the 'correlation' unless it remains upon doubling the density of `N_data` points.
I expected the discrete method (image above with the orange step-function-y line) to underestimate the inverse-CDF, as it can always find an interval that just about encloses enough points.
This may average out however, since it will also find intervals which just about don't enclose enough points.
I would not read too much into the 'correlation' unless it remains upon doubling the density of N_data points.
That's easy, here's the same with N_data *= 2
I don't think I can do chi-by-eye here but it doesn't look too concerning.
Here's the error on a) the left CI boundary, b) the right CI boundary, and c) the credibility interval size, always method value minus true theory value. Curiously, all methods (sym, lh, wpct, discrete) seem to underestimate the interval sizes, which is unexpected. `N_data=100`, randomized weights. Code here.
Edit: 10000 realizations with random data (Gaussian) and random credibility level [0,1].
Here are 20 example data realizations with CIs from the different methods (color code as here):
I suspect the performance of the discrete method depends strongly on the chosen `level`. With 100 data points I suspect it will do well if `100*level` is an integer vs if `100*level+0.5` is an integer, or vice versa, whereas the other methods should depend less on the precise value of `level`. I'll test that tomorrow.
Edit: Okay, one more: here is the same but with `N_data=10` and the data points added as scatter plots. Surprised by the discrete interval boundaries not matching points; will investigate tomorrow.
Here's my intuition and reasoning about the 4 CDF cases we are considering. Excuse the anthropomorphisms; I found it easier to imagine this as a race to find the shortest interval. The 4 cases were
- Discrete CDF / `dcdf`: green lines here, step-function-y CDF as shown here
- LH / RH: green and red lines here
- WPCT: Wikipedia weighted percentile with C=1/2, here
- SYM / `anest`: symmetric CDF as implemented in anesthetic, dashed line here
Here is my intuition for the shape of these CDFs, always assuming weights to be normalized. Where noted, I assume the simplified case of equal and normalized weights `w`, and sometimes consider the specific example with `N_data=10` and `w=0.1`. All CDFs go from CDF(-inf)=0 to CDF(inf)=1.
- DCDF: Step function jumping by `w_i` at the i-th data point. The highest slope is `(w_i + w_{i+1})/distance` between neighboring points, as you can include both of the jumps in the interval. Generally the minimal interval for any credibility will always "just about" include the edge data points. In the case of equal weights `w=0.1`, DCDF will always have an advantage of 0.1 in probability level over other methods such as LH, RH, and WPCT, except in special cases (see below).
- LH/RH: Piecewise-linear function. The slope in any section is determined only by the weight of the data point to the right/left of that section: `Slope12 = w2/distance` (LH) or `Slope12 = w1/distance` (RH). (`Slope12` is the slope of the section between the 1st and 2nd data points.)
- WPCT: Piecewise-linear function. The slope in any section is given by the average of the weights of the data points to the left and right of that section [1]: `Slope12 = (w1+w2)/2/distance`.
Exception: LH, RH and WPCT also get jumps, but only at the edge points. LH (RH) gets a jump of `w1` (`wN`) at the left (right) edge, and WPCT has jumps of `w1/2` and `wN/2` at the left and right edges respectively.
- SYM: Same as WPCT, but without any jumps, and with all slopes scaled by a factor of `2/(w1 + 2*w2 + 2*w3 + ... + 2*w_{N-1} + wN)`, i.e. by a factor of 10/9 in the `w=0.1` case [2]. So it's like the non-jump part of WPCT but stretched out in the y-direction to reach 0 and 1, respectively.
I think in the case of uniform weights, DCDF should be the same as LH/RH/WPCT for `level` rounded down to the nearest sum of weights (except for edge-including cases). E.g. for `w=0.1`, DCDF(0.38) = WPCT(0.3).
Footnotes:
[1] To see this, consider the equation `weights.cumsum() - weights/2`: the first point will be w1/2, the 2nd point will be w1 + w2/2, and so on, so the difference between points 1 and 2 is (w1/2 + w2/2).
[2] `w_ = np.concatenate([[0.],weights[1:]+weights[:-1]]).cumsum(); w_/=w_[-1]` --> [0, w1+w2, w1+2*w2+w3, ..., c]/c, so the difference between points i and i+1 is w_i + w_{i+1} like for wpct, but normalized differently by c = w1 + 2*w2 + 2*w3 + ... + wN < 2, slightly smaller than 2, the normalization for wpct.
Examples:
Assume uniform `w=0.1`. For any value between `level = 0.1` and `0.2`, DCDF just selects the two closest points and thus finds the smallest possible distance (except for edge-including cases): DCDF can just achieve the largest slope of all the CDFs.
LH and WPCT have parallel CDFs here (no edges) and give us a result as expected.
SYM behaves similarly to WPCT (except for edge-including cases) but has the scaling-factor advantage; I think that (again except for edge-including cases) the SYM result is WPCT(level*9/10).
Now consider a case where the interval includes the edges:
In this level=0.85 case, the CI includes an edge point. This means LH gets to include a 0.1 jump and produces a shorter interval than DCDF (going from point 1 to halfway between points 8 and 9, whereas DCDF has to go to point 9). The WPCT interval includes the edge as well, so it gets to include the 0.05 jump, and the interval from point 1 to point 9 provides the remaining 80%. SYM looks like a level-0.765 no-edge WPCT interval, as expected (factor 10/9 advantage).
Note that WPCT happens to give the same result as DCDF just by accident, because the "remaining 80%" is a multiple of 10%. The level=0.92 example shows WPCT not matching DCDF.
Well, I just realized that the minimization is failing in a couple of cases for DCDF: `sop.minimize_scalar(distance, bounds=(0, 1-level), method="Bounded")` as well as `sop.minimize(distance, (1-level)/2, bounds=[(0,1-level)], method="Nelder-Mead")` seem to fail in some (different) cases. It should be trivial to solve this, as we know the shape of the function and know where it should be flat, i.e. we can in theory manually check invCDF([0.05, 0.15, 0.25, etc.]) to find the minimum.
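A sketch of that grid-based fallback (my own illustration, assuming uniform weights `w` and an inverse-CDF interpolator `invcdf` defined on [0, 1]; not the anesthetic implementation): evaluate the interval width only at the offsets where it can change, plus the boundaries, instead of relying on a generic minimiser.

```python
import numpy as np

def grid_shortest_interval(invcdf, level, w=0.1):
    # candidate lower CDF values: 0, w/2, 3w/2, ..., and 1-level
    offsets = np.concatenate([[0.], np.arange(w/2, 1 - level, w), [1 - level]])
    widths = invcdf(offsets + level) - invcdf(offsets)   # interval width at each candidate
    i = np.argmin(widths)
    return invcdf(offsets[i]), invcdf(offsets[i] + level)
```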
Fixed DCDF now, last 2 figures again:
Okay, so it looks like all methods underestimate the credible interval size for these Gaussian examples with uniform weights and `N_data=10`. The behavior for DCDF is as expected.
And same for randomized weights:
Edit: In hindsight this could make sense. Looking at the Gaussian CDF, the curve is somewhat concave (convex) on the left (right) edge. Imagine you want to find the minimal interval fulfilling Delta Phi = 0.95: replacing the curve by a piecewise-linear interpolation "dents" the lines somewhat "inward" (closer to the "y-middle"), thus allowing you to find a shorter interval fulfilling Delta Phi = 0.95. I am not sure if this reasoning is correct -- it isn't a prediction but a way to explain what I see -- but it seems plausible.
We just had a discussion with @williamjameshandley and @yallup about how we can get a better estimate of the CDF. The key point is that the CDFs I drew above all assign a fixed step in the CDF to each point, most saliently visible in the DCDF function:
However it would be great to quantify the uncertainty / distribution of possible CDFs. To realize this, we considered that any sample drawn from a PDF corresponds to a uniform sample from the CDF y-axis. Before explaining this, note that we assume uniform weights here:
1. We think integer weights could be modeled as multiple points with uniform weights, and
2. we think arbitrary weights could be arbitrarily well approximated by many points with integer weights.
(1) is how e.g. Metropolis-Hastings MCMC works, and (2) seems like it would follow. I am not 100% sure if many samples with smaller weights are equivalent to few samples with larger weights, since getting multiple samples at the exact same point usually does not happen, but we can deal with this later.
Drawing (uniformly weighted) x samples from a PDF is equivalent to drawing uniform samples s in [0,1] and calculating x = invCDF(s). (You can prove this simply with the probability "chain rule" for the PDF of a function f(x), PDF(f) = d/df CDF(f) = (dx/df) d/dx CDF(x) = PDF(x)/f'(x), and inserting f(x) = CDF(x).)
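A quick numerical sanity check of that statement (my addition, using a standard normal as an example target): applying the CDF to samples drawn from the PDF should give Uniform[0,1] samples.

```python
import numpy as np
from scipy.stats import norm, kstest

x = np.random.randn(100_000)      # samples drawn from the PDF
s = norm.cdf(x)                   # push them through the CDF
print(kstest(s, 'uniform'))       # consistent with Uniform[0, 1]
```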
So imagine you try to approximate the function CDF(x) (or invCDF(s)): you have drawn N unknown samples of s (uniform and independent), and you get out the N numbers x = f(s). The only information you have is that CDFs are monotonic.
What we are currently doing is assuming s = linspace(0,1,N) and plotting the interpolation.
But there is no reason to assume this, and in fact we know what s is -- s is just a set of N samples of Uniform[0,1] -- so we can sample from this. Notably, we are not generating samples and computing (a new) DCDF, but sampling from the space of functions CDF(x) (or invCDF(s)).
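A minimal sketch of this idea (my addition, assuming uniform weights): each CDF realisation simply assigns N sorted Uniform[0,1] draws as the CDF values at the sorted data points.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.sort(rng.normal(size=10))             # the fixed data points
for _ in range(100):
    s = np.sort(rng.uniform(0, 1, len(data)))   # one realisation of CDF(data)
    plt.plot(data, s, '.', alpha=0.2, color='C2')
plt.xlabel('x')
plt.ylabel('CDF')
plt.show()
```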
Here are 1000 different iterations of this:
We can plot a linear interpolation of these, although note again that linear interpolation does not give an unbiased representation.
See for example how CDF(1.6) is no longer fixed to almost 1 but rather uncertain. In fact we can draw the probability distribution for CDF(1.6): it is simply the distribution of "the largest value of N samples of Uniform[0,1]".
We considered the distribution of the "gaps" between the s samples; they should be Dirichlet distributed (N+1 gaps that sum to 1).
To calculate the distribution of the CDF at the n-th point (in the weighted case, weights w_i), consider its value as the sum of the "gaps" of size g_i. Generalizing the idea of integer samples ("just repeat the sample w times"), we suspect the CDF sample should be given by CDF(x_n) = sum_{i=0..n} g_i * w_i. To be continued...
Many thanks for this summary @Stefan-Heimersheim. One thing which occurred to me over the weekend as a potential gotcha with the "just repeat the sample w
times" approach is that this does depend on the normalisation, i.e. you would get different results with the weights [1,1,3,5,1,1]
in comparison with [100,100,300,500,100,100]
, the latter having less noise. I spent a bit of time this weekend thinking about how this relates to order statistics, but have yet to fully get my head around how to relate weights (which represent the portion of probability mass associated with each sample) to CDF samples.
One thing which occurred to me over the weekend as a potential gotcha with the "just repeat the sample w times" approach is that this does depend on the normalisation, i.e. you would get different results with the weights [1,1,3,5,1,1] in comparison with [100,100,300,500,100,100], the latter having less noise.
Yeah, this is what I was thinking of too when I mentioned my uncertainty here!
I am not 100% sure if many samples with smaller weights are equivalent to few samples with larger weights, since getting multiple samples at the exact point usually does not happen
So just to write down my intuitions for both cases (A, B) here:
A) Multiple samples != few weighted samples: Imagine a histogram of 1,000 vs 10,000 samples -- you can pretty clearly draw the PDF and CDF and read off intervals. But also clearly the accuracy of these will scale with having more samples, so 1,000 samples with weight=10 are not the same.
B) On the other hand,
- anesthetic's functions with the `ncompress` option resample uniformly weighted samples from arbitrarily weighted ones, don't they? Although looking at the code, it returns at most `len(samples)` new samples, so indeed it does not artificially increase the number of samples? I'm not entirely sure how `sample_compression_1d()` works if `ncompress=len(x)` and weights are strongly non-uniform; does this (`utils.py:385`) return the correct result?
  x_ = np.random.choice(x, size=ncompress, replace=False)
  Edit: Actually the right function is at line 60-ish, ignore this.
- MH MCMC returns the same parameter point multiple times in ~70% of cases (if acceptance ~30%), we are usually happy with these? Although we would say this gives a larger autocorrelation length and the effective number of samples is lower? But doesn't using chains with or without "thinning" give you the same result?
Anyway, before solving this issue I wanted to focus on the path from this to getting credibility intervals. From the figure above you can -- at least empirically -- derive the probability distribution for CDF(x_i). In between there will necessarily be interpolation (with the convexity/concavity problems discussed above) because we have random samples at fixed positions x_i.
Potentially you could look at it the other way, invCDF(y) is sampled at random positions y but only giving fixed values x_i, this seems maybe more difficult to treat.
To get a credibility interval we need CDF(b) - CDF(a). In the case of a and b aligning with sampling points, conditional on a given value CDF(a) we can get the distribution of CDF(b) from order statistics arguments (such as Will's Appendix A2, or potentially your Wikipedia link). We can (potentially analytically) marginalize over CDF(a) and get a distribution for CDF(b)-CDF(a), i.e. the size of that interval. More generally we could do this for arbitrary points using interpolation and marginalizing over the up to 4 relevant sampling positions.
Generally this will always give us a distribution & uncertainty on the volume of an interval ("[3,7] is a 95%+/-3% interval") rather than an uncertainty on the size ("[3+/-0.1, 7+/-0.2] is the 95% interval of this distribution"). In theory we should be able to translate between these two measures, the same way we could translate from a distribution on f given x to a distribution on x given f in fgivenx.
However, I don't think this is necessary. The goal of this code is mainly to be used for large N_samples, and all this optimization for ~10 samples is not too useful. If a user with 1,000 samples gets the output "[3,7] is the 95%+/-0.4% CI", that's good enough for most cases, and importantly, transparent and with quantified uncertainty.
So I could think of proposing the following algorithm:
- Find a guess for the desired interval (say level=95%) [a,b] using, say, `wpct`.
- Find the distribution of CDF(b) - CDF(a) based on analytical or empirical CDF distributions.
- Return "[a,b] is a 95%+/-0.2% confidence interval", maybe with a warning if the uncertainty is >~ 0.1*(1-level).
This should be achievable, and I expect [a,b] obtained this way to be close enough (for large N, for practical purposes) to the optimal "[a+/-sigma_a, b+/-sigma_b] = true 95% CI" result.
Generally this will always give us a distribution & uncertainty on the volume of an interval "[3,7] is a 95%+/-3% interval" rather than an uncertainty on the size "[3+/-0.1, 7+/-0.2] is the 95% interval of this distribution".
We should be able to get either version from a sampling distribution:
1. Assume you know the step heights of the CDF (we have been doing this all along with equal steps).
2. Compute the credibility bounds [a, b] for these step heights as before (e.g. using the current minimisation scheme).
3. Repeat steps 1 and 2 over samples drawn from the distribution of CDF values above for nsamples iterations (in practice ~O(12) is usually enough to get a median and spread).
4. You now have a set of [a, b] values for each of the samples from the CDF. These are samples drawn from the distribution, so taking a mean and std of these gives you the "[3+/-0.1, 7+/-0.2]" you're looking for.
You can of course reverse this process, changing [a, b] in the above for 'probability mass between fixed [a, b]' which gives you the first of your formulations (e.g. 95%+/-3%).
The whole premise of "thinking in samples" is that if you can do a forward analysis to get an output for a fixed set of inputs, then sampling those inputs and repeating the forward analysis to get a set of outputs gives you the sampling distribution of outputs. See slide 8, or 3.1 for more detail.
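For concreteness, here is a self-contained toy sketch of that loop for uniform weights (my own illustration; `shortest_interval` is a simple scan over a piecewise-linear CDF, not the anesthetic minimiser):

```python
import numpy as np
from scipy.interpolate import interp1d

def shortest_interval(data, cdf_vals, level):
    """Shortest [a, b] with cdf(b) - cdf(a) = level for a piecewise-linear CDF."""
    invcdf = interp1d(cdf_vals, data, bounds_error=False,
                      fill_value=(data[0], data[-1]))
    lo = np.linspace(0, 1 - level, 1000)                 # candidate lower CDF values
    widths = invcdf(lo + level) - invcdf(lo)
    i = np.argmin(widths)
    return invcdf(lo[i]), invcdf(lo[i] + level)

rng = np.random.default_rng(0)
data = np.sort(rng.normal(size=100))
intervals = []
for _ in range(12):
    # one CDF realisation: sorted Uniform[0, 1] draws as CDF values at the data points
    cdf_vals = np.sort(rng.uniform(0, 1, len(data)))
    intervals.append(shortest_interval(data, cdf_vals, level=0.95))
intervals = np.array(intervals)
print(intervals.mean(axis=0), intervals.std(axis=0))     # "[a +/- sigma_a, b +/- sigma_b]"
```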
Repeat steps 1 and 2 over samples drawn from the distribution of CDF values above for nsamples iterations (in practice ~O(12) is usually enough to get a median and spread)
I see what you mean -- so let's say I do this 12 times with a low-number-of-samples dataset, using CDFs sampled like above (think points like these):
Then I might get the results
[3,7] [3,7] [3,7] [1,7] [3,7] [5,9] [2,5] [3,7] [3,7] [3,7] [3,7] [3,7]
and compute [3+/-..., 7+/-...].
Notably, I would not use linear interpolation as it is systematically biased (tailed distributions always curve the same way), so the distribution would be quite discrete, which might occasionally lead to many samples being equal / make it harder to compute mean and variance -- but that's fine until we run into div-by-0 / all-numbers-identical errors, and then we deal with it.
Secondly (and this is not a bug either but a feature), it might happen (for particularly flat distributions) that we get things like [3,7], [4,8], [5,9], which are all 68%+/-0.1%, still leading to a big uncertainty of 4+/-1-ish on the boundary. Which is true: the uncertainty on the HPD-interval boundary is large. But this would be an argument in favour of using errorbars on the interval size rather than the interval boundary.
Anyway, now to implementing something in practice ...
Edit: Just discussed the issues in this comment with @yallup, see here, decided it does not work and a better way is using anesthetic's compression.
Let me also note down my thoughts on
One thing which occurred to me over the weekend as a potential gotcha with the "just repeat the sample w times" approach is that this does depend on the normalisation, i.e. you would get different results with the weights [1,1,3,5,1,1] in comparison with [100,100,300,500,100,100], the latter having less noise.
We should ask ourselves: what do we mean by integer weights? In a typical MCMC it means we have drawn multiple samples at this point. And it holds that the chance of drawing a sample at any point is proportional to the PDF at that point.
Those samples are not independent, so the samples do not follow things like the distance distribution P(distance between 2 random samples): P(2 identical random samples) == 0, yet it does occur in MCMC. Is that a problem for our approach above?
Empirically, thinking of the usual image, consider two samples at the same point. The y-distance between the CDF samples is a finite (non-zero) value, following some kind of distribution given by order statistics. Thus if we ask for, say, the 1% confidence interval, it might sometimes give a zero-sized interval (e.g. [3,3]). Now the question is, is that correct? Let's say 5 times we get [3,3] and 2 times we get [3,4]. In the first 5 cases the code will return "[3,3] is the 0+/-0% credibility interval", which is correct, and the user will notice they might be asking for an interval too small given their samples. In the other two cases the code might return "[3,4] is the 5+/-4% credibility interval". Again, I think this would be correct and the user would notice what's up.
PS: In the latter [3,4] case, is 5+/-4% computed with the "left" or "right" sample of 3? Hmm, this is not clear, but clearly at most one version can be correct. This problem does not occur if we go with @williamjameshandley's approach for computing [3+/-..., 7+/-...]-style intervals. So that might be the better option.
Let's briefly think of weights [100,100,300,500,100,100]. This would lead to a bunch of "lines of green dots" (as pictured above) bunched up together, and would make the jumps in the CDF much more pronounced. If we were to think of corresponding PDFs (take some interpolation), those could look something like this (artist's impression) rather than a Gaussian:
I'm not sure what to make of this, given that MCMC could plausibly give you [100,100,300,500,100,100] for a Gaussian.
Let's think of how to incorporate fractional weights (which should also solve the issue above). @williamjameshandley mentioned the gaps-sampling idea a few comments before. The gaps here are sampled from some distribution (Dirichlet?), and we should be able to generalize this for non-uniform gaps. Important considerations are:
- The gap-size distribution should somehow scale with the weight(s) of the sampling points, maybe even be proportional.
- A gap is between two points, see all the discussion on left/right-sided CDFs, `wpct`, etc. we had above.
- The sum of gaps must equal 1, or rather CDF(-inf) must be 0 and CDF(inf) must be 1.
- It's probably related to Will's order-statistics things linked above.
I'll give this a thought over lunch -- to be continued.
Edit: Just discussed the issues in this comment with @yallup, see here, decided it does not work and a better way is using anesthetic's compression.
For uniform weights, the gap distribution does indeed seem to be Dirichlet (in particular, with alpha_i=1). I think we also tested that when we chatted; here's a plot checking the marginal distribution (M=10 samples). The marginal distribution of one variable of the Dirichlet is a Beta distribution, so gap_i ~ sst.beta.pdf(gap, a=alpha_i, b=ngaps-alpha_i).
import numpy as np
import scipy.stats as sst
import matplotlib.pyplot as plt
N = 100000 #empirical test size
M = 10 #variables
data = np.random.uniform(0,1,size=(M, N))
sorted = np.sort(data, axis=0)
diffs = np.diff(sorted, axis=0)
gaps = np.array([sorted[0], *diffs, 1-sorted[-1]])
ngaps = M+1
plt.hist(gaps.T, bins=1000, histtype="step", density=True)
x = np.linspace(0,1,1000)
f = sst.beta.pdf(x, a=1, b=ngaps-1)
plt.plot(x,f, color="k", lw=2)
plt.show()
Just for reference, the sum of two gaps (i.e. gaps like the ones above where we treat weight=2 as 2 samples) seems to follow the corresponding marginal of a Dirichlet distribution with alpha_i = weight_i = 2:
plt.hist(gaps[6]+gaps[7], bins=1000, histtype="step", density=True)
f = sst.beta.pdf(x, a=2, b=ngaps-2)
which suggests that maybe we can do something along the lines of alpha = weights at some point.
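One possible concrete version of that "alpha = weights" idea (purely illustrative: the mapping of N weights onto N+1 gaps is exactly the left/right ambiguity discussed above, and this is not what the thread ultimately settled on):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([1., 1., 3., 5., 1., 1.])
alpha = np.concatenate([weights, [1.]])        # assign the extra (top) gap alpha=1 (an assumption)
gaps = rng.dirichlet(alpha, size=1000)         # each row of gaps sums to 1
cdf_at_points = gaps[:, :-1].cumsum(axis=1)    # sampled CDF values at the data points
```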
But the fundamental question is still: what do we do with samples (say x=3) occurring twice in the data? Earlier I wrote that this is a problem only when asking for a confidence interval [3,...] or [...,3], but actually it also affects @williamjameshandley's approach -- I don't think I solved this after all.
PS: In the latter [3,4] case, is 5+/-4% computed with the "left" or "right" sample of 3? Hmm, this is not clear, but clearly at most one version can be correct. This problem does not occur if we go with @williamjameshandley's approach for computing [3+/-..., 7+/-...]-style intervals. So that might be the better option.
In that approach, do you give the "jump" at 3 to both [3,...] and [...,3]? Does it bias the confidence-interval estimates even after we do all this random sampling on the y-axis (making intervals smaller / credibilities larger)? What is the confidence interval here?
Okay, I just had a chat with @yallup and we concluded that it seems very hard or impossible to work with repeated samples, and to that extent probably also with non-uniform weights.
Fundamentally, a weights=[100,200,500,200,100] case as mentioned before behaves fundamentally differently in this CDF picture from a weights=[1,2,5,2,1] case, which it should not. In particular, the former will have much more certain, less random steps in the CDF, making the jumps much stronger. Here's an example for 10 uniform samples vs the same samples with weight 10 each:
So since we fundamentally cannot deal with (integer) weights, or at least get different results based on the normalization, we just limit ourselves to uniform weights. @yallup pointed out that we can just use anesthetic's compress function to get uniform-weight samples from any data set, so we can call this in `credibility_interval`.
I will implement this.
Summary of thread above
This code
- Computes sample-based confidence intervals fast
- Does not use any kernel density estimation
- Estimates uncertainty by sampling empirical CDFs from the data
We discussed
- Various credibility-interval definitions and implementations
- Handling of weights: it's hard to handle non-binary weights, so we use `compress_weights()` with `ncompress=-1`
- The Dirichlet distribution of steps in the empirical CDF
The uncertainty is given as a covariance matrix (so errorbars on the intervals could be computed with sqrt(cov[i,i]), but of course the correlation is strong). Note that the uncertainty is from multiple CDF samples from the same data, and that
- the variance between multiple CDF samples from the same data
- the variance between CDFs computed from multiple data samples of the same distribution
are not the same.
Checklist:
- [x] I have performed a self-review of my own code
- [x] My code is PEP8 compliant (`flake8 anesthetic tests`)
- [x] My code contains compliant docstrings (`pydocstyle --convention=numpy anesthetic`)
- [x] New and existing unit tests pass locally with my changes (`python -m pytest`)
- [x] I have added tests that prove my fix is effective or that my feature works
- [ ] I have appropriately incremented the semantic version number in both README.rst and anesthetic/_version.py
Not sure what's up with the version number, merge master in again first.
Thank you @lukashergt for the great feedback, I plan to implement & answer those points within the next two weeks!
For code coverage to pass, the various cases 'll', 'ul', 'et' need to be tested specifically.
Right! Added tests & gave everything better names ('hpd' -> 'iso-pdf', 'll' -> 'lower-limit', etc.)
It might be nice to also add this as a method of our anesthetic.samples.Samples class (with automatic passing of weights etc.).
Done now (though all it does is pass weights automatically), let me know what you think of it!
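For reference, a hypothetical usage sketch (the utils-level call is the one tested earlier in this thread; the method call on the Samples object and its keyword names are assumptions, not checked against the merged API):

```python
import numpy as np
import anesthetic

samples = anesthetic.MCMCSamples(np.random.randn(1000, 2), columns=["x", "y"])

# utils-level call, as tested earlier in the thread
interval = anesthetic.utils.credibility_interval(samples.x)

# hypothetical method call with automatic weight handling (argument names assumed)
# interval = samples.credibility_interval("x", level=0.95, method="iso-pdf")
```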
Bayesian bootstrap in particular (see e.g. the corresponding Wikipedia article). Is that essentially what we have implemented?
I'm not sure; the Wikipedia article describes doing this with respect to weights of samples, while for us it's CDF samples? I don't understand the bootstrap well enough to answer this.
Thanks for all the suggestions @lukashergt! I've implemented everything now, except for the MultiIndex. I think that would have to be implemented in `weighted_pandas` similarly to the other functions, but I'm really not sure how that works.