ggdist Shortest Probability Interval support

Shortest Probability Interval support

Open arthur-albuquerque opened this issue 2 years ago • 20 comments

Hi Matthew,

Do you plan to support the Shortest Probability Interval (further details here) with functions like ggdist::median_spi?

Thanks!

Ah at the moment I probably don't have time to investigate it, though I don't in principle object if it is a better variant of HDIs. How does it differ from hdci()?

Jun 19 '22 20:06 mjskay

This thread by @bwiernik seems relevant.

Jun 20 '22 23:06 arthur-albuquerque

Ah hrm. So it's also based on density estimators? I guess I'm not sure the advantage without doing more comparison (or seeing comparisons others have done if there are some). One nice property of hdci is that it doesn't require a density estimator.

Jun 21 '22 01:06 mjskay

The major advantage of SPI over HDI is reduced Monte Carlo error error. https://doi.org/10.1007/s11222-015-9563-8

Both HDI via empirical shortest distance algorithm and SPI via the method linked above are implemented in bayetestR if you wanted to swap in that for HPDInterval

Jun 21 '22 07:06 bwiernik

Yeah the reason I'm interested specifically in comparisons of SPI to HDCI (not HDI) is that HDCI uses the CDF approach from HDInterval, so if those two methods are similar I can (lazily) just leave things as they are ;)

Jun 21 '22 19:06 mjskay

What's HDCI stand for?

Jun 21 '22 20:06 bwiernik

@DominiqueMakowski I think you had some simulation code for comparing intervals set up? Could you compare 80% and 95% intervals for HDI, HDCI, and SPI?

Jun 21 '22 20:06 bwiernik

"hdci yields the highest-density continuous interval" https://mjskay.github.io/ggdist/reference/point_interval.html

Jul 02 '22 02:07 arthur-albuquerque

I should probably state more clearly what hdci is in the docs. It uses the algorithm for a continuous interval from HDInterval, which basically uses the ECDF to find the smallest interval of the specified mass: https://github.com/mikemeredith/HDInterval/blob/main/R/hdiVector.R

It is nice because it is simple, fast, and does not rely on picking a density estimator. But if the SPI algorithm has lower error maybe I'd switch it to that. Or a better option might be for me to make sure point_interval(.interval = ...) is compatible with the interval functions in bayestestR.

Jul 02 '22 02:07 mjskay

I think you had some simulation code for comparing intervals set up?

I can't seem to find it, but it was basically a big loop generating various distributions of different sizes and computing the different types of CIs. I'm not sure how we can reliably assess the "error" 🤔

Unless it already exists, it would be quite interesting to create a gist/vignette/paper that would 1) describe and explain the advantages/caveats of each interval method, 2) provide clean, community-contributed, R code to compute them (that could be re-used in other packages), and 3) provide some benchmarking on efficiency & behavior

_{(PS: I will be co-editing a special issue in Mathematics on Advances in Statistical Computing and I'm pretty sure such work would be a great fit...)}

Jul 02 '22 02:07 DominiqueMakowski

Unless it already exists, it would be quite interesting to create a gist/vignette/paper that would 1) describe and explain the advantages/caveats of each interval method, 2) provide clean, community-contributed, R code to compute them (that could be re-used in other packages), and 3) provide some benchmarking on efficiency & behavior

This is a great idea! I would be happy to help with something like this but don't really have the time to lead it. Maybe I could see if one of my students is interested...

Jul 02 '22 03:07 mjskay

Maybe I could see if one of my students is interested...

good idea! also tagging ~@easystats/maintainers~ (oops that doesn't work outside of the org) @strengejacke @mattansb

Jul 02 '22 04:07 DominiqueMakowski

Sounds great! I'm in (:

Jul 02 '22 04:07 mattansb

Doesn't the current implementation of bayestestR::hdi() actually return HDCI (and doesn't estimate any densities, a-la Kruschke), i.e. always a single interval is returned? I'm still not clear how this isn't the same as PSI...?

Jul 02 '22 04:07 mattansb

Ah, I see this is a different algo to estimate the same thing as HDCI. Cool!

Jul 02 '22 04:07 mattansb

Yeah, I'm interested in supporting a paper, too!

Jul 02 '22 09:07 strengejacke

Just need someone that actually does the job now 😁

Jul 02 '22 09:07 DominiqueMakowski

Alright I started playing with this for my own curiosity. Some comparisons here: https://github.com/mjskay/interval-estimators/blob/main/interval-estimators.md (source Rmd in same repo).

I can see three broad classes worth investigating:

equi-tailed intervals, for which the usual estimator is the quantile interval (which could come in several forms; e.g. quantile() has 9 different quantile estimators). ggdist::qi() and bayestestR::eti() are examples
shortest intervals, for which there is the empirical shortest interval based on quantiles (which could come in as many forms as the equi-tailed intervals); I believe ggdist::hdci() and bayestestR::hdi() are examples; then there is the "Spin" algorithm implemented as bayestestR::spi(); I also added a rough implementation of a variant that can use any of the 9 different quantile estimators in quantile() to the above document (see the function quantile_shortest_interval().
highest-density intervals, which unlike the other two may consist of multiple intervals. Implementations include ggdist::hdi() / HDInterval::hdi(density(x)), hdrcde::hdr(), and distributional::hdr()

I have been looking at a few quantities of interest:

root mean squared error of the lower and upper ends of the intervals
mean percentage overlap of the estimated interval with the true interval
coverage (probability mass of the true distribution contained by the interval)
variance (sd) of the endpoints of the intervals

Also to look at: bias.

One intriguing thing so far is that SPI doesn't perform much differently on these metrics compared to the empirical shortest interval implementations (with the exception of having coverage closer to nominal), which is surprising as the SPI paper says SPI should have lower error than HDCI. I'm not sure if that's an implementation issue or what.

The other thing I've found so far is that amongst HDI algorithms, distributional::hdr() is looking the best. In particular it has lower variance than others, performing consistently down near quantile intervals in terms of variance. All the other algorithms for HDIs or HDCIs tend to have higher variance than quantile intervals, which is why I have tended to recommend against HDIs and HDCIs. But changing the underlying implementation of HDIs to the one in distributional might reduce that concern for cases when HDIs are desired (so long as multiple intervals are acceptable).

See e.g. this chart: these are 80% ribbons; there are 100 groups along the x axis, each a sample of size 10,000 from N(0,1). As far as I'm concerned qi and (barely) hdr are the only two with acceptable variance here. The other three are different algorithms for shortest intervals. Given that 10,000 tends to be the upper end of effective sample sizes people tend to generate from MCMC, unless there's a better algorithm out there for shortest intervals we aren't aware of, I'm not sure I'd recommend using them...

Jul 04 '22 01:07 mjskay

I find it very surprising that even with 500 samples the methods don't differ that much (and that SPI seems to have relatively larger variance, if ever so slightly...).

Would it be interesting to compare these methods with ggdist::curve_interval() in the context of interval-bands? Or is that too different of an approach?

Jul 05 '22 07:07 mattansb

Would it be interesting to compare these methods with ggdist::curve_interval() in the context of interval-bands? Or is that too different of an approach

I think the comparison could make sense for single distributions. For multiple distributions, curve_interval() is not really looking at the same thing.

Jul 09 '22 18:07 mjskay

ggdist ggdist copied to clipboard

Shortest Probability Interval support

ggdist
ggdist copied to clipboard