Documentation for default values of point intervals, e.g. in stat_halfeye()?

Open ndphillipstalkiatry opened this issue 9 months ago • 3 comments

I love the visualizations in this package. Unfortunately, I struggled to find documentation (in the README, cheatsheets, and function documentation) that explains what values are used by default for the point intervals in plots like half-eye.

See attached image for the specific values I'm referring to, labelled as A, B_Low, B_High, C_Low, C_High.

While I had a hunch the others correspond to a boxplot's quantiles, I couldn't tell for sure.

I dug into the function documentation for median_qi(), which I realized is the default for the point_interval argument of stat_halfeye(), but even after a few minutes of reading I couldn't quite figure out what the exact values were for the "Low" and "High" points in my attached image.

As I imagine most users will use the default values, it might be helpful to clarify this in the main README and in vignettes with a diagram explaining what these values are by default.

I'd also like to get confirmation on what these are before I use them in a publication :)

Image

ndphillipstalkiatry avatar Apr 08 '25 23:04 ndphillipstalkiatry

I was hopeful that this diagram in vignette("slabinterval") would explain it, but this is more about custom aesthetics than the stats underlying the points and intervals. If we could have something just like this in the README but explaining the (default) stats that would solve my problem!

Image

ndphillipstalkiatry avatar Apr 08 '25 23:04 ndphillipstalkiatry

I agree that this is an important piece of documentation that's missing.

Since that documentation update hasn't appeared yet, and since I had the same question, here is the result of my dive into the matter.

The defaults are the .66 and .95 quantile intervals. It's being set here: https://github.com/mjskay/ggdist/blob/bad56a2176afc2878f428396130a448f8a704378/R/abstract_stat_slabinterval.R#L26

By way of reference, this is different from the boxplot, which uses the interquartile range (IQR) for the box and extends each whisker to the most extreme data point within 1.5 * IQR of the box edge.

I modified the sample code from the vignette to show that the default width plot is identical to one where the widths are set explicitly to .66 and .95, and added the standard boxplot for comparison.

Hopefully this makes it clear.

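library(ggplot2)
library(ggdist)
set.seed(1234)
# 1500 draws; group labels and the mean/sd vectors are recycled in step
df = data.frame(
  group = c("a", "b", "c"),
  value = rnorm(1500, mean = c(5, 7, 9), sd = c(1, 1.5, 1))
)
# Pass df to ggplot() directly so no pipe package needs to be loaded
ggplot(df, aes(x = value, y = group)) +
  # default point interval: median with 66% and 95% quantile intervals
  stat_pointinterval() +
  # the same widths set explicitly, nudged down for comparison
  stat_pointinterval(.width = c(.66, .95), position = position_nudge(y = -.1)) +
  # standard boxplot, nudged up for comparison
  geom_boxplot(width = 0.1, position = position_nudge(y = .12))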

Image

BeansIsFat avatar Jun 06 '25 14:06 BeansIsFat

Heh, you know, I thought I'd written this down somewhere, but looking through the docs it's not easy to unpack --- so thanks for raising this!

A version of the description of point_interval does exist in the original tidybayes documentation (the point and interval summary section here), though it was targeted at folks doing summaries of Bayesian posteriors and not really for descriptive statistics --- also doesn't help folks using ggdist not coming via tidybayes (which is probably the majority nowadays).

One could theoretically also put it together by looking at the default arguments of (say) stat_slabinterval(), which indicates the default .width is c(.66, .95) and default point_interval is median_qi, though then you'd have to read into the details of the point_interval page to know that qi means "quantile interval", and ... well that page needs a big overhaul to be usable.

All of this is to say, this info needs to be better described in a few places.

Besides that, to answer your question about what is the default: it is point_interval = median_qi and .width = c(.66, .95), which corresponds to a median point with 66% and 95% quantile intervals. ggdist specifies these quantities in terms of the mass contained in the resulting intervals instead of in terms of specific quantiles at the ends of the intervals because this approach generalizes to different interval estimators, including estimators that may not return continuous intervals with a set number of endpoints (such as highest-density intervals).

Fortunately, if you do just want some specific quantiles, quantile intervals (per their name) are directly defined as the equi-tailed interval formed by two quantiles containing the specified proportion. For example, the 95% quantile interval is (c(-1, 1) * 0.95 + 1) / 2 = c(0.025, 0.975). Going the other way, the bounds of the box in a boxplot are usually the 25th and 75th percentiles, i.e. the 75% - 25% = 50% quantile interval.
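As a minimal base-R sketch of that conversion: the helper name width_to_probs() below is hypothetical (it is not part of ggdist), and the sample data are made up for illustration. It turns an interval width into its two equi-tailed probabilities and then computes the median and the two default intervals with stats::quantile().

```r
# Hypothetical helper (not part of ggdist): convert an interval width
# (the mass contained in the interval) into its two equi-tailed probabilities.
width_to_probs <- function(width) {
  (c(-1, 1) * width + 1) / 2
}

width_to_probs(0.95)  # 0.025 0.975
width_to_probs(0.66)  # 0.17 0.83
width_to_probs(0.50)  # 0.25 0.75 (the box of a boxplot)

# Illustrative data only
set.seed(1234)
x <- rnorm(1500, mean = 5, sd = 1)

# Point summary and the two default-width intervals, computed with base R
med   <- median(x)
qi_66 <- quantile(x, probs = width_to_probs(0.66))
qi_95 <- quantile(x, probs = width_to_probs(0.95))
```

Assuming ggdist's quantile interval uses quantile() with its default settings, these values should match what median_qi(x, .width = c(.66, .95)) reports, but that is worth checking against your installed version.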

The whiskers are a bit stranger, since they aren't always defined exactly in terms of quantiles. You can't exactly replicate the common rule of "most extreme datapoint within the top of the box plus 1.5 * IQR", but you could find a quantile interval that corresponds to "the top of the box plus 1.5 * IQR" on some reference distribution. On a standard Normal distribution the top of the box is qnorm(0.75) and the IQR is 2 * qnorm(0.75), so the upper limit is 4 * qnorm(0.75), the quantiles are pnorm(c(-1, 1) * qnorm(0.75) * 4) = c(0.0035, 0.9965), and the interval is roughly the diff(pnorm(c(-1, 1) * qnorm(0.75) * 4)) = 0.993 quantile interval.
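The arithmetic above can be reproduced step by step in base R. This is only a sketch for the standard Normal reference distribution; the variable names are illustrative.

```r
# Whisker limit of a boxplot on a standard Normal, expressed as a quantile interval
box_top     <- qnorm(0.75)            # upper edge of the box (75th percentile)
iqr         <- 2 * qnorm(0.75)        # box is symmetric around 0
upper_fence <- box_top + 1.5 * iqr    # equals 4 * qnorm(0.75)

# Probabilities at the lower and upper fences, and the mass between them
probs <- pnorm(c(-1, 1) * upper_fence)
width <- diff(probs)

round(probs, 4)  # 0.0035 0.9965
round(width, 3)  # 0.993
```

On real data the whiskers stop at the most extreme observation inside the fence, so a 99.3% quantile interval is only an approximation of them.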

On the example above that looks like this:

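library(ggplot2)
library(ggdist)
set.seed(1234)
df = data.frame(
    group = c("a", "b", "c"),
    value = rnorm(1500, mean = c(5, 7, 9), sd = c(1, 1.5, 1))
)
# Pass df to ggplot() directly so no pipe package needs to be loaded
ggplot(df, aes(x = value, y = group)) +
    # default: median with 66% and 95% quantile intervals
    stat_pointinterval() +
    # 50% and 99.3% quantile intervals, approximating the box and whiskers
    stat_pointinterval(.width = c(.50, .993), position = position_nudge(y = -.1)) +
    geom_boxplot(width = 0.1, position = position_nudge(y = .12))

Image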

I should say that for various reasons I probably wouldn't actually use the 1.5*IQR rule nor a 99.3% quantile interval (quite noisy to estimate), especially if you're showing the underlying data anyway. But hopefully this gives you a sense of what exactly is being calculated.


As for the docs, thanks for pointing out this info is hard to track down --- some TODOs for me:

  • [ ] add something to a vignette on point_interval() and .width and their defaults (probably the slabinterval vignette, or just make a vignette on interval estimators)
  • [ ] revamp the point_interval function docs --- probably split it up into multiple pages and add a better summary in the descriptions at the top of what is being calculated for each interval estimator.

mjskay avatar Oct 04 '25 06:10 mjskay