arviz HDI naming

Hi,

I’m opening this because it’s not really clear why, in ArviZ’s output, the bounds for, say, a 94% HDI are named hdi_3% and hdi_97%. Such names seem to imply that they correspond respectively to the 3rd and 97th percentiles of the posterior distribution, which would make them equal-tailed credible intervals if that were the case, but it isn’t necessarily. (HDIs and equal-tailed credible intervals don’t have to coincide.) Indeed, the following program shows a counterexample:

import arviz as az
import pymc as pm
import scipy.stats

with pm.Model() as model:
    pm.Exponential('expon', scale=100)
    trace = pm.sample(10000)

summary = az.summary(trace, hdi_prob=.94)
print(summary)

hdi_lower = summary['hdi_3%'].expon
hdi_upper = summary['hdi_97%'].expon

dist = scipy.stats.expon(scale=100)

print(f"\nThe 94% HDI ([{hdi_lower:.3f}, {hdi_upper:.3f}]) contains {dist.cdf(hdi_upper) - dist.cdf(hdi_lower):.3%} of the probability.")
print(f"{dist.cdf(hdi_lower):.3%} of the probability is to the left of hdi_3%.")
print(f"{dist.sf(hdi_upper):.3%} of the probability is to the right of hdi_97%. ({dist.cdf(hdi_upper):.3%} to its left.)\n")

print(f"The equal-tailed 94% credible interval is [{dist.ppf(0.03):.3f}, {dist.ppf(0.97):.3f}].")

Here is its output:

          mean       sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
expon  100.309  100.502   0.001  281.632       0.79    0.559   10051.0    8522.0    1.0

The 94% HDI ([0.001, 281.632]) contains 94.016% of the probability.
0.001% of the probability is to the left of hdi_3%.
5.983% of the probability is to the right of hdi_97%. (94.017% to its left.)

The equal-tailed 94% credible interval is [3.046, 350.656].

Is there maybe another reason for the naming? Or should it perhaps be reconsidered?

Thanks.

Jan 13 '24 09:01 sboukortt

Thanks for reporting this. Probably a better name would be hdi_lb (lower bound) and hdi_up (upper bound). Do you want to submit a PR?

Jan 16 '24 13:01 aloctavodia

A downside to the proposal of hdi_lb and hdi_ub is that the column names no longer inform which HDI was included. Would it be crazy if we have a single hdi_94% column that contains a 2-tuple or length 2 list with the bounds?

Feb 01 '24 22:02 sethaxen

And what about something like hdi_94_lb? The downside is that is maybe too long. Having tuples save horizontal space.

Feb 02 '24 00:02 aloctavodia

In arviz, I think numbers would be shown to 3 decimal points, so if they're on unit scale, we have 4 digits plus decimal plus maybe a sign plus whitespace for 12-14 total characters for the column contents. If using a tuple, this is 14-16 characters. While hdi_94% would just take 7 characters, using column names hdi_94_lb and hdi_94_ub would take 20 characters. hdi94_lb would be better, but still, this is 2-4 characters more than the tuple.

In PosteriorStats.jl, we show 3 significant figures, so the tuple takes 12-14 characters, so I'm more strongly in favor of tuples there. Plus, table display in Julia truncates columns from the right instead of the middle like pandas, so I'm more strongly in favor of tuples in PosteriorStats. Here, I'm less certain.

Feb 02 '24 08:02 sethaxen

I'm in general not in favor of storing sequences in a cell of a data frame. However, I think this case is a good exception, so I vote in favor of the tuple. Having the column name be something like hdi_{percent}% is better than the other alternatives I can think of.

Feb 18 '24 17:02 tomicapretto

arviz arviz copied to clipboard

HDI naming

arviz
arviz copied to clipboard