arviz
arviz copied to clipboard
HDI naming
Hi,
I’m opening this because it’s not really clear why, in ArviZ’s output, the bounds for, say, a 94% HDI are named hdi_3%
and hdi_97%
. Such names seem to imply that they correspond respectively to the 3rd and 97th percentiles of the posterior distribution, which would make them equal-tailed credible intervals if that were the case, but it isn’t necessarily. (HDIs and equal-tailed credible intervals don’t have to coincide.) Indeed, the following program shows a counterexample:
import arviz as az
import pymc as pm
import scipy.stats
with pm.Model() as model:
pm.Exponential('expon', scale=100)
trace = pm.sample(10000)
summary = az.summary(trace, hdi_prob=.94)
print(summary)
hdi_lower = summary['hdi_3%'].expon
hdi_upper = summary['hdi_97%'].expon
dist = scipy.stats.expon(scale=100)
print(f"\nThe 94% HDI ([{hdi_lower:.3f}, {hdi_upper:.3f}]) contains {dist.cdf(hdi_upper) - dist.cdf(hdi_lower):.3%} of the probability.")
print(f"{dist.cdf(hdi_lower):.3%} of the probability is to the left of hdi_3%.")
print(f"{dist.sf(hdi_upper):.3%} of the probability is to the right of hdi_97%. ({dist.cdf(hdi_upper):.3%} to its left.)\n")
print(f"The equal-tailed 94% credible interval is [{dist.ppf(0.03):.3f}, {dist.ppf(0.97):.3f}].")
Here is its output:
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
expon 100.309 100.502 0.001 281.632 0.79 0.559 10051.0 8522.0 1.0
The 94% HDI ([0.001, 281.632]) contains 94.016% of the probability.
0.001% of the probability is to the left of hdi_3%.
5.983% of the probability is to the right of hdi_97%. (94.017% to its left.)
The equal-tailed 94% credible interval is [3.046, 350.656].
Is there maybe another reason for the naming? Or should it perhaps be reconsidered?
Thanks.
Thanks for reporting this. Probably a better name would be hdi_lb (lower bound) and hdi_up (upper bound). Do you want to submit a PR?
A downside to the proposal of hdi_lb
and hdi_ub
is that the column names no longer inform which HDI was included. Would it be crazy if we have a single hdi_94%
column that contains a 2-tuple or length 2 list with the bounds?
And what about something like hdi_94_lb
? The downside is that is maybe too long. Having tuples save horizontal space.
In arviz, I think numbers would be shown to 3 decimal points, so if they're on unit scale, we have 4 digits plus decimal plus maybe a sign plus whitespace for 12-14 total characters for the column contents. If using a tuple, this is 14-16 characters. While hdi_94%
would just take 7 characters, using column names hdi_94_lb
and hdi_94_ub
would take 20 characters. hdi94_lb
would be better, but still, this is 2-4 characters more than the tuple.
In PosteriorStats.jl, we show 3 significant figures, so the tuple takes 12-14 characters, so I'm more strongly in favor of tuples there. Plus, table display in Julia truncates columns from the right instead of the middle like pandas, so I'm more strongly in favor of tuples in PosteriorStats. Here, I'm less certain.
I'm in general not in favor of storing sequences in a cell of a data frame. However, I think this case is a good exception, so I vote in favor of the tuple. Having the column name be something like hdi_{percent}%
is better than the other alternatives I can think of.