Improvements to multimodal HDI
Tell us about it
I propose several improvements to multimodal HDI:
- Expose configurable options. E.g. for float data, allow the user to specify `kde` keywords, in particular `bw` and `grid_len`.
- Switch to `bw='isj'` as the default for float data. The ISJ bandwidth selector performs very well in multimodal scenarios where one would typically want multimodal HDI. When modes are well-separated, Scott's rule oversmooths, producing HDI subintervals that are too wide; this behavior is inherited by `bw='experimental'`.
- Add an option to interpolate the KDE density to get the density at the sample points, then construct intervals from the `ceil(hdi_prob * len(data))` points with highest density. The resulting interval bounds are selected from the sample points. Asymptotically this converges to the same HDI, but it performs much better in cases where the KDE is too approximate to construct a good HDI estimate, e.g. when `grid_len` is too low to capture peaks well.
- For integer data, default to using a histogram with bin width of 1, exposing `bins` as a keyword to allow the user to override this behavior. This allows much more accurate HDI estimates for cases where the current bin number selector produces too few bins to capture details in the distribution. And this can easily end up being more efficient than unimodal HDI. On the other hand, for cases like 100 draws from Poisson(10000), this approach would yield an HDI with many gaps. Still, I think it's better to show structure than hide it.
- Unimodal HDI selects the smallest interval whose probability is >= `hdi_prob`. Currently multimodal HDI for integer data and histograms of width 1 would select the same interval if its probability is exactly `hdi_prob` and otherwise would omit 1 bin. This should be changed for consistency. For continuous multimodal HDI, both approaches would asymptotically be the same, but the proposed change should be adopted for consistency and code reuse.
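To make the sample-point option concrete, here is a rough NumPy-only sketch of what I have in mind. The names `grid_kde` and `sample_point_hdi` are hypothetical and not part of ArviZ, and the KDE is a plain Gaussian one with Scott's rule standing in for whatever `kde`/`bw` the real implementation would use (ISJ in the proposal).

```python
import numpy as np


def grid_kde(data, grid_len=512, bw=None):
    """Gaussian KDE on a regular grid (illustrative stand-in, not ArviZ's kde)."""
    data = np.asarray(data, dtype=float)
    if bw is None:
        # Scott's rule as a placeholder; the proposal would default to ISJ.
        bw = data.std(ddof=1) * data.size ** (-1 / 5)
    grid = np.linspace(data.min() - 3 * bw, data.max() + 3 * bw, grid_len)
    z = (grid[:, None] - data[None, :]) / bw
    density = np.exp(-0.5 * z**2).sum(axis=1) / (data.size * bw * np.sqrt(2 * np.pi))
    return grid, density


def sample_point_hdi(data, hdi_prob=0.94, grid_len=512, bw=None):
    """Multimodal HDI whose bounds are chosen among the sample points."""
    data = np.sort(np.asarray(data, dtype=float))
    grid, density = grid_kde(data, grid_len=grid_len, bw=bw)

    # Interpolate the gridded density to the sample points.
    dens_at_pts = np.interp(data, grid, density)

    # Keep the ceil(hdi_prob * n) points with the highest density.
    n_keep = int(np.ceil(hdi_prob * data.size))
    keep_idx = np.argsort(dens_at_pts, kind="stable")[::-1][:n_keep]
    keep = np.zeros(data.size, dtype=bool)
    keep[keep_idx] = True

    # Each contiguous run of kept points (in sorted order) is one subinterval.
    padded = np.concatenate(([False], keep, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = changes[::2], changes[1::2] - 1
    return np.column_stack((data[starts], data[stops]))


rng = np.random.default_rng(0)
draws = np.concatenate([rng.normal(-5, 1, 1000), rng.normal(5, 1, 1000)])
print(sample_point_hdi(draws, hdi_prob=0.94))
```

Because the bounds are sample points, the intervals contain exactly `ceil(hdi_prob * len(data))` draws when there are no ties, regardless of how coarse the grid is.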
Would it make sense to allow the user to specify an array of hdi_probs, all of which would be computed? Most of the computation would be shared for each hdi_prob, so this would allow faster HDI when more than 1 is needed for plotting.
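As a sketch of how the shared computation could look for integer data, assuming bins of width 1 and the ">= `hdi_prob`" rule from the last bullet: the counting and sorting happen once, and each probability only needs a lookup in the cumulative mass. `integer_hdi` is a hypothetical name, the return format (a dict of inclusive `[low, high]` integer bounds) is just for illustration, and ties between equally populated bins are broken arbitrarily here.

```python
import numpy as np


def integer_hdi(data, hdi_probs=(0.5, 0.94)):
    """Multimodal HDI for integer data with unit-width bins, for several probs."""
    data = np.asarray(data)

    # Shared work: count each occupied integer value and sort bins by mass.
    values, counts = np.unique(data, return_counts=True)
    order = np.argsort(counts, kind="stable")[::-1]  # highest mass first
    cum_mass = np.cumsum(counts[order]) / data.size

    result = {}
    for prob in np.atleast_1d(hdi_probs):
        # Smallest number of bins whose total mass is >= prob.
        n_bins = int(np.searchsorted(cum_mass, prob, side="left")) + 1
        n_bins = min(n_bins, values.size)
        kept = np.sort(values[order[:n_bins]])

        # Start a new subinterval wherever consecutive kept values are not adjacent.
        breaks = np.flatnonzero(np.diff(kept) > 1)
        starts = np.concatenate(([0], breaks + 1))
        stops = np.concatenate((breaks, [kept.size - 1]))
        result[float(prob)] = np.column_stack((kept[starts], kept[stops]))
    return result


draws = np.array([0] * 5 + [1] * 3 + [5] * 2)
for prob, intervals in integer_hdi(draws, hdi_probs=[0.5, 0.7, 0.9]).items():
    print(prob, intervals.tolist())
```

The same pattern should carry over to the float case: the KDE and the density ordering are computed once, and only the cutoff changes per `hdi_prob`.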
Also, would it be better to work on this here or in arviz-stats?
Overall, this sounds very good.
I think most of our efforts should be on ArviZ 1.0, and then this should go directly to arviz-stats. Having said that, if there is something that is a low effort but with a relatively high impact, then we can add it to the current arviz. For instance, switching to bw='isj' as the default may fit into that category.