Improve bin count selection range

Open djbarnwal opened this issue 3 years ago • 4 comments

The current binning algorithm doesn't produce rounded boundaries. The bin boundaries are not snapped to round values, so they end up being awkward numbers like 34.20592.

It would be better if we could determine the selection ranges with nicer defaults in mind.

Part of #342

djbarnwal avatar Jun 06 '22 16:06 djbarnwal

I had some thoughts regarding this issue. The range of each bin would be defined by the algorithm we are going to implement in #375. Once that is solidified, we would use it to determine the number of bins. With that in the picture, having good edges conflicts with the algorithm: to make the edges more "readable" we would have to change the bin sizes, rendering the algorithm moot. Even if that is fine, the numeric profile UI would change a bit, because the max element would no longer be at the end. It would be somewhere just before the last bar.
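To make the tradeoff concrete, here is a minimal sketch comparing exact equal-width edges with "nice" edges. It is not the algorithm from #375; `nice_step` and `nice_edges` are hypothetical helpers that round the step up to 1, 2, or 5 times a power of ten, which is one common way to get readable edges.

```python
import math


def exact_edges(lo: float, hi: float, n_bins: int) -> list[float]:
    """Equal-width edges that start at the min and end at the max."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def nice_step(raw_step: float) -> float:
    """Round a positive step up to 1, 2, or 5 times a power of ten."""
    base = 10.0 ** math.floor(math.log10(raw_step))
    fraction = raw_step / base
    for multiplier in (1, 2, 5, 10):
        if fraction <= multiplier:
            return multiplier * base
    return 10 * base  # only reached through floating-point noise


def nice_edges(lo: float, hi: float, target_bins: int) -> list[float]:
    """Readable edges that cover [lo, hi]; the bin count may differ from target_bins."""
    step = nice_step((hi - lo) / target_bins)
    start = math.floor(lo / step) * step      # snap the first edge down
    n_bins = math.ceil((hi - start) / step)   # enough bins to reach the max
    return [round(start + i * step, 10) for i in range(n_bins + 1)]


# Made-up column range, for illustration only.
lo, hi = 3.7, 96.4
print(exact_edges(lo, hi, 10))  # unrounded edges, last edge is the max
print(nice_edges(lo, hi, 10))   # edges at 0, 10, ..., 100; the max falls inside the last bin
```

With these made-up inputs the nice edges run from 0 to 100 in steps of 10, so the max (96.4) sits inside the last bin rather than on its right edge, and the bin width is no longer the one the bin-count algorithm asked for.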

Any thoughts on this @bcolloran ?

djbarnwal avatar Jun 15 '22 15:06 djbarnwal

tl;dr: i think good edges are not that important if there is a tradeoff that we have to make

--

@djbarnwal I guess that, upon reflecting on it, my take is that the bin edges might not matter. One question: are these weird bin edge numbers ever shown to the user? I was trying to poke around in the UI and I couldn't spot any tooltips or other displays that printed the numeric value of any bin edge.

I guess I'd also say that if the task is to clearly depict a distribution, the bins and bin edges really are not the important part -- they are kind of an implementation detail that can be hidden from the person reading the plot. In the limit (kind of literally...) you can depict a distribution by plotting a kernel density estimate, which doesn't have bin edges b/c it doesn't have bins, and imo KDE plots are often the best way to display many distributions.
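As a rough illustration of the KDE point, here is a minimal Gaussian kernel density estimate using only the standard library. The function names are hypothetical, and the bandwidth uses the normal-reference rule of thumb (1.06 · σ · n^(-1/5)), which is an assumption for this sketch rather than a recommendation.

```python
import math
import statistics


def rule_of_thumb_bandwidth(values: list[float]) -> float:
    """Normal-reference bandwidth: 1.06 * sample stdev * n^(-1/5)."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)


def gaussian_kde(values: list[float], grid: list[float], bandwidth: float) -> list[float]:
    """Evaluate the density estimate at each grid point; no bins are involved."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


# Made-up sample, for illustration only.
sample = [1.0, 2.0, 2.5, 3.0, 7.0]
h = rule_of_thumb_bandwidth(sample)
grid = [-20 + 0.05 * i for i in range(1001)]  # covers -20 .. 30
density = gaussian_kde(sample, grid, h)
```

The curve is defined at every point on the grid, so the only numbers a reader sees are the axis ticks, which can be chosen independently of the data.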

Another less commonly used but often superior way of drawing distributions is to do so by binning not in the data space, but in the CDF space. For example, rather than binning heights in steps of (100cm-120cm], (120cm-140cm], ..., you bin the (sorted) observations by quantile, for example (0%-2%], (2%-4%], ..., (98%-100%]. If you go this way, then you know that every bin contains the same number of observations, but that also means that the bins will have uneven widths. However, that doesn't matter for the purposes of drawing the distribution. See e.g. https://aakinshin.net/posts/qrde-hd/ (I can't find a great reference, but this will hopefully get the idea across). I mention this not to suggest that we should implement something like this right now, but just to provide another example where the bin edges don't matter.
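A minimal sketch of the quantile-binning idea, assuming distinct values and a sample size that divides evenly into the bins. `quantile_bins` is a hypothetical helper, and this is a simplification of the idea, not the estimator from the linked post.

```python
def quantile_bins(values: list[float], n_bins: int) -> list[tuple[float, float, int, float]]:
    """Split the sorted sample into equal-count bins.

    Returns (left_edge, right_edge, count, density) per bin, where
    density = (count / n) / width, so bar areas sum to 1.
    Assumes distinct values; ties could give a zero-width bin.
    """
    xs = sorted(values)
    n = len(xs)
    bins = []
    for i in range(n_bins):
        start = i * n // n_bins
        stop = (i + 1) * n // n_bins
        left = xs[start]
        # The right edge is the first value of the next bin, or the max for the last bin.
        right = xs[stop] if stop < n else xs[-1]
        count = stop - start
        density = (count / n) / (right - left)
        bins.append((left, right, count, density))
    return bins


# Made-up, right-skewed sample, for illustration only.
sample = [1, 2, 3, 4, 10, 20, 40, 80]
for left, right, count, density in quantile_bins(sample, 4):
    print(f"({left}, {right}]  count={count}  density={density:.4f}")
```

Every bin holds the same number of observations, so the bar heights come entirely from the uneven widths: narrow bins where the data is dense, wide bins in the tail.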

Having taken some time to think about this more deeply, it also makes me reconsider the conversation we had in some other context about the bin edge aliasing that I was seeing -- we're just using histograms as a tool to show distributions, but the placement and edges of the bins are just an implementation detail, not something we need to show the user. So perhaps we should not have the 1px of space between bins.

bcolloran avatar Jun 15 '22 16:06 bcolloran

@djbarnwal, do we have open questions here, or are we completing this as won't do?

magorlick avatar Jun 27 '22 14:06 magorlick

@magorlick We are not sure we correctly understand what's needed here, or the reasoning behind having rounded selections. We were waiting for @hamilton to provide his thoughts on this when he's back.

djbarnwal avatar Jun 28 '22 07:06 djbarnwal

this has been superseded by #1473

hamilton avatar Jan 03 '23 06:01 hamilton