xhistogram icon indicating copy to clipboard operation
xhistogram copied to clipboard

feature: multi-dimensional bins

Open aaronspring opened this issue 3 years ago • 12 comments

currently, xhistogram only allows bins to be one-dimensional.

however, when the bin edges vary in time (seasonality) or space (locations of the globe) xhistogram cannot be used with multi-dim bins. there is a hard-coded requirement for bins elements to be 1-d. One of such multi-dim bin applications is the ranked probability score rps we use in xskillscore.rps, where we want to know how many forecasts fell into which bins. Bins are often defined as terciles of the forecast distribution and the bins for these terciles (forecast_with_lon_lat_time_dims.quantile(q=[.33,.66],dim='time')) depend on lon and lat.

How we solved this in xskillscore.rps: < gives us CDFs, and diff brings it back to histograms. maybe have to throw away the upper edge

Fc = (forecasts < forecasts_edges).mean('member').diff(bin_dim)
Oc = (observations < observations_edges).astype("int").diff(bin_dim)

https://github.com/xarray-contrib/xskillscore/blob/493f9afd7b5acefb4baa47bec6ad65fca19965bd/xskillscore/core/probabilistic.py#L680

I first implemented rps with xhistogram, then with the snippet above, yields same results.

However, I am not sure whether such multi-dimensional bins would be an interesting addition to xhistogram or are out-of-scope.

aaronspring avatar Mar 11 '21 06:03 aaronspring

I think this dask pR is related: https://github.com/dask/dask/pull/7346

dcherian avatar Mar 11 '21 14:03 dcherian

after quickly skimming over np.histogram2d and this issue, I think that I am asking here for a different thing: I what that bins can be multi-dimensional, whereas in np.histogram bins is a 1d-array or int.

aaronspring avatar Mar 11 '21 16:03 aaronspring

I think this is a really nice contribution that would be appreciated by many users, though I'm not sure whether it's out of scope for xhistogram (thoughts @rabernat?).

One simple option could be to allow bins to be an xarray object, in which case we use something like the approach you give above to compute the histogram?

dougiesquire avatar Apr 10 '21 02:04 dougiesquire

@aaronspring I'm trying to make sure I understand what you're proposing here - is this an accurate summary of what you're suggesting?:

If I have data which includes a time dimension, and I currently count along other dimension(s), my result will still have a time dimension, but the resultant bin coordinates are currently only allowed to be one-dimensional. You are proposing to be able to pass a multidimensional bins array (for each variable potentially for an N-D histogram) which has bin edges that are a function of time. The result would have the same shape array for the bin_counts, but the bin coordinates on the output would vary along this time dimension (passed straight through from your input). You would have ended up with a histogram whose counts and bin edges changed over time - effectively a set of separate histograms calculated independently for each point in time.

If that's what you're suggesting then it actually sounds fairly doable - it's basically just allowing the bins arguments to be ND and then making sure they broadcast properly. You would also need an input check that your bins arrays don't vary along any of the dimensions you want to count over, because I think that would be nonsensical.

TomNicholas avatar May 27 '21 16:05 TomNicholas

@TomNicholas didnt see that linked PR. thanks for linking again here.

simply put: I want to use nd instead of 1d arrays as bins in xhistogram.

whats the kind of API/code example I was looking for: https://gist.github.com/aaronspring/251553f132202cc91aadde03f2a452f9 (dont focus on the results, just the dimensionality)

aaronspring avatar Jun 01 '21 13:06 aaronspring

Here's how to do it with flox: https://github.com/xarray-contrib/flox/pull/203 [well hopefully I got it right ;)]

Rendered version

dcherian avatar Jan 18 '23 05:01 dcherian

Well this looks great!

I did have a little go in the airport after AMS but didn't get anywhere near this far 😅

Next step would be to test it against histogram (for the cases xhistogram already handles).

What's the plan with flox integration into Xarray? Will it always be optional? Will it become part of main?

On Wed, Jan 18, 2023, 12:01 AM Deepak Cherian @.***> wrote:

Here's how to do it with flox: xarray-contrib/flox#203 https://github.com/xarray-contrib/flox/pull/203 [well hopefully I got it right ;)]

Rendered version https://flox--203.org.readthedocs.build/en/203/user-stories/nD-bins.html

— Reply to this email directly, view it on GitHub https://github.com/xgcm/xhistogram/issues/28#issuecomment-1386490718, or unsubscribe https://github.com/notifications/unsubscribe-auth/AISNPI4JVVCPS5D2WPUWW5TWS52JNANCNFSM4Y7VCW6Q . You are receiving this because you were mentioned.Message ID: @.***>

TomNicholas avatar Jan 18 '23 05:01 TomNicholas

What's the plan with flox integration into Xarray? Will it always be optional? Will it become part of main?

Apparently flox will stay optional, so if we moved this functionality into xarray it would rely on an optional import, but that's okay.

TomNicholas avatar Jan 18 '23 18:01 TomNicholas

@dcherian how might this work for N-dimensional histograms? I.e. placing N variables into N sets of bins. That's obviously one of the main features xhistogram provides. I notice your notebook says

The core factorize_ function (which wraps pd.cut) only handles 1D bins, so we use xr.apply_ufunc to vectorize it for us.

Does that mean we would still have to do a reshape of some kind?

TomNicholas avatar Feb 03 '23 20:02 TomNicholas

https://flox.readthedocs.io/en/latest/intro.html#histogramming-binning-by-multiple-variables

Does this make it clear

dcherian avatar Feb 03 '23 21:02 dcherian

Ooh!

Checking I've understood this correctly:

xarray_reduce(
    da,                  # weights, here just ones
    "labels1",           # name of 1st variable we want to bin
    "labels2",           # name of 2nd variable we want to bin
    func="count",        # count occurrences falling in bins
    expected_groups=(
        pd.IntervalIndex.from_breaks([-0.5, 4.5, 6.5, 8.9]),  # bins for 1st variable
        pd.IntervalIndex.from_breaks([0.5, 1.5, 1.9]),        # bins for 2nd variable
    ),
)

TomNicholas avatar Feb 03 '23 21:02 TomNicholas

Yes. PR to improve that page is very welcome!

dcherian avatar Feb 03 '23 21:02 dcherian