xhistogram icon indicating copy to clipboard operation
xhistogram copied to clipboard

[WIP] Multidimensional bins

Open TomNicholas opened this issue 3 years ago • 2 comments

Work-in-progress attempt to implement #28 . Builds atop #49.

  • Allows bins to be xr.DataArrays
  • Aligns and reshapes the bins the same way as the other arrays
  • Promotes bins from a single keyword argument to a list of arrays similar to data or weights
  • Passes them down into _bincount_2d_vectorized
  • Applies np.digitize in a loop over the broadcast dimension using different bins (can't apply digitize directly to the whole array because it only accepts 1D bins input)
  • Reconstructs the bin coordinates on the output in an ND manner

However my test for checking the results of a 2D bins arrays doesn't actually pass and I haven't yet worked out why :/ It generates the right shape but the wrong counts.

(There are also a couple of other failing tests but I think they are just to do with input sanitization edge cases)

@aaronspring

TomNicholas avatar Jun 07 '21 18:06 TomNicholas

I've realised that this feature is kind of inconvenient to try and implement in xhistogram in its current form.

xhistogram's numpy API means that all ND bins have to be passed as pure numpy arrays through xhistogram.core.histogram, even when the bins are originally given as xr.DataArrays. To use those bins (in blockwise or plain _bincount) you have to be sure which axis is which, but that requires either passing already-broadcast bins arrays (breaking the API), providing extra information as to which axes are which on the bins (again breaking the API), or automatically broadcasting the numpy arrays inside xhistogram.core.histogram (possible but error-prone and frustrating when I would prefer to simply call xr.broadcast_like). The only reason to do even do this is to satisfy xhistogram's pure-numpy API (which I'm not sure anyone even uses).

Instead, IMO it would be much easier (/a better use of time) to implement this feature in xarray directly (either in https://github.com/pydata/xarray/pull/5400 or in a later PR). That's because in a pure xarray implementation there would be no need to have an internal function accepting only unlabelled numpy arrays: as you always know which axis on your given bins array is which you can grab the indices from the dim labels as needed and pass them straight to dask.array.blockwise.

I like to think that this PR will be useful as a template for the xarray implementation though.

@aaronspring FYI

TomNicholas avatar Jun 08 '21 19:06 TomNicholas

For future reference this SO discussion is about whether there are clever vectorized ways to do the bincounting when the weights are >1D.

TomNicholas avatar Jun 15 '21 02:06 TomNicholas