Could we defer to flox for `GroupBy.first`?

Open max-sixty opened this issue 1 year ago • 4 comments

Is your feature request related to a problem?

I was wondering why a `groupby("foo").first()` call was running so slowly. I think we run a Python loop for this, rather than calling into flox:

https://github.com/pydata/xarray/blob/b9780e7a32b701736ebcf33d9cb0b380e92c91d5/xarray/core/groupby.py#L1218-L1231

Describe the solution you'd like

Could we call into flox? Numbagg has the routines...

Describe alternatives you've considered

No response

Additional context

No response

max-sixty, Oct 18 '24 20:10

Yes, the minor complication is that we should dispatch `nanfirst` and `nanlast` but not `first` and `last`. The latter are simply indexing with an indexer we already know, so the reduction approach is overkill.
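To illustrate the distinction, here is a minimal sketch in plain Python, with lists standing in for arrays. The helper names `group_first` and `group_nanfirst` and the integer `codes` encoding of the groups are assumptions made for this example; this is not how xarray or flox implement either operation.

```python
import math


def group_first(values, codes, num_groups):
    """Plain `first`: take the value at each group's first position.

    The positions depend only on the group labels, never on the data,
    so this amounts to indexing with a precomputed indexer.
    """
    first_index = [None] * num_groups
    for i, code in enumerate(codes):
        if first_index[code] is None:
            first_index[code] = i
    # Groups with no members get NaN.
    return [math.nan if i is None else values[i] for i in first_index]


def group_nanfirst(values, codes, num_groups):
    """`nanfirst`: first non-NaN value per group.

    This has to look at the data itself, so it is a genuine reduction.
    """
    result = [math.nan] * num_groups
    filled = [False] * num_groups
    for value, code in zip(values, codes):
        if not filled[code] and not math.isnan(value):
            result[code] = value
            filled[code] = True
    return result


# Three sequential groups; group 0 starts with a NaN.
values = [math.nan, 1.0, 2.0, math.nan, 3.0, 4.0]
codes = [0, 0, 1, 1, 2, 2]

first = group_first(values, codes, 3)
nanfirst = group_nanfirst(values, codes, 3)
```

For group 0, `first` returns the leading NaN because it only follows the indexer, while `nanfirst` skips it and returns `1.0`.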

Closing https://github.com/pydata/xarray/issues/8025 in favor of this one.

Out of curiosity how many groups does your problem have?

dcherian, Oct 20 '24 04:10

Sorry I missed #8025. I thought I had searched, but I guess searching for `first` hit lots of unrelated issues and I overlooked it.

> Out of curiosity how many groups does your problem have?

About 15K...

max-sixty, Oct 20 '24 19:10

> About 15K...

Do you end up using dask for this, or just numbagg? Are these groups randomly distributed along the dimension, or are there patterns to how they are distributed (e.g. are they sequential)?

Just curious...

dcherian, Oct 21 '24 13:10

> Do you end up using dask for this, or just numbagg?

I ended up just leaving it running for hours!

> Are these groups randomly distributed along the dimension, or are there patterns to how they are distributed (e.g. are they sequential)?

Yes, they're largely sequential!

max-sixty, Oct 21 '24 17:10