Support specifying chunk sizes using labels (e.g. frequency string)

Open dcherian opened this issue 2 years ago • 5 comments

Is your feature request related to a problem?

dask.dataframe supports repartitioning or rechunking using a frequency string (freq kwarg).

I think this would be a useful addition to .chunk. It would help with some groupby problems (as suggested in this comment) and generally make a few problems amenable to blockwise/map_blocks solutions.

Describe the solution you'd like

  1. One solution is to allow .chunk(lon=5, time="MS"). The downside is that this syntax mixes integer index counts (lon=5) with a label-based frequency string (time="MS").
  2. So perhaps a second method, chunk_by_labels, would be useful, where chunk_by_labels(lon=5, time="MS") would rechunk the data so that a single chunk contains 5° of longitude points and a month of time. Alternatively, this could be .chunk(lon=5, time="MS", by="labels").
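
To make the "5° of longitude per chunk" idea concrete, here is a minimal sketch of how label-based chunk sizes could be derived for a numeric coordinate. The helper name chunks_from_label_width is hypothetical and is not part of xarray; it assumes a sorted 1-D coordinate and simply counts how many points fall into each label-width bin.

```python
import numpy as np


def chunks_from_label_width(labels, width):
    """Hypothetical helper (not an xarray API): chunk sizes such that
    each chunk spans at most `width` units of the coordinate labels."""
    labels = np.asarray(labels, dtype=float)
    if labels.size == 0:
        return ()
    if np.any(np.diff(labels) < 0):
        raise ValueError("labels must be sorted in increasing order")
    # Assign every label to a bin of `width` label units, starting at the first label.
    bins = np.floor((labels - labels[0]) / width).astype(int)
    # Count the points in each occupied bin; empty bins produce no chunk.
    _, counts = np.unique(bins, return_counts=True)
    return tuple(int(c) for c in counts)


lon = np.arange(0.0, 20.0, 0.5)  # 40 points at 0.5 degree spacing
lon_chunks = chunks_from_label_width(lon, 5)  # 5 degrees of longitude per chunk
```

With 0.5° spacing, each 5° chunk holds 10 points, whereas .chunk(lon=5) would put 5 points in each chunk regardless of the spacing.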

Describe alternatives you've considered

Have the user do this manually, but that's kind of annoying and a bit advanced.
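
For reference, the manual route might look roughly like the sketch below: count the time points in each frequency period with pandas and use the counts as explicit chunk sizes. The helper name chunks_from_freq is made up for illustration, and the sketch assumes a sorted datetime coordinate.

```python
import pandas as pd


def chunks_from_freq(time_index, freq):
    """Hypothetical helper (not an xarray API): number of time points
    in each `freq` period, usable as explicit chunk sizes."""
    time_index = pd.DatetimeIndex(time_index)
    if not time_index.is_monotonic_increasing:
        raise ValueError("time_index must be sorted in increasing order")
    # Count the points that fall into each period of the given frequency.
    counts = pd.Series(1, index=time_index).resample(freq).count()
    # Periods without data would give zero-length chunks, so drop them.
    return tuple(int(n) for n in counts if n > 0)


time = pd.date_range("2001-01-01", "2001-03-31", freq="D")
time_chunks = chunks_from_freq(time, "MS")  # one chunk per calendar month
```

The resulting tuple can then be passed as explicit chunk sizes, e.g. ds.chunk(time=time_chunks), since .chunk accepts a tuple of per-chunk lengths for a dimension.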

Additional context

No response

dcherian, Feb 24 '23 17:02

The chunk_by_labels functionality seems quite useful even when not talking about times, so I would be :+1: for that kind of option.

On the API question, is there anywhere else in xarray where we have made some choice about how to let the user choose between specifying via indexes or labels? Apart from just .isel vs .sel, I mean.

TomNicholas, Feb 24 '23 18:02

> is there anywhere else in xarray where we have made some choice about how to let the user choose between specifying via indexes or labels?

coarsen vs groupby/groupby_bins/resample.

I explored this idea in this tutorial.

I think it may be a fundamental concept for labelled array analysis. You need to pick whether you're working in "index space" like unlabelled arrays, or in "label space". This also came up in this issue where shift (and roll) operate in "index space".

Another example: Alignment is in "label space", broadcasting seems like "index space" (you just change shapes, but it does use dimension names to do that so maybe 50/50).
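
A small illustration of the two spaces, using coarsen (index space) and resample (label space) on the same daily series. This is only a sketch on toy data made up for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy data: 59 daily values covering January and February 2001.
time = pd.date_range("2001-01-01", "2001-02-28", freq="D")
da = xr.DataArray(
    np.arange(time.size, dtype=float), dims="time", coords={"time": time}
)

# Index space: blocks of 30 consecutive points, whatever their labels are.
# The 29 leftover points are trimmed because they do not fill a block.
by_count = da.coarsen(time=30, boundary="trim").mean()

# Label space: one group per calendar month, however many points it holds.
by_label = da.resample(time="MS").mean()
```

Here by_count has a single element, while by_label has one element per month with groups of 31 and 28 points. Chunking by a frequency string would be the label-space counterpart of today's integer chunk sizes.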

dcherian, Feb 24 '23 18:02

Now I think the way to generalize is to eventually support Resampler objects.

I think overloading the existing .chunk is nicer than a new chunk_by method, but I could be convinced otherwise.

I put up #9109 which allows specifying frequency strings.

dcherian, Jun 12 '24 21:06

Responding to @shoyer's comment:

> Are frequency strings unambiguous? Rechunking already supports memory sizes for Dask using strings.

The table here doesn't seem to overlap with MB, KB etc. but clearly this behaviour isn't tested. I'll fix that.

I see at least two ways to proceed with a more explicit API:

  1. A more explicit opt-in could be using Resampler objects, which we are pretty close to making public.
  2. Alternatively we could add the more explicit chunk_by(time="5ME").

dcherian, Jun 13 '24 16:06

> A more explicit opt-in could be using Resampler objects, which we are pretty close to making public.

I like this option.

shoyer, Jun 13 '24 17:06