xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Add SeasonGrouper, SeasonResampler

Open dcherian opened this issue 1 year ago • 5 comments

These two groupers allow defining custom seasons, and dropping incomplete seasons from the output. Both cases are treated by adjusting the factorization -- conversion from group labels to integer codes -- appropriately.

The last piece from #8509

  • [x] Closes #757
  • [x] Closes https://github.com/pydata/xarray/discussions/6180
  • [x] Closes https://github.com/pydata/xarray/discussions/6664
  • [x] Closes https://github.com/pydata/xarray/discussions/5134
  • [x] Closes https://github.com/pydata/xarray/discussions/6012
  • [x] Closes https://github.com/pydata/xarray/discussions/6865
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

Example:

import xarray as xr
from xarray.groupers import SeasonGrouper, SeasonResampler

ds = xr.tutorial.open_dataset("air_temperature")

# custom seasons! 
ds.air.groupby(time=SeasonGrouper(["JF", "MAM", "JJAS", "OND"])).mean()

ds.air.resample(time=SeasonResampler(["DJF", "MAM", "JJAS", "ON"])).count()

TODO:

  • [ ] Needs a boatload of tests
  • [ ] narrative docs
  • [ ] support drop_incomplete in SeasonGrouper
  • [x] more cftime calendar support

cc @tomvothecoder do you have time to contribute some tests? I bet we'll simplify a bunch of xcdat this way, and you probably already have tests :)

dcherian avatar Sep 20 '24 03:09 dcherian

First comment, but I have performed only quick test

1 -

In your short example, it's probably : from xarray.groupers import SeasonGrouper, SeasonResampler and not : from xarray.core.groupers import SeasonGrouper, SeasonResampler

Or my Github knowledge is too limited, and I'm not testing the right branch.

2 - Season grouper

Seems OK for all I have tested. In particular I can :

  • Use it for one season only (subsampling): ds.air.groupby(time=SeasonGrouper(["MAM"]))['MAM']

  • Use it for 4 months season, as we often use in paleoclimate with calendar shifting (oversampling) ds.air.groupby(time=SeasonGrouper(["DJFM", "MAMJ", "JJAS", "SOND"])).mean()

3 - Season resampler

Works as expected from the example. I would really appreciate to have access to subsampling : ds.air.resample(time=SeasonResampler(["DJF", "MAM", "JJA", "ON"])) or : ds.air.resample(time=SeasonResampler(["JJAS"])) and to oversampling : ds.air.resample(time=SeasonResampler(["DJFM", "MAMJ", "JJAS", "SOND"]))

It could be useful to have a NaN value for an incomplete season : the first DJF cannot not be computed, and is not. This mean that the first value is not a DJF one, but a MAM value. Could be a bit misleading.

4 - cftime

I have tested it with cftime calendars instead of datetime. It works with the traditional calendar (gregorian, standard). But not with others like 360_day, 365_day, julian., proleptic_gregorian : TypeError: cannot compute the time difference between dates with different calendars

5 - Simple data

I've build a dataset with the number ot the month as a variable. So I'm sure that the computation is correct.

Thanks' for these features. They are quit easy and straigthforward to use.

In particular, it allows to work on variables, as xcdat features work on Dataset only, which yields a more complicated syntax.

I'm gonna try to imagine further tests.

Olivier

oliviermarti avatar Sep 20 '24 09:09 oliviermarti

Thanks @oliviermarti ! this is incredibly helpful

In your short example, it's probably : from xarray.groupers import SeasonGrouper, SeasonResampler

Yes, my mistake. I fixed the snippet.

ds.air.groupby(time=SeasonGrouper(["DJFM", "MAMJ", "JJAS", "SOND"])).mean()

This should not work, did you really get correct results.

It could be useful to have a NaN value for an incomplete season

The drop_incomplete option should let you control this.

dcherian avatar Sep 20 '24 15:09 dcherian

ds.air.groupby(time=SeasonGrouper(["DJFM", "MAMJ", "JJAS", "SOND"])).mean()

This should not work, did you really get correct results.

In fact not ! Only the first value is correct. A bit dangerous that it returns a result and not an error.

It could be useful to have a NaN value for an incomplete season The drop_incomplete option should let you control this.

drop_incomplete compute a value, using less month. This is the behaviour of xcdat. It would like to get a nan (I should ask that to xcdat too).

Olivier

oliviermarti avatar Sep 20 '24 15:09 oliviermarti

ds.air.groupby(time=SeasonGrouper(["DJFM", "MAMJ", "JJAS", "SOND"])).mean()

I went back to my dev notebook, turns out I figured this out already!

image

dcherian avatar Sep 20 '24 23:09 dcherian

cc @tomvothecoder do you have time to contribute some tests? I bet we'll simplify a bunch of xcdat this way, and you probably already have tests :)

Hi @dcherian, thank you for this PR! I've been looking forward to having this feature in Xarray. No guarantees on a timeline, but I plan to start looking at this PR this week. I'll experiment with this feature and see how I can leverage it to simplify xCDAT PR #423 for custom seasons. I'll also try to contribute any useful tests.

tomvothecoder avatar Sep 30 '24 16:09 tomvothecoder

Hey @dcherian, quick question. Will this PR add support for using SeasonGrouper along with Datetime components (e.g., ds.time.dt.year)?

For example, if I wanted to perform grouped averaging on year and custom seasons it might look like:

ds.air.groupby(time=[ds.time.dt.year, SeasonGrouper(["JF", "MAM", "JJAS", "OND"])]).mean()

tomvothecoder avatar Nov 13 '24 19:11 tomvothecoder

Yes, I think so this is conceptually similar to groupby(["time.month", "time.year"]) but we require dict inputs to be explicit

ds.air.groupby(
    {"time.year": UniqueGrouper(), "time": SeasonGrouper(["JF", "MAM", "JJAS", "OND"])}
).mean()

Here's an example, that is hopefully easy to interpret image

Is this right?

One trouble I'm having here is defining good tests for arbitrary seasons. Do you have suggestions or existing ones in xcdat that i can generalize?

dcherian avatar Nov 13 '24 19:11 dcherian

Yes, I think so this is conceptually similar to groupby(["time.month", "time.year"]) but we require dict inputs to be explicit

ds.coords["year"] = ds.time.dt.year
ds.air.groupby(year=UniqueGrouper, time=SeasonGrouper(["JF", "MAM", "JJAS", "OND"])).mean()

Here's an example, that is hopefully easy to interpret image

Is this right?

Yeah that's exactly what I was looking for. Thanks!

One trouble I'm having here is defining good tests for arbitrary seasons. Do you have suggestions or existing ones in xcdat that i can generalize?

Sure I have some in xcdat that you can adapt for this PR. I will share them with you shortly.

tomvothecoder avatar Nov 13 '24 22:11 tomvothecoder

Another question: If we're defining custom seasons with months that span the calendar year, those months are from the previous year correct?

For example for "NDJFM", "ND" should be from the previous year.

air.groupby(year=UniqueGrouper(), time=SeasonGrouper(["NDJFM"])) 

tomvothecoder avatar Nov 13 '24 23:11 tomvothecoder

Yes it tried to be that smart

dcherian avatar Nov 14 '24 02:11 dcherian

@tomvothecoder @oliviermarti i fixed the existing tests now, please try it out!

FWIW the need to support seasons=["JJAS"] is adding quite some complexity. We should consider not supporting it.

dcherian avatar Nov 14 '24 19:11 dcherian

I'm writing a few tests right now. How do you want me to add them to your fork branch?

Another question: If we're defining custom seasons with months that span the calendar year, those months are from the previous year correct?

For example for "NDJFM", "ND" should be from the previous year.

air.groupby(year=UniqueGrouper(), time=SeasonGrouper(["NDJFM"])) 

I noticed in a test I'm writing for the above code that "ND" is being taken from the same year, not the previous year. I think we expect the previous year "ND" to be used instead. I will show a clear example once I add the test.

tomvothecoder avatar Nov 15 '24 00:11 tomvothecoder

Ah nice find. A PR to this branch should be the easiest

dcherian avatar Nov 15 '24 00:11 dcherian

Ah nice find. A PR to this branch should be the easiest

Gotcha, will do.

RE: My comment above about annual seasonal averaging.

I've attached the Python script that compares the annual seasonal averages between Xarray and xCDAT. The custom seasons are "NDJFM", "AMJ". "NDJFM" spans the calendar year, so we expect the previous year "ND" to be used for grouping.

Results

Xarray (actual) uses the same year "ND" for grouping, while xCDAT (expected, PR #423) uses the previous year "ND". I manually verified that the averages for xCDAT are correct (here).

import numpy as np
import xarray as xr
import xcdat as xc  # noqa: F401
from xarray.groupers import SeasonGrouper, UniqueGrouper

# Create a sample dataset from 2001-01-01 to 2002-12-30
time = xr.cftime_range("2001-01-01", "2002-12-30", freq="MS", calendar="standard")
data = np.array(
    [
        1.0,
        1.25,
        1.5,
        1.75,
        2.0,
        1.1,
        1.35,
        1.6,
        1.85,
        1.2,
        1.45,
        1.7,
        1.95,
        1.05,
        1.3,
        1.55,
        1.8,
        1.15,
        1.4,
        1.65,
        1.9,
        1.25,
        1.5,
        1.75,
    ]
)
da = xr.DataArray(name="air", data=data, dims="time", coords={"time": time})
da["year"] = da.time.dt.year

# Actual (Xarray groupby with custom seasons)
# -------------------------------------------
actual = da.groupby(year=UniqueGrouper(), time=SeasonGrouper(["NDJFM", "AMJ"])).mean()

print(actual)

"""
Xarray uses the same year "ND" for "NDJFM" grouping (not expected).

<xarray.DataArray 'air' (year: 2, season: 2)> Size: 32B
array([[1.61666667, 1.38      ],
       [1.5       , 1.51      ]])
Coordinates:
  * year     (year) int64 16B 2001 2002
  * season   (season) object 16B 'AMJ' 'NDJFM'
"""

# Expected (xCDAT groupby with custom seasons)
# --------------------------------------------
ds = da.to_dataset()

custom_seasons = [["Nov", "Dec", "Jan", "Feb", "Mar"], ["Apr", "May", "Jun"]]
expected = ds.temporal.group_average(
    "air",
    weighted=False,
    freq="season",
    season_config={"custom_seasons": custom_seasons},
)

print(expected)
"""
xCDAT uses the previous year "ND" for "NDJFM" grouping (expected).

<xarray.DataArray 'air' (time: 5)> Size: 40B
array([1.25      , 1.61666667, 1.49      , 1.5       , 1.625     ])
Coordinates:
  * time     (time) object 40B 2001-01-01 00:00:00 ... 2003-01-01 00:00:00
Attributes:
    operation:                temporal_avg
    mode:                     group_average
    freq:                     season
    weighted:                 False
    drop_incomplete_seasons:  False
    custom_seasons:           ['NovDecJanFebMar', 'AprMayJun']
"""

print(expected.time)
"""
xCDAT represents time coords with cftime, with the middle month representing
the season.

<xarray.DataArray 'time' (time: 5)> Size: 40B
array([cftime.DatetimeGregorian(2001, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2001, 5, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2002, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2002, 5, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2003, 1, 1, 0, 0, 0, 0, has_year_zero=False)],
      dtype=object)
Coordinates:
  * time     (time) object 40B 2001-01-01 00:00:00 ... 2003-01-01 00:00:00
"""

tomvothecoder avatar Nov 18 '24 19:11 tomvothecoder

Xarray (actual) uses the same year "ND" for grouping, while xCDAT (expected, PR #423) uses the previous year "ND". I manually verified that the averages for xCDAT are correct (here).

In xCDAT, I get the indices all of all time coords with months that span the calendar year and shift them over a year (+1) before grouping with Xarray (since Xarray uses same year months for grouping). I haven't looked at the Xarray code for grouping yet, but there is probably a cleaner way to support spanning years.

tomvothecoder avatar Nov 19 '24 16:11 tomvothecoder

@tomvothecoder my mistake. that is a "resampling" operation, so

 da.resample(time=SeasonResampler(["NDJFM", "AMJ"], drop_incomplete=False)).mean()

gives what you want:

<xarray.DataArray (time: 5)> Size: 40B
array([1.25      , 1.61666667, 1.49      , 1.5       , 1.625     ])
Coordinates:
  * time     (time) object 40B 2000-11-01 00:00:00 ... 2002-11-01 00:00:00

We can't handle grouping by year and season separately.

dcherian avatar Nov 21 '24 04:11 dcherian

@oliviermarti & @tomvothecoder are you able to do one more test run here? It's basically complete though could use more tests as always.

dcherian avatar Mar 19 '25 04:03 dcherian

Deepak,

I wish I could try the new Groupers which are promising (as seen at https://discourse.pangeo.io/t/pangeo-showcase-xarrays-groupby-oh-my/4666)

My question is : how can I get the correct xarray branch for that ? Is it possible ?

I’ve try to run 2024-pangeo-showcase.ipynb with a fresh extract of xarray (main branch), and I get :

from xarray.groupers import SeasonGrouper

ImportError Traceback (most recent call last) Cell In[17], line 1 ----> 1 from xarray.groupers import SeasonGrouper

ImportError: cannot import name 'SeasonGrouper' from 'xarray.groupers' (/Users/marti/Unix/GitHub/xarray/xarray/groupers.py http://localhost:60151/Users/marti/Unix/GitHub/xarray/xarray/groupers.py)

Thanks for your help,

Olivier

Le 19 mars 2025 à 05:07, Deepak Cherian @.***> a écrit :

dcherian left a comment (pydata/xarray#9524) @oliviermarti & @tomvothecoder are you able to do one more test run here? It's basically complete though could use more tests as always.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

https://github.com/oliviermarti https://github.com/tomvothecoder https://github.com/pydata/xarray/pull/9524#issuecomment-2735269806 https://github.com/notifications/unsubscribe-auth/AAVYYTTQVRO5SKOBXZCFOCT2VDUQZAVCNFSM6AAAAABOREL7ECVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDOMZVGI3DSOBQGY

dcherian left a comment (pydata/xarray#9524) https://github.com/pydata/xarray/pull/9524#issuecomment-2735269806 @oliviermarti https://github.com/oliviermarti & @tomvothecoder https://github.com/tomvothecoder are you able to do one more test run here? It's basically complete though could use more tests as always.

— Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/9524#issuecomment-2735269806, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAVYYTTQVRO5SKOBXZCFOCT2VDUQZAVCNFSM6AAAAABOREL7ECVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDOMZVGI3DSOBQGY. You are receiving this because you were mentioned.

— Olivier Marti LSCE Bât 714 p. 1049 MERMAID Team Office : +33 1 69 08 77 27 Mobile : +33 6 45 36 43 74 @.***

oliviermarti avatar Apr 02 '25 14:04 oliviermarti

This PR branch is the branch: https://github.com/dcherian/xarray/tree/custom-groupers

dcherian avatar Apr 02 '25 15:04 dcherian

If you like this PR/API please voice support here: https://github.com/pydata/xarray/issues/10198 (a :+1: on the OP would be enough)

dcherian avatar Apr 02 '25 17:04 dcherian

This PR branch is the branch: https://github.com/dcherian/xarray/tree/custom-groupers

If I go there and search for SeasonGrouper, it is found in the design_notes/grouper_objects.md, but not in the code. See : https://github.com/search?q=repo%3Adcherian%2Fxarray+SeasonGrouper&type=code

I would be pleased to test SeasonGrouper. Or should I wait a litlle more ?

Olivier

oliviermarti avatar Apr 03 '25 08:04 oliviermarti

@oliviermarti & @tomvothecoder are you able to do one more test run here? It's basically complete though could use more tests as always.

Sure, I would like to. I just need more precision to get the right xarray version.

oliviermarti avatar Apr 03 '25 12:04 oliviermarti

I see it here: https://github.com/dcherian/xarray/blob/0d8210a6a5ffda32b80406ac077fec40a2a4fa15/xarray/groupers.py#L729

if you have the Github CLI it should be

git clone [email protected]:pydata/xarray.git
gh pr checkout 9524

or

git clone [email protected]:dcherian/xarray.git
git switch custom-groupers

dcherian avatar Apr 03 '25 16:04 dcherian

Hi,

My recent tests are really positive. The new Grouper syntax is simple and convenient. All this works with cftime calendars which is nice.

Now I would like to ask for two important features :

  • Weighted season averages.

Within a season, months have different lengths. So I need averages that reflect that, by weighting by the length of the months. When computing seasons from monthly means, it is not the case. My data (like model model outputs) have time_bounds that can be used to compute the time delta that could be used as weights.

  • Improved time axis.

I have data with time values at the middle of the time period (often middle of month), and time bounds at beginning/end of each period. This is very common in the CMIP database for instance. Model outputs are mainly averages on a given period (from the model time step to hour, day, month, etc ...). Period bounds are an important information.

I would like to have seasonnal averages that keep this kind of time axis : value at the middle of the period, and bounds. This will allows further computation with the right weighting by periods length. And plots will better reflect the averaging period.

Olivier

oliviermarti avatar Apr 04 '25 13:04 oliviermarti

Thanks for testing!

Yes, these are both important features, but we'll need to make progress on https://github.com/pydata/xarray/issues/8005 first, which is finally happening in https://github.com/pydata/xarray/pull/10137, https://github.com/pydata/xarray/pull/9671

dcherian avatar Apr 04 '25 14:04 dcherian

I'll try testing if I can soon! Thanks again for this PR @dcherian.

tomvothecoder avatar Apr 07 '25 16:04 tomvothecoder