Add weight threshold option for temporal operations
Description
- Related to #531
- Related to #672
Checklist
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] My changes generate no new warnings
- [x] Any dependent changes have been merged and published in downstream modules
If applicable:
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass with my changes (locally and CI/CD build)
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [ ] I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 100.00%. Comparing base (683e973) to head (f21b2b6).
:warning: Report is 1 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #683 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 16 16
Lines 1768 1781 +13
=========================================
+ Hits 1768 1781 +13
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
To test this PR against real-world data, we can:
- Create a script that loops over model data with different weight threshold options
- Generate expected: CDAT temporal averaging with
criteriaargspecified - Generate result: Runs xCDAT temporal averaging with
min_weightspecified - Compare results against expected for closeness and aligned
np.nanvalues
climatology and departures seems not having min_weight parameter added. Can they have it as like group_average for consistency?
climatologyanddeparturesseems not havingmin_weightparameter added. Can they have it as likegroup_averagefor consistency?
I am converting this PR back to draft and will tag you once it is ready.
@pochedls I actually don't think this is dependent on #735. I will pick this one back up first since it is easier I think.
@pochedls I actually don't think this is dependent on #735. I will pick this one back up first since it is easier I think.
I can't find where I linked these together, but I agree this is a related-but-separate issue.
@pochedls I actually don't think this is dependent on #735. I will pick this one back up first since it is easier I think.
I can't find where I linked these together, but I agree this is a related-but-separate issue.
Sorry I wasn't clear. I mentioned at the meeting this issue was dependent on #735. I can actually tackle this before/at the same time as #735.
While reviewing this I spotted a situation where things aren't working as we might want / expect:
# import xcdat
import xcdat as xc
import numpy as np
# open a dataset
p = '/p/user_pub/work/CMIP6/CMIP/E3SM-Project/E3SM-2-0/piControl/r1i1p1f1/Amon/tas/gr/v20220913/'
ds = xc.open_mfdataset(p)
dsa = ds.spatial.average('tas').load()
# lets now drop the first 8 months
# leaving four months in the first year
dsas = dsa.isel(time=slice(8, len(ds.time)))
# now compute the annual average with weights
dsasa = dsas.temporal.group_average('tas', freq='year', min_weight=0.5)
# but the first year yields a value (that is much colder than the rest of the annual means)
dsasa.tas.values[0]
np.float64(286.3752011035211)
Even though we specify a minimum weight of 0.5 and we're missing more than half the data in the first year, xcdat produces a value for the annual average, because the weights are computed such that each month has weight num_of_days_in_month / num_of_days_in_year_in_dataset. If only December is in the dataset, you get W = 31 / 31. But I think you would expect the weight to be W = 31 / 365. Logic here. So if months are missing, it doesn't affect the weights.
I think you'd expect the averager to behave more like if you has NaN-ed out the first 8 months (rather than dropped them):
# lets now NaN-out the first 8 months (rather than drop)
dsas = dsa.copy()
dsas['tas'][0:9] = np.nan
# now re-compute the annual average with weights
dsasa = dsas.temporal.group_average('tas', freq='year', min_weight=0.5)
# first year is now a NaN
dsasa.tas.values[0]
np.float64(nan)
Here you get the result that you expect.
This might beyond the scope of this PR (possibly dealt with in this PR).
@lee1043 can you checkout this branch and perform testing with real data similar to the one for spatial averaging here?
Hi @lee1043, bugging you again for a review. This PR adds a min_weight option to group_average(), climatology(), and departures().
Example usage:
ds.temporal.group_average('ts', freq='year', min_weight=0.5)
ds.temporal.climatology('ts', freq='year', min_weight=0.5)
ds.temporal.departures('ts', freq='year', min_weight=0.5)
ds.temporal.group_average('ts', freq='year', min_weight=0.5) ds.temporal.climatology('ts', freq='year', min_weight=0.5) ds.temporal.departures('ts', freq='year', min_weight=0.5)
Hi @tomvothecoder, the first line of the above runs without error, but the following two lines are giving the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[8], line 1
----> 1 ds_clm = ds.temporal.climatology('ts', freq='year', min_weight=0.5)
File [~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py:657](https://jupyter.nersc.gov/user/lee1043/perlmutter-login-node-base/lab/tree/global/cfs/projectdirs/m4581/lee1043/work/xcdat_test/~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py#line=656), in TemporalAccessor.climatology(self, data_var, freq, weighted, keep_weights, reference_period, season_config, skipna, min_weight)
469 """Returns a Dataset with the climatology of a data variable.
470
471 Data is grouped into the labeled time point for the averaging operation.
(...) 653 }
654 """
655 self._set_data_var_attrs(data_var)
--> 657 return self._averager(
658 data_var,
659 "climatology",
660 freq,
661 weighted,
662 keep_weights,
663 reference_period,
664 season_config,
665 skipna,
666 min_weight=min_weight,
667 )
File [~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py:931](https://jupyter.nersc.gov/user/lee1043/perlmutter-login-node-base/lab/tree/global/cfs/projectdirs/m4581/lee1043/work/xcdat_test/~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py#line=930), in TemporalAccessor._averager(self, data_var, mode, freq, weighted, keep_weights, reference_period, season_config, skipna, min_weight)
929 """Averages a data variable based on the averaging mode and frequency."""
930 ds = self._dataset.copy()
--> 931 self._set_arg_attrs(
932 mode, freq, weighted, reference_period, season_config, min_weight
933 )
935 # Preprocess the dataset based on method argument values.
936 ds = self._preprocess_dataset(ds)
File [~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py:1053](https://jupyter.nersc.gov/user/lee1043/perlmutter-login-node-base/lab/tree/global/cfs/projectdirs/m4581/lee1043/work/xcdat_test/~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py#line=1052), in TemporalAccessor._set_arg_attrs(self, mode, freq, weighted, reference_period, season_config, min_weight)
1051 freq_keys = TIME_GROUPS[mode].keys()
1052 if freq not in freq_keys and "hour" not in freq:
-> 1053 raise ValueError(
1054 f"Incorrect `freq` argument. Supported frequencies for {mode} "
1055 f"include: {list(freq_keys)}."
1056 )
1058 self._mode = mode
1059 self._freq = freq
ValueError: Incorrect `freq` argument. Supported frequencies for climatology include: ['season', 'month', 'day'].
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[9], line 1
----> 1 ds_dpt = ds.temporal.departures('ts', freq='year', min_weight=0.5)
File [~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py:858](https://jupyter.nersc.gov/user/lee1043/perlmutter-login-node-base/lab/tree/global/cfs/projectdirs/m4581/lee1043/work/xcdat_test/~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py#line=857), in TemporalAccessor.departures(self, data_var, freq, weighted, keep_weights, reference_period, season_config, skipna, min_weight)
680 """
681 Returns a Dataset with the climatological departures (anomalies) for a
682 data variable.
(...) 854 }
855 """
856 # 1. Set the attributes for this instance of `TemporalAccessor`.
857 # ----------------------------------------------------------------------
--> 858 self._set_arg_attrs(
859 "departures", freq, weighted, reference_period, season_config
860 )
861 self._set_data_var_attrs(data_var)
863 # 2. Copy the original dataset and preprocess (if needed) for reuse.
864 # ----------------------------------------------------------------------
File [~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py:1053](https://jupyter.nersc.gov/user/lee1043/perlmutter-login-node-base/lab/tree/global/cfs/projectdirs/m4581/lee1043/work/xcdat_test/~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py#line=1052), in TemporalAccessor._set_arg_attrs(self, mode, freq, weighted, reference_period, season_config, min_weight)
1051 freq_keys = TIME_GROUPS[mode].keys()
1052 if freq not in freq_keys and "hour" not in freq:
-> 1053 raise ValueError(
1054 f"Incorrect `freq` argument. Supported frequencies for {mode} "
1055 f"include: {list(freq_keys)}."
1056 )
1058 self._mode = mode
1059 self._freq = freq
ValueError: Incorrect `freq` argument. Supported frequencies for departures include: ['season', 'month', 'day'].
Which is looking like freq=year was not implemented for climatology and departure. Let me know if I am missing something.
Sharing my test code in NERSC for the record:
import xcdat as xc
import os
data_dir = "/global/cfs/projectdirs/m4581/obs4MIPs/obs4MIPs_LLNL/MOHC/HadISST-1-1/mon/ts/gn/v20250415"
data_file = "ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc"
input_file = os.path.join(data_dir, data_file)
ds = xc.open_dataset(input_file)
# plot
ds.ts.isel(time=0).plot()
# test
ds_ann = ds.temporal.group_average('ts', freq='year', min_weight=0.5) # run ok
ds_clm = ds.temporal.climatology('ts', freq='year', min_weight=0.5) # error
ds_dpt = ds.temporal.departures('ts', freq='year', min_weight=0.5) # error
Which is looking like
freq=yearwas not implemented for climatology and departure. Let me know if I am missing something.
Hi @lee1043, yes, there is no freq=year for climatology() or departures(). We implemented season/month/day for seasonal/annual/daily climatologies/departures (based on CDAT API).
FYI: there is a related feature request: https://github.com/XCDAT/xcdat/issues/564
It looks like "> min_weight" being used, wondering rather ">= min_weight" could be used. For example, when min_weight=1 (consider grid with no missing time step), I get an empty field (see below, rightmost figure).
The sample input above is sea-surface temperature that has time varying and location varying missing value over the Arctic and surround the Antarctica, which represents sea-ice. Over the land value are all missing.
import xcdat as xc
import os
import matplotlib.pyplot as plt
data_dir = "/global/cfs/projectdirs/m4581/obs4MIPs/obs4MIPs_LLNL/MOHC/HadISST-1-1/mon/ts/gn/v20250415" # NERSC Perlmutter
data_file = "ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc"
input_file = os.path.join(data_dir, data_file)
ds = xc.open_dataset(input_file)
da = ds.ts
# Count missing values
missing_count = da.isnull().sum(dim="time")
# Total number of time steps
total_count = da.sizes["time"]
# Fraction of missing values (between 0 and 1)
missing_fraction = missing_count / total_count
ds_ann_0 = ds.temporal.group_average('ts', freq='year', min_weight=0.1)
ds_ann_1 = ds.temporal.group_average('ts', freq='year', min_weight=0.5)
ds_ann_2 = ds.temporal.group_average('ts', freq='year', min_weight=0.99)
ds_ann_3 = ds.temporal.group_average('ts', freq='year', min_weight=1)
# Plot
fig, (ax0, ax1, ax2, ax3, ax4) = plt.subplots(1, 5, figsize=(22, 5))
missing_fraction.plot(ax=ax0)
ds_ann_0.ts.isel(time=0).plot(ax=ax1)
ax1.set_title("min_weight=0.1")
ds_ann_1.ts.isel(time=0).plot(ax=ax2)
ax2.set_title("min_weight=0.5")
ds_ann_2.ts.isel(time=0).plot(ax=ax3)
ax3.set_title("min_weight=0.99")
ds_ann_3.ts.isel(time=0).plot(ax=ax4)
ax4.set_title("min_weight=1")
plt.tight_layout()
By the way, considering the fact that the sample input above is sea-surface temperature monthly time series, can min_weight be less than 1 because of difference length of months?
BTW, I have confirmed that the following lines are working without error (changed freq to month as a follow up of https://github.com/xCDAT/xcdat/pull/683#issuecomment-3177123910 -- thanks @tomvothecoder and @pochedls for the comments!):
ds_clm = ds.temporal.climatology('ts', freq='month', min_weight=0.5)
ds_dpt = ds.temporal.departures('ts', freq='month', min_weight=0.5)
It looks like "> min_weight" being used, wondering rather ">= min_weight" could be used. For example, when
min_weight=1(consider grid with no missing time step), I get an empty field (see below, rightmost figure).
Thank you for the review so far @lee1043.
@pochedls can you or someone else grant me read permission to /globa/cfs/projectdirs/m4581? I'm getting PermissionError: [Errno 13] Permission denied: '/global/cfs/projectdirs/m4581/obs4MIPs/obs4MIPs_LLNL/MOHC/HadISST-1-1/mon/ts/gn/v20250415/ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc' with Jiwoo's example script.
@tomvothecoder I put the input data here in NERSC, can you check if you can access to it?
/pscratch/sd/l/lee1043/temporary/ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc
@tomvothecoder I put the input data here in NERSC, can you check if you can access to it?
/pscratch/sd/l/lee1043/temporary/ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc
This works, thanks Jiwoo.