xcdat icon indicating copy to clipboard operation
xcdat copied to clipboard

Add weight threshold option for temporal operations

Open tomvothecoder opened this issue 1 year ago • 4 comments

Description

  • Related to #531
  • Related to #672

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have performed a self-review of my own code
  • [x] My changes generate no new warnings
  • [x] Any dependent changes have been merged and published in downstream modules

If applicable:

  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] New and existing unit tests pass with my changes (locally and CI/CD build)
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [ ] I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

tomvothecoder avatar Jul 30 '24 22:07 tomvothecoder

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 100.00%. Comparing base (683e973) to head (f21b2b6). :warning: Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #683   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           16        16           
  Lines         1768      1781   +13     
=========================================
+ Hits          1768      1781   +13     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jul 30 '24 22:07 codecov[bot]

To test this PR against real-world data, we can:

  1. Create a script that loops over model data with different weight threshold options
  2. Generate expected: CDAT temporal averaging with criteriaarg specified
  3. Generate result: Runs xCDAT temporal averaging with min_weight specified
  4. Compare results against expected for closeness and aligned np.nan values

tomvothecoder avatar Mar 26 '25 21:03 tomvothecoder

climatology and departures seems not having min_weight parameter added. Can they have it as like group_average for consistency?

lee1043 avatar Apr 23 '25 18:04 lee1043

climatology and departures seems not having min_weight parameter added. Can they have it as like group_average for consistency?

I am converting this PR back to draft and will tag you once it is ready.

tomvothecoder avatar Apr 23 '25 22:04 tomvothecoder

@pochedls I actually don't think this is dependent on #735. I will pick this one back up first since it is easier I think.

tomvothecoder avatar Jul 17 '25 17:07 tomvothecoder

@pochedls I actually don't think this is dependent on #735. I will pick this one back up first since it is easier I think.

I can't find where I linked these together, but I agree this is a related-but-separate issue.

pochedls avatar Jul 17 '25 18:07 pochedls

@pochedls I actually don't think this is dependent on #735. I will pick this one back up first since it is easier I think.

I can't find where I linked these together, but I agree this is a related-but-separate issue.

Sorry I wasn't clear. I mentioned at the meeting this issue was dependent on #735. I can actually tackle this before/at the same time as #735.

tomvothecoder avatar Jul 17 '25 20:07 tomvothecoder

While reviewing this I spotted a situation where things aren't working as we might want / expect:

# import xcdat
import xcdat as xc
import numpy as np

# open a dataset
p = '/p/user_pub/work/CMIP6/CMIP/E3SM-Project/E3SM-2-0/piControl/r1i1p1f1/Amon/tas/gr/v20220913/'
ds = xc.open_mfdataset(p)
dsa = ds.spatial.average('tas').load()

# lets now drop the first 8 months
# leaving four months in the first year
dsas = dsa.isel(time=slice(8, len(ds.time)))

# now compute the annual average with weights
dsasa = dsas.temporal.group_average('tas', freq='year', min_weight=0.5)

# but the first year yields a value (that is much colder than the rest of the annual means)
dsasa.tas.values[0]

np.float64(286.3752011035211)

Even though we specify a minimum weight of 0.5 and we're missing more than half the data in the first year, xcdat produces a value for the annual average, because the weights are computed such that each month has weight num_of_days_in_month / num_of_days_in_year_in_dataset. If only December is in the dataset, you get W = 31 / 31. But I think you would expect the weight to be W = 31 / 365. Logic here. So if months are missing, it doesn't affect the weights.

I think you'd expect the averager to behave more like if you has NaN-ed out the first 8 months (rather than dropped them):

# lets now NaN-out the first 8 months (rather than drop)
dsas = dsa.copy()
dsas['tas'][0:9] = np.nan

# now re-compute the annual average with weights
dsasa = dsas.temporal.group_average('tas', freq='year', min_weight=0.5)

# first year is now a NaN
dsasa.tas.values[0]

np.float64(nan)

Here you get the result that you expect.

This might beyond the scope of this PR (possibly dealt with in this PR).

pochedls avatar Jul 24 '25 23:07 pochedls

@lee1043 can you checkout this branch and perform testing with real data similar to the one for spatial averaging here?

Hi @lee1043, bugging you again for a review. This PR adds a min_weight option to group_average(), climatology(), and departures().

Example usage:

ds.temporal.group_average('ts', freq='year', min_weight=0.5)
ds.temporal.climatology('ts', freq='year', min_weight=0.5)
ds.temporal.departures('ts', freq='year', min_weight=0.5)

tomvothecoder avatar Aug 06 '25 17:08 tomvothecoder

ds.temporal.group_average('ts', freq='year', min_weight=0.5)
ds.temporal.climatology('ts', freq='year', min_weight=0.5)
ds.temporal.departures('ts', freq='year', min_weight=0.5)

Hi @tomvothecoder, the first line of the above runs without error, but the following two lines are giving the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 ds_clm = ds.temporal.climatology('ts', freq='year', min_weight=0.5)

File [~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py:657](https://jupyter.nersc.gov/user/lee1043/perlmutter-login-node-base/lab/tree/global/cfs/projectdirs/m4581/lee1043/work/xcdat_test/~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py#line=656), in TemporalAccessor.climatology(self, data_var, freq, weighted, keep_weights, reference_period, season_config, skipna, min_weight)
    469 """Returns a Dataset with the climatology of a data variable.
    470 
    471 Data is grouped into the labeled time point for the averaging operation.
   (...)    653 }
    654 """
    655 self._set_data_var_attrs(data_var)
--> 657 return self._averager(
    658     data_var,
    659     "climatology",
    660     freq,
    661     weighted,
    662     keep_weights,
    663     reference_period,
    664     season_config,
    665     skipna,
    666     min_weight=min_weight,
    667 )

File [~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py:931](https://jupyter.nersc.gov/user/lee1043/perlmutter-login-node-base/lab/tree/global/cfs/projectdirs/m4581/lee1043/work/xcdat_test/~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py#line=930), in TemporalAccessor._averager(self, data_var, mode, freq, weighted, keep_weights, reference_period, season_config, skipna, min_weight)
    929 """Averages a data variable based on the averaging mode and frequency."""
    930 ds = self._dataset.copy()
--> 931 self._set_arg_attrs(
    932     mode, freq, weighted, reference_period, season_config, min_weight
    933 )
    935 # Preprocess the dataset based on method argument values.
    936 ds = self._preprocess_dataset(ds)

File [~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py:1053](https://jupyter.nersc.gov/user/lee1043/perlmutter-login-node-base/lab/tree/global/cfs/projectdirs/m4581/lee1043/work/xcdat_test/~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py#line=1052), in TemporalAccessor._set_arg_attrs(self, mode, freq, weighted, reference_period, season_config, min_weight)
   1051 freq_keys = TIME_GROUPS[mode].keys()
   1052 if freq not in freq_keys and "hour" not in freq:
-> 1053     raise ValueError(
   1054         f"Incorrect `freq` argument. Supported frequencies for {mode} "
   1055         f"include: {list(freq_keys)}."
   1056     )
   1058 self._mode = mode
   1059 self._freq = freq

ValueError: Incorrect `freq` argument. Supported frequencies for climatology include: ['season', 'month', 'day'].
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 ds_dpt = ds.temporal.departures('ts', freq='year', min_weight=0.5)

File [~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py:858](https://jupyter.nersc.gov/user/lee1043/perlmutter-login-node-base/lab/tree/global/cfs/projectdirs/m4581/lee1043/work/xcdat_test/~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py#line=857), in TemporalAccessor.departures(self, data_var, freq, weighted, keep_weights, reference_period, season_config, skipna, min_weight)
    680 """
    681 Returns a Dataset with the climatological departures (anomalies) for a
    682 data variable.
   (...)    854 }
    855 """
    856 # 1. Set the attributes for this instance of `TemporalAccessor`.
    857 # ----------------------------------------------------------------------
--> 858 self._set_arg_attrs(
    859     "departures", freq, weighted, reference_period, season_config
    860 )
    861 self._set_data_var_attrs(data_var)
    863 # 2. Copy the original dataset and preprocess (if needed) for reuse.
    864 # ----------------------------------------------------------------------

File [~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py:1053](https://jupyter.nersc.gov/user/lee1043/perlmutter-login-node-base/lab/tree/global/cfs/projectdirs/m4581/lee1043/work/xcdat_test/~/.conda/envs/xcdat_devel_20250806/lib/python3.13/site-packages/xcdat/temporal.py#line=1052), in TemporalAccessor._set_arg_attrs(self, mode, freq, weighted, reference_period, season_config, min_weight)
   1051 freq_keys = TIME_GROUPS[mode].keys()
   1052 if freq not in freq_keys and "hour" not in freq:
-> 1053     raise ValueError(
   1054         f"Incorrect `freq` argument. Supported frequencies for {mode} "
   1055         f"include: {list(freq_keys)}."
   1056     )
   1058 self._mode = mode
   1059 self._freq = freq

ValueError: Incorrect `freq` argument. Supported frequencies for departures include: ['season', 'month', 'day'].

Which is looking like freq=year was not implemented for climatology and departure. Let me know if I am missing something.

Sharing my test code in NERSC for the record:

import xcdat as xc
import os
data_dir = "/global/cfs/projectdirs/m4581/obs4MIPs/obs4MIPs_LLNL/MOHC/HadISST-1-1/mon/ts/gn/v20250415"
data_file = "ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc"
input_file = os.path.join(data_dir, data_file)
ds = xc.open_dataset(input_file)

# plot
ds.ts.isel(time=0).plot()

# test
ds_ann = ds.temporal.group_average('ts', freq='year', min_weight=0.5)  # run ok
ds_clm = ds.temporal.climatology('ts', freq='year', min_weight=0.5)  # error
ds_dpt = ds.temporal.departures('ts', freq='year', min_weight=0.5)  # error

lee1043 avatar Aug 11 '25 22:08 lee1043

Which is looking like freq=year was not implemented for climatology and departure. Let me know if I am missing something.

Hi @lee1043, yes, there is no freq=year for climatology() or departures(). We implemented season/month/day for seasonal/annual/daily climatologies/departures (based on CDAT API).

tomvothecoder avatar Aug 12 '25 19:08 tomvothecoder

FYI: there is a related feature request: https://github.com/XCDAT/xcdat/issues/564

pochedls avatar Aug 12 '25 19:08 pochedls

It looks like "> min_weight" being used, wondering rather ">= min_weight" could be used. For example, when min_weight=1 (consider grid with no missing time step), I get an empty field (see below, rightmost figure).

The sample input above is sea-surface temperature that has time varying and location varying missing value over the Arctic and surround the Antarctica, which represents sea-ice. Over the land value are all missing.

import xcdat as xc
import os
import matplotlib.pyplot as plt

data_dir = "/global/cfs/projectdirs/m4581/obs4MIPs/obs4MIPs_LLNL/MOHC/HadISST-1-1/mon/ts/gn/v20250415"  # NERSC Perlmutter
data_file = "ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc"
input_file = os.path.join(data_dir, data_file)

ds = xc.open_dataset(input_file)

da = ds.ts

# Count missing values
missing_count = da.isnull().sum(dim="time")

# Total number of time steps
total_count = da.sizes["time"]

# Fraction of missing values (between 0 and 1)
missing_fraction = missing_count / total_count

ds_ann_0 = ds.temporal.group_average('ts', freq='year', min_weight=0.1)
ds_ann_1 = ds.temporal.group_average('ts', freq='year', min_weight=0.5)
ds_ann_2 = ds.temporal.group_average('ts', freq='year', min_weight=0.99)
ds_ann_3 = ds.temporal.group_average('ts', freq='year', min_weight=1)

# Plot
fig, (ax0, ax1, ax2, ax3, ax4) = plt.subplots(1, 5, figsize=(22, 5))

missing_fraction.plot(ax=ax0)

ds_ann_0.ts.isel(time=0).plot(ax=ax1)
ax1.set_title("min_weight=0.1")
ds_ann_1.ts.isel(time=0).plot(ax=ax2)
ax2.set_title("min_weight=0.5")
ds_ann_2.ts.isel(time=0).plot(ax=ax3)
ax3.set_title("min_weight=0.99")
ds_ann_3.ts.isel(time=0).plot(ax=ax4)
ax4.set_title("min_weight=1")

plt.tight_layout()
Screenshot 2025-08-13 at 12 39 01 PM

By the way, considering the fact that the sample input above is sea-surface temperature monthly time series, can min_weight be less than 1 because of difference length of months?

lee1043 avatar Aug 13 '25 19:08 lee1043

BTW, I have confirmed that the following lines are working without error (changed freq to month as a follow up of https://github.com/xCDAT/xcdat/pull/683#issuecomment-3177123910 -- thanks @tomvothecoder and @pochedls for the comments!):

ds_clm = ds.temporal.climatology('ts', freq='month', min_weight=0.5)
ds_dpt = ds.temporal.departures('ts', freq='month', min_weight=0.5)

lee1043 avatar Aug 13 '25 19:08 lee1043

It looks like "> min_weight" being used, wondering rather ">= min_weight" could be used. For example, when min_weight=1 (consider grid with no missing time step), I get an empty field (see below, rightmost figure).

Thank you for the review so far @lee1043.

@pochedls can you or someone else grant me read permission to /globa/cfs/projectdirs/m4581? I'm getting PermissionError: [Errno 13] Permission denied: '/global/cfs/projectdirs/m4581/obs4MIPs/obs4MIPs_LLNL/MOHC/HadISST-1-1/mon/ts/gn/v20250415/ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc' with Jiwoo's example script.

tomvothecoder avatar Aug 14 '25 21:08 tomvothecoder

@tomvothecoder I put the input data here in NERSC, can you check if you can access to it?

/pscratch/sd/l/lee1043/temporary/ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc

lee1043 avatar Aug 14 '25 21:08 lee1043

@tomvothecoder I put the input data here in NERSC, can you check if you can access to it?

/pscratch/sd/l/lee1043/temporary/ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc

This works, thanks Jiwoo.

tomvothecoder avatar Aug 14 '25 23:08 tomvothecoder