Using MetPy to split up testing/training/validation xarray datasets for Machine Learning

Open ThomasMGeo opened this issue 7 months ago • 6 comments

What should we add?

Creating testing/training/validation datasets is a key step in machine learning workflows. For climate and weather ML analysis, we usually split these datasets along a time dimension.

Scikit-learn has a function that does this for 2D arrays and pandas DataFrames, but it can't split xarray datasets.

Improvements on the scikit-learn implementation:

  1. Built for xarray datasets
  2. Can create a validation dataset (a third dataset) in one call instead of two
  3. Can split datasets in a way that suits time series analysis (do not split time series data randomly!)
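
To make the three points above concrete, here is a minimal sketch of what such a splitter could look like. The function name `split_by_time`, its `fractions` and `dim` parameters, and the synthetic dataset are all assumptions for illustration; none of this exists in MetPy, xarray, or scikit-learn. It cuts the dataset into three contiguous blocks along the time dimension instead of sampling randomly.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_time(ds, fractions=(0.7, 0.15, 0.15), dim="time"):
    """Split ``ds`` into contiguous train/validation/test blocks along ``dim``.

    Hypothetical helper, not part of MetPy, xarray, or scikit-learn.
    """
    if len(fractions) != 3 or not np.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must be three values that sum to 1")

    # Sort first so the blocks run from earliest to latest.
    ds = ds.sortby(dim)
    n = ds.sizes[dim]

    # Index positions where the training and validation blocks end.
    train_end = round(n * fractions[0])
    val_end = train_end + round(n * fractions[1])

    # Positional slices keep each block contiguous in time (no shuffling).
    train = ds.isel({dim: slice(0, train_end)})
    val = ds.isel({dim: slice(train_end, val_end)})
    test = ds.isel({dim: slice(val_end, None)})
    return train, val, test


# Synthetic example: 100 daily time steps on a small grid.
times = pd.date_range("2020-01-01", periods=100, freq="D")
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(100, 4, 5))},
    coords={"time": times, "lat": np.arange(4), "lon": np.arange(5)},
)

train, val, test = split_by_time(ds)
print(train.sizes["time"], val.sizes["time"], test.sizes["time"])
```

Because the blocks are contiguous and ordered, every validation time step comes after every training time step, and every test time step after those, which is the property a random split would break.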

Big questions:

  1. Where should this go?
  2. Can we use `Dataset.metpy.parse_cf()` in a smart way to pull the time dimension automagically? This might not be required anyway.
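
On the second question, one possible fallback that does not depend on MetPy at all is to guess the time dimension from CF attributes or from the coordinate dtype. The sketch below is only a heuristic under that assumption; `guess_time_dim` is a hypothetical name, and this is not how MetPy identifies coordinates internally.

```python
import numpy as np
import pandas as pd
import xarray as xr


def guess_time_dim(ds):
    """Return the name of the dimension that looks like time, or None.

    Hypothetical helper; a heuristic, not MetPy's own logic.
    """
    for name in ds.sizes:
        # Dimensions without a coordinate variable carry no metadata to check.
        if name not in ds.coords:
            continue
        coord = ds.coords[name]
        # CF metadata is the most explicit signal.
        if coord.attrs.get("standard_name") == "time" or coord.attrs.get("axis") == "T":
            return name
        # Otherwise fall back to the decoded dtype.
        if np.issubdtype(coord.dtype, np.datetime64):
            return name
    return None


# Synthetic example where the time dimension is not literally called "time".
example = xr.Dataset(
    {"temperature": (("valid_time", "lat"), np.zeros((10, 3)))},
    coords={
        "valid_time": pd.date_range("2020-01-01", periods=10, freq="D"),
        "lat": np.arange(3),
    },
)
print(guess_time_dim(example))
```

If nothing matches, the function returns `None`, so a caller could still require the dimension name to be passed explicitly.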

Reference

No response

ThomasMGeo · Jul 22 '24 16:07