
CorrDiff generic Xarray dataloader

Open jleinonen opened this issue 2 months ago • 0 comments

PhysicsNeMo Pull Request

Description

Adds a generic Xarray-based dataloader (XarrayDataset) for CorrDiff. It is meant for users who have simple use cases and don't need/want to write their own dataloaders. It can also be used as a baseline for more complex dataloaders.

The feature list is intentionally kept compact to reduce clutter, supporting a few common use cases and optimizations:

  • Supports any file type that can be read by xarray.open_dataset, e.g. NetCDF4 and Zarr.
  • Accepts one or more data files.
  • Can optionally pre-load data to RAM, which speeds up reads for small datasets.
  • Allows specifying time ranges (e.g. to split the data into training and validation sets) and excluded sample times (e.g. to filter out bad data).
  • Can shard the data so that each process uses only a slice of the dataset; this improves caching and allows larger datasets to be used with load_to_memory == True.
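To illustrate how the time-range, exclusion, and sharding options could combine, here is a minimal sketch of the index selection. It is not the code in this PR: the function name select_sample_indices and its parameters (start, end, exclude, rank, world_size) are hypothetical, and the half-open time range and contiguous per-process shards are assumptions made for the example.

```python
import numpy as np


def select_sample_indices(times, start=None, end=None, exclude=(),
                          rank=0, world_size=1):
    """Return the sample indices one process would read.

    Hypothetical helper, not the XarrayDataset API. Assumes a half-open
    time range [start, end) and contiguous shards per process.
    """
    times = np.asarray(times, dtype="datetime64[ns]")
    keep = np.ones(times.shape, dtype=bool)

    # Restrict to the requested time range (e.g. a training period).
    if start is not None:
        keep &= times >= np.datetime64(start)
    if end is not None:
        keep &= times < np.datetime64(end)

    # Drop explicitly excluded sample times (e.g. known bad data).
    if len(exclude) > 0:
        excluded = np.asarray(exclude, dtype="datetime64[ns]")
        keep &= ~np.isin(times, excluded)

    indices = np.flatnonzero(keep)

    # Give each process one contiguous slice of the remaining samples.
    return np.array_split(indices, world_size)[rank]


if __name__ == "__main__":
    # Ten daily samples: 2020-01-01 through 2020-01-10.
    times = np.arange("2020-01-01", "2020-01-11", dtype="datetime64[D]")
    for rank in range(2):
        idx = select_sample_indices(
            times, start="2020-01-02", end="2020-01-10",
            exclude=["2020-01-05"], rank=rank, world_size=2,
        )
        print(rank, idx.tolist())
```

With an xarray dataset that has a dimension named time (an assumption about the file layout), the resulting indices could be passed to ds.isel(time=indices), so that each process only touches, and optionally loads into memory, its own slice.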

The xarray_generic.py module provides a function, create_sample_dataset, that generates a mock data file to serve as a template for real data files.

A YAML configuration file for the dataset is included.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.
  • [ ] The CHANGELOG.md is up to date with these changes.
  • [ ] An issue is linked to this pull request.

Dependencies

jleinonen, Oct 22 '25 17:10