CorrDiff generic Xarray dataloader
PhysicsNeMo Pull Request
Description
Adds a generic Xarray-based dataloader (XarrayDataset) for CorrDiff. It is meant for users who have simple use cases and don't need/want to write their own dataloaders. It can also be used as a baseline for more complex dataloaders.
The feature list is intentionally kept compact to reduce clutter, supporting a few common use cases and optimizations:
- Support for any file type that can be read by `xarray.open_dataset`, so e.g. NetCDF4 and Zarr will work.
- One or more data files can be used
- Data can optionally be pre-loaded to RAM, speeding up reads for small datasets
- Time ranges (e.g. for splitting the data into training and validation sets) and excluded sample times (e.g. to filter out bad data) can be specified
- Sharding data so that each process will only use a slice of the dataset; this allows better caching and larger datasets to be used with `load_to_memory == True`.
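
For reviewers who want a mental model of the time filtering and sharding items, here is a minimal standard-library sketch. It is not the code in this PR: the helper names `select_times` and `shard`, the inclusive end of the time range, and the contiguous split across ranks are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def select_times(times, start=None, end=None, exclude=()):
    """Return indices of samples inside [start, end] that are not excluded.

    Hypothetical helper; the inclusive end bound is an assumption.
    """
    excluded = set(exclude)
    return [
        i
        for i, t in enumerate(times)
        if (start is None or t >= start)
        and (end is None or t <= end)
        and t not in excluded
    ]


def shard(indices, rank, world_size):
    """Return the contiguous slice of `indices` owned by process `rank`.

    Hypothetical helper; a contiguous split is assumed because it keeps
    each process reading one region of the data.
    """
    n = len(indices)
    lo = rank * n // world_size
    hi = (rank + 1) * n // world_size
    return indices[lo:hi]


# Ten hourly samples starting at 2020-01-01 00:00.
times = [datetime(2020, 1, 1) + timedelta(hours=h) for h in range(10)]

# Keep hours 0 through 7 and drop one known-bad sample at hour 3.
train_idx = select_times(
    times,
    start=datetime(2020, 1, 1, 0),
    end=datetime(2020, 1, 1, 7),
    exclude=[datetime(2020, 1, 1, 3)],
)

# Split the remaining samples between two processes.
shards = [shard(train_idx, rank, 2) for rank in range(2)]
```

With a split like this, each process only ever touches its own slice, so a process that pre-loads its data needs memory for roughly `1 / world_size` of the selected samples rather than for the whole dataset.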
The `xarray_generic.py` module supplies a function, `create_sample_dataset`, that generates a mock data file to use as a template for real data files.
A YAML configuration file for the dataset is included.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
- [ ] The CHANGELOG.md is up to date with these changes.
- [ ] An issue is linked to this pull request.