Versioned NetCDFDataset
Description
For the sake of reproducibility, most datasets should support versioning by default. NetCDFDataset does not.
Context
Xarray datasets are very useful for scientific purposes, since they hold arrays together with many different coordinates and associated metadata. Given that they are usually used in a scientific context, this dataset should support versioning all the more.
Hi @jccalvojackson , thanks for opening this issue. We are doing some research on how Kedro can integrate with other systems that support versioning/checkpoints, such as DVC, Delta Lake, and Iceberg.
Given that our official NetCDFDataset is not versioned (as you point out), how are you working around this limitation at the moment?
I'm not at the moment 😅. Short term, I guess I'll just log this dataset to mlflow. Mid term, I'll implement the versioned version. Since I'm using other versioned datasets, I didn't want to introduce a different tool to manage versioning. I don't know much about Delta Lake and Iceberg, but they seem like overkill for most use cases, no? DVC seems promising.
For one of my projects I ended up using non versioned catalog + dvc as shown here.
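For reference, that workaround amounts to a plain, non-versioned catalog entry, with DVC tracking the file outside of Kedro. A sketch of what such an entry could look like (the dataset name and filepath are illustrative):

```yaml
# conf/base/catalog.yml — illustrative entry; the file itself is tracked
# with `dvc add data/02_intermediate/climate.nc` rather than by Kedro
climate_data:
  type: netcdf.NetCDFDataset
  filepath: data/02_intermediate/climate.nc
```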
Hi @jccalvojackson, do you still think it makes sense to make a versioned NetCDFDataset or does the catalog + dvc combo solve the issue?
Yes, it does make sense to have it, please. I'm using catalog + dvc on one project, but I will still need the versioned flavor of this dataset on another.
Okay got it! We'd appreciate a PR for this from the community, so I'll add the help wanted label. It's not a team priority at the moment, so it might be a while before we get to this.
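For anyone picking this up: a real implementation would presumably subclass Kedro's `AbstractVersionedDataset` and use xarray for I/O, but the core versioning scheme is just timestamped version directories under the dataset path. A minimal, dependency-free sketch of that scheme (the helper names `save_versioned` and `load_versioned` are hypothetical, not Kedro API):

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def generate_version() -> str:
    # Kedro-style version string: a UTC timestamp that sorts lexically
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")


def versioned_path(base: Path, version: str) -> Path:
    # Kedro lays versions out as <filepath>/<version>/<filename>
    return base / version / base.name


def save_versioned(base: Path, data: bytes) -> str:
    # Hypothetical helper: write `data` under a fresh version directory
    version = generate_version()
    target = versioned_path(base, version)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return version


def load_versioned(base: Path, version: Optional[str] = None) -> bytes:
    # Load a specific version, or the latest one if none is given;
    # lexical max works because the timestamps are zero-padded
    if version is None:
        version = max(p.name for p in base.iterdir() if p.is_dir())
    return versioned_path(base, version).read_bytes()
```

In a real `NetCDFDataset`, the byte read/write would be replaced by `xarray.open_dataset` / `Dataset.to_netcdf`, and the version resolution would be delegated to the `AbstractVersionedDataset` machinery rather than reimplemented.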
Very cool!
> such as DVC, Delta Lake, and Iceberg.
For context for the non-Pangeo folks (/those unfamiliar with versioning NetCDF-like and Zarr-like datasets) Icechunk is a project that takes inspiration from Iceberg to provide per-chunk versioning on Zarr datasets. So this isn't an answer to this question of versioning on NetCDF, but I think there is potentially a future avenue for integration allowing pipelining with Zarr (with Icechunk used for versioning).
More generally, I think that Kedro might be useful for data pipelining in the Pangeo ecosystem[^1], and if so it would be great to have better support for it (from a glance, all I can see is a little bit of support for NetCDF datasets?).
I still have a lot to learn about Kedro, and about how this pipelining lines up with the Pangeo ecosystem (if it does). ~~Thoughts on this @dcherian ?~~ Actually, I can also just do a bunch of googling and blog reading myself to get more familiar with pipelining in EO :) (mainly I want to hear about your experience with data pipelining in the geosciences and at Earthmover: have you encountered Kedro before? How are data pipelining and data engineering generally tackled in Pangeo?)
Hopefully, if this proves fruitful and useful, it can turn into an impactful contribution from my end :))))) .
Thanks for the talk at PyCon NL @merelcht, and the chats after! Really interesting to hear about Kedro.
[^1]: A quick google search for "pangeo" "kedro" yields nothing, and I can't see this being explored anywhere else.