VirtualiZarr
Create a community structure for sharing VirtualiZarr workflows and Icechunk virtual stores / Kerchunk references
Many Kerchunk workflows were developed as one-off Jupyter Notebooks that were shared as GitHub Gists or, at most, as Medium blog posts or conference presentations. While all these examples were fantastic, it was often difficult to find them and understand their differences. https://github.com/ProjectPythia/kerchunk-cookbook provided a more consistent structure, but was built after the fact by a small number of people. I think it would be valuable to promote a structure for sharing VirtualiZarr workflows earlier in the process, so that they are open, findable, and ideally consistently structured. I also think there's a lot to learn from STAC in this type of community organization and would like to propose mirroring the stactools-packages structure. In this model, we would:
- [ ] Create a Virtual-Zarr GitHub organization, akin to https://github.com/stactools-packages
- [ ] Create a template repository, akin to https://github.com/stactools-packages/template (xref #319)
- [ ] Create a method for people to transfer their repositories to https://github.com/virtual-zarr
- [ ] Create a landing page for Virtual Zarr datasets, akin to https://stactools-packages.github.io/
I think it would be great if we had a way for people to easily clone their virtual data stores to a publicly accessible location (to my knowledge this isn't in place for STAC). IIRC @norlandrhagen suggested source.coop as a potential hub for sharing the actual virtual stores.
@maxrjones I just stumbled across https://github.com/stac-utils/xstac/tree/main. They have examples using Kerchunk references to create STAC assets.
I love the forward-thinking-ness here, and I've also been mulling over what the world of findable virtual zarr stores could look like.
However, whilst I agree there is a lot to learn from STAC, I think we need to go quite a lot further than they have. In order to make all archival multidimensional scientific data actually "FAIR", we're going to need many layers:
1. The location of the original data (hopefully in object storage, but maybe still behind an HTTP server),
2. The VirtualiZarr workflow code which generated virtual references and dumped them into Icechunk,
3. One or more Icechunk stores (i.e. manifest files in object storage somewhere, not necessarily in the same bucket as the data),
4. A catalog entry which contains additional information about the contents of the Icechunk store (e.g. where the code used to generate it is), conforming to some catalog schema,
5. A searchable global index of catalog entries,
6. A website (landing page) which displays the catalog entries.
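To make the catalog-entry layer (4) a bit more concrete, here's a minimal sketch of what one entry tying the layers together might look like. To be clear, no such schema exists yet; every field name here is purely hypothetical, as are the URLs:

```python
# Hypothetical catalog-entry schema linking the layers above.
# All field names and URLs are illustrative, not an existing standard.
REQUIRED_FIELDS = {"id", "source_data", "workflow_repo", "icechunk_store"}

def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry passes this sketch's checks."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - entry.keys())]
    for url_field in ("source_data", "workflow_repo", "icechunk_store"):
        value = entry.get(url_field, "")
        if value and not value.startswith(("s3://", "gs://", "https://")):
            problems.append(f"{url_field} is not an object-store or https URL: {value}")
    return problems

entry = {
    "id": "example-era5-virtual",
    "source_data": "s3://some-bucket/era5/",                      # layer 1: original files
    "workflow_repo": "https://github.com/virtual-zarr/example",   # layer 2: workflow code
    "icechunk_store": "s3://another-bucket/era5-refs/",           # layer 3: reference manifests
    "description": "Virtual Zarr view of ERA5 netCDF files",      # layer 4: extra metadata
}

print(validate_entry(entry))  # → []
```

The point of the sketch is that the entry is pure metadata: the data (1), the code (2), and the Icechunk store (3) can each live anywhere, and layers (5) and (6) only need to index and render entries like this one.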
Of these layers, only (2) is actually executable code, which is why I don't think the solution here will be to create lots and lots of small GitHub repos. In theory all six could live in different places, and even be managed by different organisations!
Currently, (5) and (6) do not exist, at least not for Zarr specifically, and (4) barely exists; Arraylake's catalog is arguably a version of it. There are lots of existing prototypes to draw from (including from the STAC ecosystem), but none of them are as general as Zarr's data model. And this is before we even consider the idea of having non-Zarr versioned data (e.g. Iceberg) living alongside Zarr data...
One model of a solution here is that (4), (5), and (6) are all built and managed by one organization. That's GitHub's model - they have catalog entries (repos), search, and a master catalog website (github.com). (They also actually hold the equivalent of (1-3) in their systems too, we just don't really mind because every user of github automatically has a local backup of the whole history of all their code, in an open format.)
Another model is that catalog entries are created and hosted by independent organisations, following some common schema / using common tooling. This is more like the MediaWiki model, of which Wikipedia is just one instance. The downside is that although anyone can create their own catalog (wiki), there is no built-in global search across wikis, so (5) and (6) end up being managed by a separate centralized entity anyway (Google). (Though decentralized protocols like those used in the Fediverse could maybe help with that...)
Later this spring, Earthmover will be launching a free tier which will provide a catalog service for public Icechunk datasets. (Similar to GitHub's free tier for public git repos.) We think this will really help the community share and discover great datasets.
@TomNicholas @norlandrhagen what do you think about moving forward with making a virtual-zarr github organization? https://github.com/developmentseed/virtual-tiff is almost ready for use, which motivated me to transfer it to developmentseed's GitHub org. But now I'm realizing it might be better to start working from a common organization so that it's more likely to have community maintenance. Or could parsers all live in the zarr-developers organization?
I'm pretty agnostic as to which GH org the parsers live in. I think one argument against putting them into zarr-developers is how I keep having to ask @joshmoore to do things for me because I don't have full admin permissions across the zarr-developers organisation...
:+1: (Not that it bothers me nor that I wouldn't give Tom the rights if I could 🤷🏽)