Awesome-Zarr
🎀 Awesome Zarr resources

Zarr is a cloud-native, chunked, compressed, and hierarchical array data format.
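As a rough illustration of what "chunked" means here, the sketch below computes how many chunks a hypothetical array is split into and which chunk holds a given element. The array and chunk shapes are made up for illustration. The key formats assume the default chunk-key conventions of Zarr V2 (`1.0`) and Zarr V3 (`c/1/0`); real stores may be configured with a different separator.

```python
from math import ceil


def chunk_grid_shape(shape, chunks):
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(ceil(s / c) for s, c in zip(shape, chunks))


def chunk_key(index, chunks, v3=False):
    """Key of the chunk that holds the element at `index`.

    Assumes the default key encodings: "i.j" for Zarr V2, "c/i/j" for Zarr V3.
    """
    coords = [str(i // c) for i, c in zip(index, chunks)]
    if v3:
        return "c/" + "/".join(coords)
    return ".".join(coords)


# Hypothetical 1000 x 1000 array stored in 256 x 256 chunks
grid = chunk_grid_shape((1000, 1000), (256, 256))
key_v2 = chunk_key((300, 10), (256, 256))
key_v3 = chunk_key((300, 10), (256, 256), v3=True)
print(grid, key_v2, key_v3)
```

Because each chunk lives under its own key, a reader only needs to fetch the chunks that overlap the requested region, which is what makes the format a good fit for object storage.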
Contents
Resources
- Existing resources
- Introductory videos
- Zarr V3
- Libraries
- Platforms
- Articles
- Talks & Videos
- Life sciences
Topics
- Zarr & other array data formats
- GeoZarr
- Zarr & STAC
Resources
Existing resources
The Zarr website is already an excellent resource for learning about Zarr and its ecosystem. This list is intended to complement the website with a curated and opinionated list of resources.
This list focuses on Geo/Earth Sciences, but is not limited to that domain.
Existing lists
Lists
- The Zarr website already contains great lists: Zarr Implementations, Zarr Datasets, Zarr metadata conventions
- Zarr tutorials (zarr-developers/tutorials)
- Projects using Zarr (zarr-developers/community#19)
- Beautiful Zarr (zarr-developers/beautiful-zarr)
- See playlists & lists in Talks & Videos
Introductory videos
Introductory talks YouTube playlist
Two excellent and up-to-date introductory talks:
Zarr V3
Zarr V3 is the upcoming version of Zarr. It is a major update that will bring many new features and improvements.
If you're getting into Zarr now, it might be a good idea to start with Zarr V3.
For an excellent in-depth overview, see the ESIP series of talks:
- 2023-03-27 ESIP Cloud Computing Cluster: Zarr - The Next Generation
- 2023-04-24 ESIP Cloud Computing Cluster: Next Generation of Zarr Part 2/3 GeoZarr and Zarr Sharding
- 2023-05-22 ESIP CCC: Next Gen Zarr Part 3/3: accumulation proposal, Kerchunk and Pangeo-Forge
Libraries
This list contains libraries that directly relate to Zarr in some way.
For implementations of Zarr, see Zarr Implementations.
- kerchunk, see kerchunk section
- xpublish: Exposing Xarray datasets as Zarr, and consuming them, through a REST API
- See also routers at xpublish-community, e.g. xpublish-opendap
- Improving Access to NOAA NOS Model Data with Kerchunk and Xpublish
- ndpyramid: utility for generating ND array pyramids using Xarray and Zarr
Storage & I/O
- TensorStore and xarray-tensorstore: library for efficiently reading and writing large multi-dimensional arrays, with Zarr support
- KvikIO: C++ and Python bindings to cuFile, enabling GPUDirect Storage
- rechunker: disk-to-disk transformation for chunked arrays
- xpartition: writing large xarray datasets to Zarr. Works around shortcomings of Dask (distributed#6360)
ETL
- Xarray: Zarr is commonly written and accessed through xarray's API.
- Xarray has its own Zarr Encoding Specification
- xarray-beam: Integration of xarray and Apache Beam built using Zarr.
- Pangeo-forge: Open-source data platform for transforming datasets into analysis-ready cloud-optimized formats.
Developer-oriented
- numcodecs: Compression and transformation codecs used by Zarr
- pydantic-zarr: Pydantic models for Zarr objects
- traverzarr: Traversing Zarr JSON as if it's a filesystem
- zarr_checksum: Calculating checksum information from Zarr
- zarrdump: Describe zarr stores from the command line
Visualization: For tools & libraries for visualization, see visualization section
Kerchunk
Kerchunk allows you to efficiently read chunked data formats such as GRIB, NetCDF, and COGs by exposing them as a Zarr store.
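To make the idea concrete, here is a minimal sketch of how a set of references can map Zarr keys to byte ranges inside an existing file. It loosely follows the layout of Kerchunk's JSON reference format (a `refs` mapping whose values are either inline strings or `[path, offset, length]` entries), but the file contents, the key names, and the `read_ref` helper are hypothetical. Details such as base64-encoded inline values and remote URLs are left out. In practice, Kerchunk generates the references and fsspec resolves them.

```python
import json
import os
import tempfile


def read_ref(refs, key):
    """Return the bytes that a reference entry points to (simplified)."""
    entry = refs[key]
    if isinstance(entry, str):
        # Inline value, typically small JSON metadata
        return entry.encode("utf-8")
    if len(entry) == 1:
        # A bare path refers to the whole file
        with open(entry[0], "rb") as f:
            return f.read()
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for an existing binary file: a header followed by two "chunks"
payload = b"HEADER" + b"A" * 8 + b"B" * 8
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(payload)
    legacy_path = tmp.name

# Hypothetical reference set: metadata is inlined, chunks are byte ranges
references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "data/0": [legacy_path, 6, 8],
        "data/1": [legacy_path, 14, 8],
    },
}

first_chunk = read_ref(references["refs"], "data/0")
second_chunk = read_ref(references["refs"], "data/1")
os.remove(legacy_path)
```

The original file is never rewritten or copied: only the small reference mapping is new, which is why this approach works well for large existing archives.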
Talks and tutorials
- All you need is Zarr
- 2022 ESIP Kerchunk Tutorial
- Accessing NetCDF and GRIB file collections as cloud-native virtual datasets using Kerchunk
Future of Kerchunk
In the future, Kerchunk will be split into upstream functionality in Zarr itself and a new VirtualiZarr package.
- Kerchunk JSON references will become a part of the Chunk manifest
- For a full overview, see Upstreaming Kerchunk
- What's Next for Kerchunk
Platforms
- Arraylake: a data lake platform based on Zarr. The company, Earthmover, was started by core Zarr developers.
Articles
- NASA IMPACT: Zarr Visualization Report
- Earthmover: cloud-native data loaders for machine learning using zarr and xarray
- Zarr Sprint Recap: relevant overviews
Talks & Videos
Existing lists
- Zarr Developers playlists, namely
- Zarr Talks
- Introductory videos in this list
Talks
- Earthmover Webinar: Building a Planetary Scale Earth Observation Data Cube in Zarr with code repository and slides
- Earthmover Webinar: Analysis-ready Weather Forecast Data Cubes with Zarr with code repository and slides
- Presentation | Zarr: Community specification of large, cloud-optimised, N-dimensional, typed array storage
- Presentations for Sanket Verma's talks: SciPy 2023 and PyCon DE 2023
Life sciences
Zarr has seen great adoption in the life sciences domain.
- bdz: Zarr-based format for storing quantitative biosystems dynamics data
- ome-zarr-py: Implementation of next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
- ez_zarr: Easy, high-level access to OME-Zarr filesets
- hdmf-zarr: Zarr I/O backend for HDMF
Talks and resources
- Zarr | Life Science Lightning Talk | Trevor Manz | Dask Summit 2021
- Accelerating Single-cell Bioinformatics with N-dimensional Arrays in the Cloud | ISMMS
- What are next-generation file formats (NGFF)?
Visualization
Most of the visualization work for Zarr has come from the bioimaging community:
- List: Image viewers with OME-Zarr support
- WEBKNOSSOS: web-based visualization & annotation tool, supports OME-Zarr
- Napari: interactive viewer
- Vizarr: interactive viewer built using viv (OME-Zarr and OME-TIFF)
- Neuroglancer: WebGL-based viewer for volumetric data
- BigDataViewer
Topics
Zarr & other array data formats
For a general overview, see
Essentially all other common array data formats can be exposed as Zarr. See Kerchunk.
NetCDF & HDF5
Zarr, NetCDF, and HDF5 are three separate data formats that nonetheless relate to each other in multiple ways.
- Zarr inherits its hierarchical structure from HDF5.
- Zarr is commonly accessed through xarray, whose data model is based on the NetCDF data model.
- NetCDF4 can use HDF5 as a backend
- NCZarr is an extension of the Zarr format to map it to a subset of the NetCDF data model.
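The hierarchical structure mentioned above can be seen directly in the keys of a store. The sketch below assumes the Zarr V2 layout, where a `.zgroup` key marks a group and a `.zarray` key marks an array, much like groups and datasets in HDF5. The store keys and the `describe_hierarchy` helper are hypothetical and only illustrate the idea.

```python
def describe_hierarchy(keys):
    """Split the keys of a Zarr V2 store into group paths and array paths."""
    groups, arrays = set(), set()
    for key in keys:
        parent, _, name = key.rpartition("/")
        if name == ".zgroup":
            groups.add("/" + parent)
        elif name == ".zarray":
            arrays.add("/" + parent)
    return sorted(groups), sorted(arrays)


# Hypothetical keys of a small store with one group and one array
store_keys = [
    ".zgroup",
    "climate/.zgroup",
    "climate/temperature/.zarray",
    "climate/temperature/0.0",
    "climate/temperature/0.1",
]

groups, arrays = describe_hierarchy(store_keys)
print("groups:", groups)
print("arrays:", arrays)
```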
Resources
- A Comparison of HDF5, Zarr, and netCDF4 in Performing Common I/O Operations HDF5
- Pangeo: HDF5 at the speed of Zarr
- Joe Jevnik: Zarr vs. HDF5 | PyData New York 2019
COG: Cloud-Optimized GeoTIFF
N5
Zarr and N5 are two similar array data formats that share common goals and development.
The Zarr V3 spec aims to provide a common implementation target (sources: 1, 2).
Links
- n5
- zarr.n5
- z5: C++ and Python interface for datasets in zarr and n5 format
- Zarr N5 spec diff (zarr-specs#3)
GeoZarr
GeoZarr is a proposed Zarr-based geospatial data format that is being submitted as an OGC standard.
GeoZarr will define a metadata convention for Zarr stores that contain geospatial data.
It will also define the relationship of Zarr with CF and NetCDF.
Links
Zarr & STAC
STAC provides a common structure for describing and cataloging spatiotemporal assets.
With its hierarchical structure and key-value metadata support, Zarr's capabilities overlap significantly with STAC.
The communities have not yet converged on a canonical representation of Zarr datasets through STAC.
Today, a good example of exposing Zarr in STAC is Planetary Computer:
- Reading Zarr Data
- STAC collection: Daymet Annual North America
- STAC collection: CIL Global Downscaled Projections for Climate Impacts Research
- xstac: STAC from xarray
- Related STAC extensions: xarray-assets, datacube
More discussion & Related links
- Pangeo: Metadata duplication on STAC zarr collections
- geozarr-spec#32: Integration of Zarr with STAC Catalogs
- stac-spec#781: Zarr Extension?
- Tom Augspurger: STAC and Kerchunk
- Presentation | Daniel Jahn – STAC vs Zarr
- Arraylake: a data lake platform that is arguably the first example of a pure Zarr data catalog
In the future, the Zarr V3 Spec and GeoZarr convention will likely enable greater interoperability between STAC and Zarr.