Zarr 3 (and GeoZarr?)
Feature description
I don't see an issue to track Zarr 3 for GDAL. I'm new to Zarr, so I might be missing obvious things. Is there any planned work on Zarr 3? And how does GeoZarr relate, and where does the OGC work on GeoZarr stand?
2025-Jan-09:
- GitHub release: https://github.com/zarr-developers/zarr-python/releases/tag/v3.0.0
- Release Announcement: https://zarr.dev/blog/zarr-python-3-release/
And GeoZarr is confusing (https://github.com/zarr-developers/geozarr-spec):
> Update 2023-01-20 - This spec is transitioning to community-based development in anticipation of submission as an OGC standard. It has been moved to the zarr-developers GitHub organization. Brianna Pagan (@briannapagan) has volunteered to coordinate the community development process.
From https://gdal.org/en/stable/drivers/raster/zarr.html:
> The current implementation of Zarr V3 before GDAL 3.8 is incompatible with the latest evolutions of the Zarr V3 specification. GDAL 3.8 is compatible with the Zarr V3 specification at date 2023-May-7, and is not interoperable with Zarr V3 datasets produced by earlier GDAL versions.
Additional context
No response
GeoZarr Spec Steering Working Group meeting notes:
https://hackmd.io/t2DWpX1iQEWMKx1Fi4Px7A
> I don't see an issue to track Zarr 3 for GDAL
There's a PR pending in https://github.com/OSGeo/gdal/pull/11787 to catch up with the latest developments.
> And how does GeoZarr relate, and where does the OGC work on GeoZarr stand?
I've also stopped following GeoZarr progress. This whole ecosystem is confusing, with other parallel initiatives such as Kerchunk / VirtualiZarr that try to do "Zarr things" while still reusing netCDF/HDF5 files.
Adding @mdsumner in CC
I've had good chats with GeoZarr folks and Earthmover folks, specifically @briannapagan and @rabernat. My take, and I think they agree, is that xarray PR 9543 is the crucial one: it needs to be adopted as a very general lazy coordinate facility in xarray, and then be supported by Zarr itself. GeoZarr is conceptually very simple (add a CRS and transform for the first 2 dims), but that's not very interesting if it's somehow confined to "geo". I don't know about the prospects for xarray features to roll into Zarr *independently of Python implementations*, or whether other things like Kerchunk/VirtualiZarr will do so, but I think it's very important that they do.
I'm also not expert enough on the Python scene or on multidim GDAL, but I'm working very hard on that.
Thanks for the feedback and sorry @rouault that you've encountered confusion. We've been working to try to find a standard that works for everyone, but the discussion has become a bit of a rabbit hole. At the crux of this is the question of NetCDF / CF explicit coordinate variables vs. the simpler GDAL raster data model with implicit coordinates based on geotransform. We haven't [yet] been able to converge on a single "right" way to do coordinates in GeoZarr. While GDAL supports both ways, the NetCDF / CF ecosystem doesn't seem to have a great answer on the "implicit coordinates" side. We would honestly welcome some input from experts such as yourself about how to move forward.
> with other parallel initiatives such as Kerchunk / VirtualiZarr that try to do "Zarr things" while still reusing netCDF/HDF5 files.
You can think of these initiatives as orthogonal (not parallel). VirtualiZarr / Kerchunk is about retroactively "cloud optimizing" existing NetCDF / HDF5 files by producing an index which explicitly maps the location of every chunk within one (or more) files. Kind of like GDAL VRTs, but operating just at the raw array / chunk level. The main use case here is concatenating many individual level-3 NetCDF granules into a single virtual data cube. It says nothing about coordinates / metadata, although presumably data generated via this approach would conform to NetCDF conventions.
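For illustration, here is a minimal sketch of that workflow using VirtualiZarr (the file paths and the `time` dimension are hypothetical; the API follows the VirtualiZarr docs at time of writing and may shift between releases):

```python
# Sketch: build a virtual Zarr store over existing NetCDF granules,
# without copying any chunk data. Paths and the "time" dim are made up.
import xarray as xr
from virtualizarr import open_virtual_dataset

paths = ["granule_2023_01.nc", "granule_2023_02.nc"]  # hypothetical inputs
virtual_datasets = [open_virtual_dataset(p) for p in paths]

# Concatenate the per-file manifests into one virtual data cube.
combined = xr.concat(virtual_datasets, dim="time",
                     coords="minimal", compat="override")

# Serialize the chunk index (url/offset/length per chunk) as kerchunk JSON.
combined.virtualize.to_kerchunk("combined.json", format="json")
```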
> We would honestly welcome some input from experts such as yourself about how to move forward.
"at the beginning I knew I knew nothing, then I believed that I knew, now I know there is nothing to know" (according to ChatGPT it might be a "mix of Socratic philosophy and Zen Buddhist ideas, which challenge the nature of knowledge itself")
That being said, I can see two main orientations:
- you want to be able to fully replace netCDF usages with a 1-1 mapping. Then, easy: GeoZarr = adopting CF conventions unmodified (and potentially working with the group maintaining the CF conventions to add support for "implicit coordinates"). At least this will make the life of some implementers easier (probably just the software doing the conversion)
- or you target more restricted, but common, scenarios of regularly spaced grids and create conventions that just fit those. Something like:
"dimension_names": [ "t", "y", "x" ], # Standard Zarr V3
"attributes": {
"geozarr": {
"crs": { ... PROJJSON of WGS84 / UTM 31N ... },
"crs_axis_to_dimension_mapping": [ "x", "y" ] # meaning the first axis of the CRS maps to the "x" Zarr dimension
"dimension_values": {
"t": { "start": 1739020000, "step": 3600, "standard_name": "time", "units": "seconds since 1970-01-01" },
"y": { "start": 500000, "step": 10, "standard_name": "northing", "units": "metre" },
"x": { "start": 4550000, "step": 10, "standard_name": "easting", "units": "metre" },
}
}
}
You could certainly complicate things and allow irregularly spaced grids by having something like `"t": { "variable": "/path/to/t" }`, or perhaps inlined: `"t": { "values": [ 1739020000, 1739021000, 1739022500 ], "standard_name": "time", "units": "seconds since 1970-01-01" }`.
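To make the reading side concrete, here is a minimal sketch of how a consumer might turn those hypothetical `geozarr` attributes into a GDAL-style geotransform, assuming `start` refers to the center of the first cell (that semantic is not defined by the sketch above):

```python
# Sketch: derive a GDAL geotransform from the hypothetical "geozarr"
# attributes above. Assumes "start" is the center of the first cell.
dim_values = {
    "y": {"start": 4550000, "step": 10},
    "x": {"start": 500000, "step": 10},
}
x, y = dim_values["x"], dim_values["y"]
# GDAL geotransforms reference the outer corner of the top-left pixel,
# so shift each cell-center start by half a step.
geotransform = (
    x["start"] - x["step"] / 2.0,  # top-left X
    x["step"],                     # pixel width
    0.0,                           # row rotation
    y["start"] - y["step"] / 2.0,  # top-left Y
    0.0,                           # column rotation
    y["step"],                     # pixel height (negative for north-up data)
)
print(geotransform)
```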
For completeness, and for true standards believers, I should mention that OGC has invented its own georeferencing convention for netCDF in the GGXF standard, but I lack the objectivity to judge whether it is appropriate for use outside the scope of GGXF.
With Esri GGXF, I have a bit of the same problem as when I tried to read the HDF5 spec. Their specs are so complex, with too many features (some of them very specific), that it is impossible to keep up with them. Maybe that level of complexity is required by some users/companies, but for normal users it is too much.
I think what makes the Zarr spec attractive and special is that it is relatively easy to understand. Following the same philosophy, using something that we all already know (the CF conventions) plus a slightly stricter CRS definition (PROJJSON) plus a time convention would be ideal.
For anyone interested:
- CF paper: https://gmd.copernicus.org/articles/10/4619/2017/gmd-10-4619-2017.pdf
- GGXF: https://docs.ogc.org/is/22-051r7/22-051r7.pdf
Also, there is a Rust Zarr implementation (with Python bindings) from @LDeakin that already supports V3/V2 and in my experience works great:
- zarrs: https://github.com/LDeakin/zarrs
- zarrs[python]: https://github.com/ilan-gold/zarrs-python
We (Esri) believe a 1:1 mapping is necessary, as many users convert existing CF-1/netCDF datasets to Zarr. ArcGIS allows CF conventions in Zarr and will process grid_mapping, coordinates, spacing, etc., as we do for netCDF.
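For readers less familiar with that pattern, here is roughly what CF-style georeferencing looks like when carried into Zarr attributes. The attribute names are standard CF grid_mapping attributes; the variable layout is a hypothetical sketch, not a statement of what ArcGIS requires:

```python
# Sketch of CF grid_mapping carried over into Zarr attributes: the data
# variable names a (typically scalar) grid_mapping variable holding the
# CRS parameters, and explicit 1-D coordinate variables supply y/x.
data_var_attrs = {
    "_ARRAY_DIMENSIONS": ["y", "x"],  # xarray/Zarr V2 dimension labels
    "grid_mapping": "crs",            # points at the CRS variable below
    "coordinates": "y x",
}
crs_var_attrs = {
    "grid_mapping_name": "transverse_mercator",
    "longitude_of_central_meridian": 3.0,
    "latitude_of_projection_origin": 0.0,
    "scale_factor_at_central_meridian": 0.9996,
    "false_easting": 500000.0,
    "false_northing": 0.0,
}
```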
I guess we could reuse the old DODS driver for pulling in virtualization manifests? Just an off-the-cuff idea as I think of it. The real issue is this community being bound to one language; we need some kind of assurance that the Zarr standard wants to include the virtualization schemes in JSON and Parquet (and anything else? DMR++?).
That's why I settled on the Rust library zarrs; it seems to have the most up-to-date and comprehensive support. But capability-wise, I'm focusing on GDAL multidim for now so I can navigate the scene better, especially from R.
I was wrong, GDAL does work with the virtualization:
"ZARR:\"/vsicurl/https://projects.pawsey.org.au/vzarr/oisst-avhrr-v02r01.parquet\""
I'll try to unpick the mistakes I stated above, I'm working on examples.
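For anyone wanting to reproduce that, a minimal sketch using GDAL's Python bindings and the multidim API (assumes a GDAL build with the Zarr driver and network access):

```python
# Sketch: open a kerchunk/VirtualiZarr parquet reference set through
# GDAL's Zarr driver and list the arrays it exposes.
from osgeo import gdal

gdal.UseExceptions()
name = 'ZARR:"/vsicurl/https://projects.pawsey.org.au/vzarr/oisst-avhrr-v02r01.parquet"'
ds = gdal.OpenEx(name, gdal.OF_MULTIDIM_RASTER)
root = ds.GetRootGroup()
print(root.GetMDArrayNames())
```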
I've been told that GDAL should establish the one best way to encode transform and CRS in Zarr ... and similar machinations played out with GeoTIFF in previous years. I'm not really sure what to do with that; I personally think it's the Zarr specification writers who have this responsibility, and being held up as a "format of the future" carries a pretty significant responsibility to also ensure it can represent existing formats.
But there's a fairly clear adoption of STAC-like representations; i.e. all of the ESA/EOPF examples here use this form. Specifically, this file:
https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip
contains
`S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr/quality/l1c_quicklook/r10m/tci/.zattrs`
which includes:
```json
{
    "_ARRAY_DIMENSIONS": [
        "band",
        "y",
        "x"
    ],
    "long_name": "TCI: True Color Image",
    "proj:bbox": [
        300000.0,
        4490220.0,
        409800.0,
        4600020.0
    ],
    "proj:epsg": 32636,
    "proj:shape": [
        10980,
        10980
    ],
    "proj:transform": [
        10.0,
        0.0,
        300000.0,
        0.0,
        -10.0,
        4600020.0,
        0.0,
        0.0,
        1.0
    ],
    "proj:wkt2": "PROJCS[\"WGS 84 / UTM zone 36N\",GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4326\"]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",0],PARAMETER[\"central_meridian\",33],PARAMETER[\"scale_factor\",0.9996],PARAMETER[\"false_easting\",500000],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH],AUTHORITY[\"EPSG\",\"32636\"]]"
}
```
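As a side note on that encoding, the 3x3 row-major proj:transform maps (col, row) to (x, y), so translating it into a GDAL geotransform is just a reordering; a quick sketch:

```python
# Sketch: reorder the STAC proj:transform above into a GDAL geotransform.
# proj:transform is row-major: x = a*col + b*row + c; y = d*col + e*row + f.
a, b, c, d, e, f = [10.0, 0.0, 300000.0, 0.0, -10.0, 4600020.0]
# GDAL stores the same six coefficients as (c, a, b, f, d, e).
geotransform = (c, a, b, f, d, e)
print(geotransform)  # (300000.0, 10.0, 0.0, 4600020.0, 0.0, -10.0)
```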
I personally find GeoZarr a distraction in this space, because this so clearly belongs in the core specification for a format: all data, including coordinates, needs metadata about what it is (it can be empty, but non-existent is very poor imo), and being able to declare up front that a particular set of axes is regular is obviously important. You can still support legacy systems with degenerate coordinate arrays while having a transform for the common regular case.
I don't have the expertise to drive what should happen here, but this seems like a de facto standard emerging, and GDAL could support it.
FWIW, there's a Zarr/EOPF discussion planned for 27 March:
https://us02web.zoom.us/webinar/register/WN_MzGnEYQKSc6zGYOchzCznQ#/registration
and apparently one for Zarr+STAC on 17 April, which I'll share when I find details.
Just discovered this is an actual driver in development:
https://github.com/EOPF-Sample-Service/GDAL-ZARR-EOPF
Worth exploring!
> https://github.com/EOPF-Sample-Service/GDAL-ZARR-EOPF
At the time of writing, https://github.com/EOPF-Sample-Service/GDAL-ZARR-EOPF/tree/main/src is mostly a skeleton.
I'll PR in the CRS and geotransform handling; it looks like this will be a de facto standard. I'll also push harder at the Zarr spec itself to capture this.
I was just wondering about this today: if you had some reasonable number of terabytes of generic rasters, how could you start approaching this sort of Icechunk-esque speedup without data copying/transformation?
VirtualiZarr is the way; you don't necessarily need Icechunk. Check out Tom's in-dev open_virtual_mfdataset, which is the xarray analog of open_mfdataset. I'm not clear yet on how specific the format is to the Icechunk Rust implementation, or whether any of this is considered important to write into a specification.
Oh, and multidim VRT is also perfectly capable; you still use the format library, but it's an important bridge imo:
https://github.com/mdsumner/pymdim/blob/main/inst/examples/create_vrt_mdim.R (the VRT is beside the script here; my plan is to make gdalmdimtranslate able to take multiple sources as input and resolve them to a VRT like this)
Sorry if it is a bit off topic.
> I was just wondering about this today: if you had some reasonable number of terabytes of generic rasters, how could you start approaching this sort of Icechunk-esque speedup without data copying/transformation?
Just do aggressive caching. Kerchunk or VirtualiZarr expose the offset/length of the internal chunks, and create "virtual" Zarr stores using the familiar and friendly xarray syntax. The problem is that they need to maintain readers for each format; currently, both support netCDF4, netCDF3, and TIFF (experimental).
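To make the offset/length point concrete, this is roughly the shape of a kerchunk v1 reference set (the bucket, array name, and byte ranges here are made up):

```python
# Sketch of a kerchunk v1 reference set: metadata keys hold inline JSON,
# while each chunk key maps to [url, byte_offset, byte_length] pointing
# into the original (e.g. netCDF/HDF5) file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "sst/.zarray": (
            '{"shape": [720, 1440], "chunks": [180, 360], "dtype": "<f4",'
            ' "compressor": null, "fill_value": null, "filters": null,'
            ' "order": "C", "zarr_format": 2}'
        ),
        "sst/0.0": ["s3://some-bucket/granule.nc", 8192, 65536],
        "sst/0.1": ["s3://some-bucket/granule.nc", 73728, 65536],
    },
}
```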
A similar approach can be applied with GeoParquet & COG, as demonstrated by rasteret. My impression is that STAC-GeoParquet could offer a more interoperable solution, as each row in a GeoParquet file represents a raster/cube (STAC Item) with multiple assets, each of which points to different chunks within the raster/cube.
That said, this looks to me more closely related to data providers' access optimization rather than something that could be fully addressed within GDAL. For instance, in the Sentinel-2 L1C Google bucket:
> To help locate data of interest, an index CSV file of the sentinel-2 data is available. This CSV file lists the available granules, their acquisition dates, and their spatial extent as minimum and maximum latitudes and longitudes. The file is found in the Sentinel-2 Cloud Storage bucket: gs://gcp-public-data-sentinel-2/index.csv.gz
> VirtualiZarr is the way; you don't necessarily need Icechunk. Check out Tom's in-dev open_virtual_mfdataset, which is the xarray analog of open_mfdataset. I'm not clear yet on how specific the format is to the Icechunk Rust implementation, or whether any of this is considered important to write into a specification.
Right, have you tried recent updates? I did look at Kerchunk and GeoTIFF files a couple of years ago.
> VirtualiZarr is the way; you don't necessarily need Icechunk.
If you want to persist the indexes you generate from VirtualiZarr, you need some way to serialize them to disk. Icechunk was purpose-built for this.
I don't think I would want to generate millions of references multiple times anyway.
> VirtualiZarr is the way; you don't necessarily need Icechunk. Check out Tom's in-dev open_virtual_mfdataset, which is the xarray analog of open_mfdataset. I'm not clear yet on how specific the format is to the Icechunk Rust implementation, or whether any of this is considered important to write into a specification.
I saw a few PRs (open / closed / in progress etc.). Is there one in particular you are working with, @mdsumner, e.g. one with some TIFF tests recently? I am happy to try, but if someone has started, it's good to borrow theirs first!
https://github.com/zarr-developers/VirtualiZarr/issues/526
I have never managed this with a TIFF; I'm still confused about how to get that to work. I was recently using the latest dev version of VirtualiZarr with netCDF:
https://github.com/mdsumner/pawsey-aad/issues/5
Previously I was using open_virtual_dataset with a dask concat here:
https://github.com/mdsumner/pawsey-aad/blob/main/acacia/oisst-virtualize.py
I actually find it hard to find any examples that just generate and serialize references, so I'd appreciate any shares here.
It is fast; terra in R is really fast with GDAL, and a simple comparison (doing my own range read with readBin and memDecompress in R) was easily 4x as fast at reading simple nc files. (But I'm still caught in workflow complications, so there's not much to share yet; I'm particularly interested in shortcutting GDAL Zarr to these references so we can use the full API.)
I just tried this example, pretty much:
https://projectpythia.org/kerchunk-cookbook/notebooks/generating_references/GeoTIFF.html
With my own private data, e.g. anon is false, providing key and secret.
The s3fs stuff is fine as you would expect.
I haven't delved into the VirtualiZarr code to find out where it gets confused in HDF5 / TIFF-land.
The recording of the GeoZarr SWG meeting (https://fathom.video/share/XA84x7SU9-KwNo8gsN8muti1sQhw4y31) and full transcript were announced in https://github.com/zarr-developers/geozarr-spec/issues/62#event-17115544959.
(Sadly, I couldn't attend.)
Small feedback on the EOPF/Zarr GeoZarr format:
This description:

```
ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/quality/l1c_quicklook/r10m/tci
```
gets the right georeferencing via gdal raster info, because here GuessGeoTransform() analyses the X,Y arrays and detects that it's a (degenerate) rectilinear description:
https://github.com/OSGeo/gdal/blob/f401ea86157d915e5f178bd411031b472250ffd5/gcore/gdalmultidim.cpp#L7558C26-L7558C45
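The idea behind that detection, as a minimal sketch (not GDAL's actual code): if a 1-D coordinate array has (near-)constant spacing, it collapses to start + index * step, and a geotransform can be synthesized from it.

```python
import numpy as np

def looks_regular(coords, rtol=1e-5):
    # A 1-D coordinate array is "degenerate rectilinear" if its spacing
    # is (near-)constant, so it can be replaced by start + index * step.
    steps = np.diff(coords)
    return bool(np.allclose(steps, steps[0], rtol=rtol))

x = 300005.0 + 10.0 * np.arange(10980)  # hypothetical cell-center eastings
print(looks_regular(x))  # True
```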
so we check what's in the .zmetadata:

```
gdal vsi copy /vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip/.zmetadata my.zmetadata
```
and see that everything we need is there:

```
grep "quality/l1c_quicklook/r10m/tci/.zattrs" my.zmetadata -A 30
```
"quality/l1c_quicklook/r10m/tci/.zattrs": {
"_ARRAY_DIMENSIONS": [
"band",
"y",
"x"
],
"long_name": "TCI: True Color Image",
"proj:bbox": [
300000.0,
4490220.0,
409800.0,
4600020.0
],
"proj:epsg": 32636,
"proj:shape": [
10980,
10980
],
"proj:transform": [
10.0,
0.0,
300000.0,
0.0,
-10.0,
4600020.0,
0.0,
0.0,
1.0
],
"proj:wkt2": "PROJCS[\"WGS 84 / UTM zone 36N\",GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4326\"]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",0],PARAMETER[\"central_meridian\",33],PARAMETER[\"scale_factor\",0.9996],PARAMETER[\"false_easting\",500000],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH],AUTHORITY[\"EPSG\",\"32636\"]]"
},
I appreciate that it's a bit of a disaster that the Zarr spec (or any spec) doesn't appear to declare this, and also it's entirely weird to have this compact form of georeferencing while also having the legacy-style X,Y degenerate rectilinear arrays. This is not a GDAL failure, but I would like to get this supported as a more explicit and clear way of georeferencing Zarr.
> I appreciate that it's a bit of a disaster that the Zarr spec (or any spec) doesn't appear to declare this, and also it's entirely weird to have this compact form of georeferencing while also having the legacy-style X,Y degenerate rectilinear arrays. This is not a GDAL failure, but I would like to get this supported as a more explicit and clear way of georeferencing Zarr.
cf the mailing list exchange about a future EOPF driver to accommodate this: https://lists.osgeo.org/pipermail/gdal-dev/2025-April/thread.html#60414
Some interesting tidbits:
- A take on CF/tif/Zarr harmonization: https://github.com/zarr-developers/geozarr-spec/pull/67
- Response to the PR for EOPF-aligned CRS+transform (which I consider a todo for me): https://github.com/EOPF-Sample-Service/GDAL-ZARR-EOPF/issues/20#issuecomment-2890251902