
Zarr 3 (and geozarr?)

schwehr opened this issue 10 months ago • 27 comments

Feature description

I don't see an issue to track Zarr 3 for GDAL. I'm new to Zarr, so I might be missing obvious things. Is there any planned work on Zarr 3? And how does GeoZarr relate, and where is the OGC work on GeoZarr?

2025-Jan-09:

  • GitHub release: https://github.com/zarr-developers/zarr-python/releases/tag/v3.0.0
  • Release Announcement: https://zarr.dev/blog/zarr-python-3-release/

And GeoZarr is confusing (https://github.com/zarr-developers/geozarr-spec):

Update 2023-01-20 - This spec is transitioning to community-based development in anticipation of submission as an OGC standard. It has been moved to the zarr-developers GitHub organization. Brianna Pagan (@briannapagan) has volunteered to coordinate the community development process.

From https://gdal.org/en/stable/drivers/raster/zarr.html:

The current implementation of Zarr V3 before GDAL 3.8 is incompatible with the latest evolutions of the Zarr V3 specification. GDAL 3.8 is compatible with the Zarr V3 specification at date 2023-May-7, and is not interoperable with Zarr V3 datasets produced by earlier GDAL versions.

Additional context

No response

schwehr avatar Feb 08 '25 01:02 schwehr

GeoZarr Spec Steering Working Group meeting notes:

https://hackmd.io/t2DWpX1iQEWMKx1Fi4Px7A

schwehr avatar Feb 08 '25 01:02 schwehr

I don't see an issue to track Zarr 3 for GDAL

There's a PR pending in https://github.com/OSGeo/gdal/pull/11787 to catch up with the latest developments

And how does GeoZarr relate and where is the OGC work on GeoZarr?

I've also stopped following GeoZarr progress. This whole ecosystem is confusing, with other parallel initiatives such as Kerchunk / VirtualiZarr that try to do "zarr things" while still re-using netCDF/HDF5 files.

Adding @mdsumner in CC

rouault avatar Feb 08 '25 01:02 rouault

I've had good chats with GeoZarr folks and earthmover folks, specifically @briannapagan and @rabernat. My take, and I think they agree, is that xarray PR 9543 is the crucial one: it needs to be adopted as a very general lazy coordinate facility in xarray, and then be supported by Zarr itself. GeoZarr is conceptually very simple, add a crs and transform for the first 2 dims - but that's not very interesting if it's somehow confined to "geo". I don't know about prospects for xarray features to roll into Zarr "independently of python implementations", and whether other things like kerchunk/VirtualiZarr will do so, but I think it's very important that they do.

I'm also not expert enough on the python scene and on multidim gdal, but I'm working very hard on that. 🙏

mdsumner avatar Feb 08 '25 03:02 mdsumner

Thanks for the feedback and sorry @rouault that you've encountered confusion. We've been working to try to find a standard that works for everyone, but the discussion has become a bit of a rabbit hole. At the crux of this is the question of NetCDF / CF explicit coordinate variables vs. the simpler GDAL raster data model with implicit coordinates based on geotransform. We haven't [yet] been able to converge on a single "right" way to do coordinates in GeoZarr. While GDAL supports both ways, the NetCDF / CF ecosystem doesn't seem to have a great answer on the "implicit coordinates" side. We would honestly welcome some input from experts such as yourself about how to move forward.

with other parallel initiatives such as Kerchunk / VirtualiZarr that tries to do "zarr things" but still re-using netCDF/HDF5 files.

You can think of these initiatives as orthogonal (not parallel). VirtualiZarr / Kerchunk is about retroactively "cloud optimizing" existing NetCDF / HDF5 files by producing an index which explicitly maps the location of every chunk within one (or more) files. Kind of like GDAL VRTs, but operating just at the raw array / chunk level. The main use case here is concatenating many individual level-3 NetCDF granules into a single virtual data cube. It says nothing about coordinates / metadata, although presumably data generated via this approach would conform to NetCDF conventions.
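As a rough illustration of the "index which explicitly maps the location of every chunk" idea, a Kerchunk-style reference set is essentially a JSON mapping from Zarr chunk keys to (url, offset, length) triples. The file names and byte ranges below are made up for the sketch:

```python
import json

# Hypothetical Kerchunk-style "virtual" Zarr store: instead of storing chunk
# bytes, each chunk key maps to [url, byte_offset, byte_length] inside an
# existing netCDF/HDF5 file. All names and numbers here are illustrative.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "sst/.zarray": json.dumps({
            "shape": [2, 720, 1440], "chunks": [1, 720, 1440],
            "dtype": "<f4", "compressor": None, "fill_value": None,
            "order": "C", "zarr_format": 2, "filters": None,
        }),
        # Two time steps, each living in a different source granule
        "sst/0.0.0": ["s3://bucket/sst_day1.nc", 8192, 4147200],
        "sst/1.0.0": ["s3://bucket/sst_day2.nc", 8192, 4147200],
    },
}

def chunk_byte_range(refs, key):
    """Resolve a chunk key to its (url, offset, length) in the source file."""
    url, offset, length = refs["refs"][key]
    return url, offset, length

print(chunk_byte_range(refs, "sst/1.0.0"))
# → ('s3://bucket/sst_day2.nc', 8192, 4147200)
```

A reader then issues a plain HTTP/S3 range read for those bytes, which is what makes this "kind of like GDAL VRTs, but at the raw array / chunk level".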

rabernat avatar Feb 08 '25 04:02 rabernat

We would honestly welcome some input from experts such as yourself about how to move forward.

"at the beginning I knew I knew nothing, then I believed that I knew, now I know there is nothing to know" (according to ChatGPT it might be a "mix of Socratic philosophy and Zen Buddhist ideas, which challenge the nature of knowledge itself")

That being said, I can see 2 main orientations:

  • you want to be able to fully replace netCDF usages with a 1-1 mapping. Then, easy, GeoZarr = adopting CF conventions unmodified (and potentially work with the group maintaining CF conventions to add support for "implicit coordinates"). At least this will make the life of some implementers easier (probably just the software doing the conversion)

  • or you target more restricted, but common, scenarios of regularly spaced grids and create conventions that just fit those. Something like

"dimension_names": [ "t", "y", "x" ],      #  Standard Zarr V3
"attributes": {
    "geozarr": {
        "crs": { ... PROJJSON of WGS84 / UTM 31N ... },
        "crs_axis_to_dimension_mapping": [ "x", "y" ]     # meaning the first axis of the CRS maps to the "x" Zarr dimension
        "dimension_values": {
            "t": { "start": 1739020000, "step": 3600, "standard_name": "time", "units": "seconds since 1970-01-01" },
            "y": { "start": 500000, "step": 10, "standard_name": "northing", "units": "metre" },
            "x": { "start": 4550000, "step": 10, "standard_name": "easting", "units": "metre" },
        }
    }
}

You could certainly complicate things and allow irregularly spaced grids by having something like "t": { "variable": "/path/to/t" }, or perhaps inlined: "t": { "values": [ 1739020000, 1739021000, 1739022500 ], "standard_name": "time", "units": "seconds since 1970-01-01" }
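To make the "implicit coordinates" idea concrete, a reader could materialize coordinate values from start/step on demand. A minimal sketch, assuming the hypothetical attribute layout proposed above (this is a sketch of the proposal, not an adopted spec):

```python
def materialize_dimension(dim_attrs, size):
    """Expand implicit {start, step} dimension metadata into explicit values.

    Falls back to inlined "values" when present, mirroring the irregular-grid
    variant suggested above. Attribute names follow the hypothetical sketch.
    """
    if "values" in dim_attrs:
        return list(dim_attrs["values"])
    start, step = dim_attrs["start"], dim_attrs["step"]
    return [start + i * step for i in range(size)]

# Regular y axis: 10 m spacing starting at northing 500000
y = materialize_dimension({"start": 500000, "step": 10}, size=4)
print(y)  # → [500000, 500010, 500020, 500030]

# Irregular time axis, inlined
t = materialize_dimension({"values": [1739020000, 1739021000, 1739022500]}, size=3)
print(t)  # → [1739020000, 1739021000, 1739022500]
```

The attraction of this form is that the common regular-grid case costs a handful of attributes rather than a full degenerate coordinate array per dimension.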

For completeness, and for true standard believers, I should mention that OGC has invented its own georeferencing convention for netCDF in the GGXF standard, but I lack objectivity to appreciate if it is appropriate or not for use outside the scope of GGXF.

rouault avatar Feb 08 '25 13:02 rouault

With Esri GGXF, I have a bit of the same problem as when I tried to read the HDF5 spec. Their specs are so complex, with too many features (some of them very specific), that it is impossible to keep up with them. Maybe that level of complexity is required by some users/companies, but for normal users 🙋 it is too much.

I think what makes Zarr-spec attractive and special is that its spec is relatively easy to understand. Following the same philosophy, using something that we all already know (CF convention) + a bit stricter CRS definition (PROJJSON) + time convention would be ideal.

For someone interested:

  • CF paper: https://gmd.copernicus.org/articles/10/4619/2017/gmd-10-4619-2017.pdf
  • GGXF: https://docs.ogc.org/is/22-051r7/22-051r7.pdf

Also, there is a Python/Rust Zarr implementation from @LDeakin that already supports V3/V2 and in my experience works fantastic.

  • zarrs - https://github.com/LDeakin/zarrs
  • zarrs[python] - https://github.com/ilan-gold/zarrs-python

csaybar avatar Feb 08 '25 16:02 csaybar

We (Esri) believe a 1:1 mapping is necessary as many users convert existing CF-1/netCDF datasets to Zarr. ArcGIS allows CF conventions in Zarr and will process grid_mapping, coordinates, spacing, etc., as we do for netCDF.

rprinceley avatar Feb 11 '25 07:02 rprinceley

I guess we could reuse the old DODS driver for pulling in virtualization manifests? Just an off-the-cuff idea as I think of it. The real issue is this community being bound to one language; we need some kind of assurance that the Zarr standard wants to include the virtualization schemes in JSON and Parquet (and anything else? DMR++?)

That's why I settled on the Rust library zarrs; it seems to have the most up-to-date and comprehensive support. But, capability-wise, I'm focusing on GDAL multidim for now so I can navigate the scene better, especially from R.

mdsumner avatar Feb 11 '25 23:02 mdsumner

I was wrong, GDAL does work with the virtualization:

"ZARR:\"/vsicurl/https://projects.pawsey.org.au/vzarr/oisst-avhrr-v02r01.parquet\"" 

I'll try to unpick the mistakes I stated above, I'm working on examples.

mdsumner avatar Feb 19 '25 03:02 mdsumner

I've been told that GDAL should establish the one best way to encode transform and crs in Zarr ... and similar machinations played out with geotiff in previous years. I'm not really sure what to do with that; I personally think it's the Zarr specification writers that have this responsibility, and being held up as a "format of the future" carries a pretty significant responsibility to also ensure it's able to represent existing formats.

But, there's a fairly clear adoption of STAC-like representations, i.e. all of the ESA/eopf examples here use this form, specifically this file:

https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip

contains

S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr/quality/l1c_quicklook/r10m/tci/.zattrs

which includes:

{
    "_ARRAY_DIMENSIONS": [
        "band",
        "y",
        "x"
    ],
    "long_name": "TCI: True Color Image",
    "proj:bbox": [
        300000.0,
        4490220.0,
        409800.0,
        4600020.0
    ],
    "proj:epsg": 32636,
    "proj:shape": [
        10980,
        10980
    ],
    "proj:transform": [
        10.0,
        0.0,
        300000.0,
        0.0,
        -10.0,
        4600020.0,
        0.0,
        0.0,
        1.0
    ],
    "proj:wkt2": "PROJCS[\"WGS 84 / UTM zone 36N\",GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4326\"]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",0],PARAMETER[\"central_meridian\",33],PARAMETER[\"scale_factor\",0.9996],PARAMETER[\"false_easting\",500000],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH],AUTHORITY[\"EPSG\",\"32636\"]]"
}
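Note that proj:transform above is the affine in the row-major order used by rasterio/affine (a, b, c, d, e, f, 0, 0, 1), while GDAL's geotransform orders the same six numbers differently. A small sketch of the reshuffle (just the index mapping, not GDAL code):

```python
def proj_transform_to_gdal_gt(t):
    """Convert a STAC proj:transform (row-major affine a, b, c, d, e, f[, 0, 0, 1],
    where x = a*col + b*row + c and y = d*col + e*row + f) into a GDAL
    geotransform (x_origin, x_res, row_rotation, y_origin, col_rotation, y_res)."""
    a, b, c, d, e, f = t[:6]
    return (c, a, b, f, d, e)

# Values from the TCI .zattrs above: 10 m pixels, top-left at (300000, 4600020)
gt = proj_transform_to_gdal_gt([10.0, 0.0, 300000.0, 0.0, -10.0, 4600020.0, 0.0, 0.0, 1.0])
print(gt)  # → (300000.0, 10.0, 0.0, 4600020.0, 0.0, -10.0)
```

Both conventions place the origin at the outer corner of the top-left pixel, so no half-pixel shift is needed, only the reordering.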

I personally find GeoZarr a distraction in this space, because this so clearly belongs in the core specification for a format: all data, including coordinates, needs metadata about what it is (it can be empty, but non-existent is very poor imo), and being able to declare upfront that a particular set of axes is regular is obviously important. You can still support legacy systems with degenerate coordinate arrays while having a transform for the common regular case.

I don't have the expertise to drive what should happen here, but this seems like a de facto standard emerging, and GDAL could support it.

FWIW, there's a Zarr/eopf discussion planned for 27 March:

https://us02web.zoom.us/webinar/register/WN_MzGnEYQKSc6zGYOchzCznQ#/registration

and apparently one for Zarr+STAC on 17 April which I'll share when I find details.

mdsumner avatar Mar 16 '25 23:03 mdsumner

Just discovered this is an actual driver in development:

https://github.com/EOPF-Sample-Service/GDAL-ZARR-EOPF

Worth exploring 🙏

mdsumner avatar Mar 28 '25 03:03 mdsumner

https://github.com/EOPF-Sample-Service/GDAL-ZARR-EOPF

At time of writing https://github.com/EOPF-Sample-Service/GDAL-ZARR-EOPF/tree/main/src is mostly a skeleton

rouault avatar Mar 28 '25 13:03 rouault

I'll PR in the crs and gt handling; it looks like this will be a de facto standard. I'll also push harder at the Zarr spec itself to capture this.

mdsumner avatar Mar 28 '25 22:03 mdsumner

I was just wondering about this today - if you had some reasonable number of terabytes of generic rasters, how can you start approaching this sort of icechunk-esque speedup without data copy/transformation?

RichardScottOZ avatar Mar 30 '25 08:03 RichardScottOZ

VirtualiZarr is the way; you don't necessarily need Icechunk. Check out Tom's in-dev open_virtual_mfdataset, which is the xarray analog of open_mfdataset. I'm not clear yet on how specific the format is to the Icechunk Rust implementation, or whether any of this is considered important to write into a specification.

mdsumner avatar Mar 30 '25 09:03 mdsumner

Oh, and multidim VRT is also perfectly capable: you still use the format library, but it's close, and an important bridge imo

https://github.com/mdsumner/pymdim/blob/main/inst/examples/create_vrt_mdim.R (the VRT is beside the script here; my plan is to make gdalmdimtranslate able to input multiple sources and resolve them to VRT like this)

mdsumner avatar Mar 30 '25 09:03 mdsumner

Sorry if it is a bit off topic.

I was just wondering about this today - if you had some reasonable number of terabytes of generic rasters, how can you start approaching this sort of icechunk-esque speedup without data copy/transformation?

Just do aggressive caching. Kerchunk and VirtualiZarr expose the offset/length of the internal chunks, and create "virtual" Zarr stores using the familiar and friendly xarray syntax. The problem is that they need to maintain readers for each format; currently, both support netCDF4, netCDF3, and TIFF (experimental).

A similar approach can be applied with GeoParquet & COG, as demonstrated by rasteret. My impression is that STAC-GeoParquet could offer a more interoperable solution, as each row in a GeoParquet file represents a raster/cube (a STAC Item) with multiple assets, each a pointer to different chunks within the raster/cube.
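The STAC-GeoParquet layout described above can be pictured as one record per STAC Item, with assets as href pointers into COGs (which are themselves internally tiled). All ids and URLs below are made-up illustrations:

```python
# Hypothetical STAC-GeoParquet content: one record per STAC Item, each asset
# an href into a COG. In a real file these would be Parquet columns; plain
# dicts are used here purely for illustration.
rows = [
    {
        "id": "S2_T3A5_20230629",
        "datetime": "2023-06-29T06:35:59Z",
        "bbox": [300000.0, 4490220.0, 409800.0, 4600020.0],
        "assets": {
            "B04": {"href": "s3://bucket/S2_T3A5_20230629_B04.tif"},
            "B08": {"href": "s3://bucket/S2_T3A5_20230629_B08.tif"},
        },
    },
]

def asset_hrefs(rows, band):
    """Collect the href of one asset/band across all items."""
    return [r["assets"][band]["href"] for r in rows if band in r["assets"]]

print(asset_hrefs(rows, "B04"))  # → ['s3://bucket/S2_T3A5_20230629_B04.tif']
```

A catalog query then reduces to a columnar scan over rows, and the actual pixel reads go straight to the referenced COG tiles.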

That said, this looks to me more closely related to data providers' access optimization rather than something that could be fully addressed within GDAL. For instance, in the Sentinel-2 L1C Google bucket:

To help locate data of interest, an index CSV file of the sentinel-2 data is available. This CSV file lists the available granules, their acquisition dates, and their spatial extent as minimum and maximum latitudes and longitudes. The file is found in the Sentinel-2 Cloud Storage bucket: gs://gcp-public-data-sentinel-2/index.csv.gz

csaybar avatar Mar 30 '25 11:03 csaybar

VirtualiZarr is the way, don't necessarily need Icechunk. Check out Tom's in-dev open_virtual_mfdataset which is the xarray analog for open_mfdataset. I'm not clear yet on how specific format is for the Icechunk Rust implementation or whether any of this is considered important to write into a specification.

Right - have you tried recent updates? Did look at kerchunk and geotiff files a couple of years ago.

RichardScottOZ avatar Apr 01 '25 19:04 RichardScottOZ

VirtualiZarr is the way, don't necessarily need Icechunk.

If you want to persist the indexes you generate from VirtualiZarr, you need some way to serialize to disk. Icechunk was purpose built for this.

rabernat avatar Apr 01 '25 20:04 rabernat

I don't think I would want to make millions of references multiple times anyway.

RichardScottOZ avatar Apr 02 '25 00:04 RichardScottOZ

VirtualiZarr is the way, don't necessarily need Icechunk. Check out Tom's in-dev open_virtual_mfdataset which is the xarray analog for open_mfdataset. I'm not clear yet on how specific format is for the Icechunk Rust implementation or whether any of this is considered important to write into a specification.

I saw a few PRs - open / closed / in progress, etc. Is there one in particular you are working with, @mdsumner - e.g. one with some tiff tests recently? I am happy to try, but if someone has started, it would be good to borrow theirs first!

https://github.com/zarr-developers/VirtualiZarr/issues/526

RichardScottOZ avatar Apr 02 '25 00:04 RichardScottOZ

I have never managed this with a tif; I'm still confused about how to get that to work. I was using the latest dev of VirtualiZarr recently with netCDF:

https://github.com/mdsumner/pawsey-aad/issues/5

Previously was using open_virtual_dataset with dask concat here:

https://github.com/mdsumner/pawsey-aad/blob/main/acacia/oisst-virtualize.py

I actually find it hard to find any examples that are just generating and serializing references, so I'd appreciate any shares here 🙏

It is fast; terra in R is really fast with GDAL, and in a simple comparison (doing my own range read with readBin and memDecompress in R) it was easily 4x as fast at reading simple nc files. But I'm still caught in workflow complications, so not much to share yet; I'm particularly interested in shortcutting GDAL Zarr to these references so we can use the full API.

mdsumner avatar Apr 02 '25 02:04 mdsumner

I just tried this example pretty much

https://projectpythia.org/kerchunk-cookbook/notebooks/generating_references/GeoTIFF.html

With my own private data, e.g. with anon set to false, providing key and secret.

The s3fs stuff is fine as you would expect.

I haven't delved into the virtualizarr code to find out where it is getting confused in hdf5 / tiff-land.

RichardScottOZ avatar Apr 02 '25 02:04 RichardScottOZ

recording of the GeoZARR SWG meeting: https://fathom.video/share/XA84x7SU9-KwNo8gsN8muti1sQhw4y31 and full transcript

It was announced in https://github.com/zarr-developers/geozarr-spec/issues/62#event-17115544959

(I couldn't attend sadly)

mdsumner avatar Apr 04 '25 01:04 mdsumner

Small feedback on the EOPF/Zarr geozarr format.

this description

"ZARR:\"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip\":/quality/l1c_quicklook/r10m/tci"

gets the right georeferencing via gdal raster info, because here GuessGeoTransform() analyses the X,Y arrays and detects that it's a (degenerate) rectilinear description

https://github.com/OSGeo/gdal/blob/f401ea86157d915e5f178bd411031b472250ffd5/gcore/gdalmultidim.cpp#L7558C26-L7558C45
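GDAL's GuessGeoTransform() is C++; a simplified sketch of the underlying idea, detecting that a 1-D coordinate array of cell centres is regularly spaced so that an origin and step describe it, might look like this (the real GDAL logic handles many more edge cases):

```python
def guess_origin_and_step(coords, rel_tol=1e-6):
    """If the 1-D coordinate array (cell centres) is regularly spaced, return
    (origin, step) where origin is the outer edge (pixel-is-area convention);
    otherwise return None. Simplified sketch of a GuessGeoTransform-style check."""
    if len(coords) < 2:
        return None
    step = coords[1] - coords[0]
    if step == 0:
        return None
    for i in range(2, len(coords)):
        if abs((coords[i] - coords[i - 1]) - step) > abs(step) * rel_tol:
            return None  # irregular spacing: a genuinely rectilinear axis
    # shift from first cell centre to the outer edge, half a step out
    return coords[0] - step / 2, step

# Degenerate rectilinear Y axis with 10 m spacing (cell centres, north-up)
print(guess_origin_and_step([4600015.0, 4600005.0, 4599995.0]))  # → (4600020.0, -10.0)
print(guess_origin_and_step([0.0, 1.0, 3.0]))                    # → None
```

When the check succeeds, the full coordinate array collapses to two numbers per axis, which is exactly the compact georeferencing that proj:transform carries explicitly.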

so we check what's in the .zmetadata

gdal vsi copy /vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip/.zmetadata my.zmetadata

and see that everything we need is there

grep "quality/l1c_quicklook/r10m/tci/.zattrs" my.zmetadata  -A 30
        "quality/l1c_quicklook/r10m/tci/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "band",
                "y",
                "x"
            ],
            "long_name": "TCI: True Color Image",
            "proj:bbox": [
                300000.0,
                4490220.0,
                409800.0,
                4600020.0
            ],
            "proj:epsg": 32636,
            "proj:shape": [
                10980,
                10980
            ],
            "proj:transform": [
                10.0,
                0.0,
                300000.0,
                0.0,
                -10.0,
                4600020.0,
                0.0,
                0.0,
                1.0
            ],
            "proj:wkt2": "PROJCS[\"WGS 84 / UTM zone 36N\",GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4326\"]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",0],PARAMETER[\"central_meridian\",33],PARAMETER[\"scale_factor\",0.9996],PARAMETER[\"false_easting\",500000],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH],AUTHORITY[\"EPSG\",\"32636\"]]"
        },

I appreciate that it's a bit of a disaster that the Zarr spec (or any spec) doesn't appear to declare this, and also it's entirely weird to have this compact form of georeferencing while also having the legacy-style X,Y degenerate rectilinear arrays. This is not a GDAL failure, but I would like to get this supported as a more explicit and clear way of georeferencing Zarr.

mdsumner avatar Apr 17 '25 11:04 mdsumner

I appreciate that it's a bit of a disaster that the Zarr spec (or any spec) doesn't appear to declare this, and also it's entirely weird to have this compact form of georeferencing while also having the legacy-style X,Y degenerate rectilinear arrays. This is not a GDAL failure, but I would like to get this supported as a more explicit and clear way of georeferencing Zarr.

cf the mailing list exchange about a future EOPF driver to accommodate this: https://lists.osgeo.org/pipermail/gdal-dev/2025-April/thread.html#60414

rouault avatar Apr 17 '25 11:04 rouault

Some interesting tidbits:

A take on CF/tif/Zarr harmonization

https://github.com/zarr-developers/geozarr-spec/pull/67

Response to the PR for EOPF-aligned CRS+transform (which I consider a todo for me):

https://github.com/EOPF-Sample-Service/GDAL-ZARR-EOPF/issues/20#issuecomment-2890251902

mdsumner avatar May 26 '25 23:05 mdsumner