Call for Prototype/Implementation Owners for Different GeoZarr Conformance Classes
In the April 2nd monthly meeting, @christophenoel gave a great presentation explaining abstract data models, file formats, and encodings. He gave great context explaining how HDF, CF, and GDAL work, and proposed a meta-model as a bridge to zarr. More importantly he identified that what we have struggled with most in GeoZarr is trying to resolve issues that stem from diverging abstract geospatial data models.
In this proposed unified data model, there would be specific Profiles (labeled in this slide as Feature Types, but the group agrees to move forward with the terminology of Profiles):
The group that attended the call agreed with this characterization and approach. To move this conversation forward, we want to identify point of contacts who will own a specific type of Profile that is desired to work with GeoZarr. These owners would be responsible for prototyping specific encodings in GeoZarr and support full round-trip translation between the existing data model implementation to GeoZarr and back.
Below is the working list of Profiles that we need to identify owners for - please add additional Profiles to this list and suggestions of the best people to engage with:
- RGB Raster
- Single Variable Raster
- 3D Raster (XYT, XYZ)
- Hyperspectral
| Profile | Proposed Owners |
|---|---|
| RGB Raster | |
| Single Variable Raster | |
| 3D Raster (XYT, XYZ) | @maxrjones ? @ethanrd ? @briannapagan @dcherian |
| Hyperspectral | |
| SAR SLC | @emmanuelmathot |
| DEM | @emmanuelmathot |
Once we have an agreed upon list of Profiles and identified potential owners, I suggest this focus group to meet at a more frequent interval than monthly to coordinate. Of course open to any suggestions and feedback!
Here are some other potential profiles, that might be considered new items or could be folded into items in the list above:
-
Multi-spectral - more than a 3 band RGB, but less than hyperspectral. Examples: Landsat, MODIS, Sentinel-2 -3 -5P products
-
Topography - 2-D raster with one or more height bands. Examples: SRTM DEM, DTMs, DSMs
-
SAR Single Look Complex (SLC) - Images containing both scaler and complex values. Examples: Sentinel-1 bursts
I volunteer for the SAR SLC and DEM. I believe it is important to define the ideal profile for access patterns that align with various use cases: simple screening, terrain correction, interferometry using topsar bursting passing, ...
EDIT: and for multispectral, having different groups of resolution must be also addressed.
I would add the GDAL multidimensional model, and at least be clear that "GDAL" (as above) usually means "2D classic raster" (setting aside the warper api and the geoocation frameworks), I don't think that's been considered.
edit: I wrote a bit about it here, it's not much but I've only just got my teeth into it in recent weeks https://www.hypertidy.org/posts/2025-03-12-r-py-multidim/r-py-multidim
oops, my apologies I see @christophenoel did cover this, very glad to see
(awesome having this video and transcript!)
@christophenoel could you share the slides of this presentation?
@christophenoel could you share the slides of this presentation?
the link to the slides has been shared in the public geozarr google group on https://groups.google.com/u/0/g/geozarr/c/9NbEa84BBSA and is https://drive.google.com/file/d/1zoIhQK-J4fSM3dsRdWXXXW9v57GrhjTi/view?usp=sharing
I'd love to assist with the 3D raster feature.
Thanks for sharing the link. I'm, Interested in the first four items, with a preference to start with the RGB case and a single-variable example initially.
For info, I created branch cnl-examples with an initial RGB raster profile example in both Zarr V2 and Zarr V3 formats. The examples are provided in a Jupyter Notebook intended for automatic launch via MyBinder.
You can test by creating the Jupyter environment simply by accessing the URL: binder
Note: the V3 was created using an old library. I will fix this.
I'm up for RGB,single, and DEM, especially where they overlap with VRT or GTI (COP30, GEBCO, terrain RGB)
How about XYZT? Thredds servers via fileServer vs dodsC, there's a few good examples on NCI here
don't know anything about hyperspectral 😀
While drafting the raster profiles and their examples, it became clear that some profiles—such as time-series-raster—serve best as complementary extensions to core 2D raster profiles (e.g. scalar-raster, rgb-raster). They add requirements for specific dimensions (e.g. time ) but do not redefine the overall structure.
To maintain interoperability and simplicity, the number of combinations must remain limited. Excessive flexibility would increase complexity for applications and hinder standardisation efforts.
📦 These initial drafts are available in a dedicated branch cnl-examples, along with working examples:
- scalar-raster – single 2D variable over (y, x)
- rgb-raster – RGB image with dimensions (band, y, x)
- band-raster – band-indexed data with optional wavelength mapping
- multispectral-raster – physical wavelengths as a dimension (wavelength, y, x)
- time-series-raster – adds a time dimension to base rasters
- vertical-raster – 3D raster over (z, y, x), for altitude or vertical level
📓 You can explore them directly in a Jupyter environment using Binder: 👉 Launch examples notebook
Thank you @christophenoel for these examples already! @emmanuelmathot, @maxrjones and I chatted yesterday about how to address this work and I want to make sure folks volunteering aren't diverging too much in expectations. Can I propose folks who have volunteered to find a time to meet next week and discuss goals of this exercise?
Thank you, @christophenoel. In the meantime, could you transform the cnl-branch into a PR to allow commenting on your input?
@emmanuelmathot I prefer to wait until the work is split into distinct tasks. This approach avoids dealing with a large PR that generates scattered discussions. I think a branch for each profile should be created. Additionally, the current branch contains only early drafts.
Note that I am preparing example for Zarr v3.
The constraint concerning projected coordinates (projection_x_coordinate) seems overly restrictive and could be handled through an additional profile.
I think rgb_raster is not necessary and the band_raster can encompass the role. This is also closer to STAC band construct model and allow for better alignment in the future.
I agree. But maybe rgb_raster can be a profile refining band_raster (which means: includes at least red, green, blue)
Separate interpretation of sets of bands from their type
I don't think we have to model ambiguity of ZT from sets of types. It's a convention of sorts to model colour vs time vs depth vs any arbitrary coordinate space
TIFF can only specify grey, RGB, RGBa, multiband of any number of -type-
I wonder if we're mixing GDAL heuristics with actual tiff models
Hi @mdsumner , I'm not sure to understand to what you're replying exactly ? What is important to me, is to provide the ability to a client application to detect that there are RGB colors that can be displayed. There are multiple possible approach of course, but such standard profile would make sense to me.
Bare with me, I think I'm so used to human-detection of interpretation I can't even imagine a standard for that
To eliminate the ambiguity between data type and interpretation, the symbology extension (based on OGC symbology) proposed in the initial GeoZarr draft appears to offer a suitable solution. (see https://github.com/zarr-developers/geozarr-spec/blob/main/geozarr-spec.md#portrayals-and-symbology )
(Edit: however, a lot of GeoTiff would match this RGB profile, and allows detecting a possible mapping/export to GeoTiff.)
@christophenoel @emmanuelmathot @mdsumner In the CNG #geozarr slack channel I posed some times for next week to chat.
@christophenoel @emmanuelmathot @mdsumner @rabernat
Defining a single profile to cover all kinds of rasters and datacubes is difficult. These datasets can include many different combinations—such as time, height, or wavelength—and can use either a projected or geographic coordinate system. In OGC, a profile is meant to tailor a standard for a specific use or community, not to describe every possible variation.
From my point of view, a better approach is to use OGC conformance classes ((see conformance classes). These are clear, testable building blocks. Each dataset can declare which classes it follows—like “has time”, “uses projected coordinates”, or “includes multiple bands
This makes it easier to describe what a dataset contains, and to check that it meets the expected rules. Instead of one big profile, each dataset is a combination of smaller, well-defined parts.
Note: regarding the "meta model" spec approach, see the PR: https://github.com/zarr-developers/geozarr-spec/pull/64
Here's some text on the approach I proposed at the last meeting
GeoZarr Composable Conformance Classes
Defining a single profile to cover all kinds of rasters and datacubes is difficult. These datasets can include many different combinations—such as time, height, or wavelength—and can use either a projected or geographic coordinate system.
The Four Dimensions of Profiles
GeoZarr datasets may be classified within a multi-dimensional space of options. This option space includes:
- Data variables types - This characterizes the data values themselves. Options include
- Single-band raster (a single array)
- Multi-band raster - multiple bands with the same dtype and resolution stored as an additional
banddimension on an array, with named bands (e.g.B01,red) - Hyperspectral raster - similar to multi-band raster, but with more bands and an encoding of the
banddimension as specific wavelengths ranges - CF-style data variables. Following CF conventions each variable is stored as a separate array with
standard_nameattribute.
- Horizontal geospatial coordinate type - This describes how the horizontal (x, y) coordinates of the data are specified. Broadly speaking, options include
- GDAL-style projected raster coordinates. Here the data are treated as pixels on a rectangular grid, with georeferences provided by a GeoTransform and CRS.
- CF-style explicit coordinates. Here the coordinates are encoded using NetCDF / CF conventions, with all of the possibilities allowed therein (e.g. independent latitude and longitude coordinte, two-dimensional latitude, longitude coordinate )
- Discrete Global Grid Systems (DGGS). Here the data are represented as cells within a specific DGGS. (Encoding is still TBD.)
- Vertical coordinate type - This describes the vertical dimension of the data. Options include
- None - no vertical dimension provided
- CF-Style vertical coordinate (ref)
- Temporal coordinate type
- None - no temporal dimension provided.
- CF-style time coordinate (ref)
Examples
| dataset | variable type | horiz. coord. | vert. coord. | time coord |
|---|---|---|---|---|
| Sentinel 2 Scene | Multiband Raster | GDAL | None | None |
| Harmonized Sentinel Datacube | Multiband Raster | GDAL | None | CF |
| CMIP6 Output | CF | CF | CF | CF |
Thank you for kicking this off Ryan, this is a great foundation to build from! I am trying to think where sparse/ragged data, would fall into the existing table. It looks like CF conventions work: https://www.ncei.noaa.gov/netcdf-ragged-array-format, so perhaps there is another column of just 'n coord' where variable and 'n-coord' would fall under CF. Also trying to follow: https://github.com/pydata/xarray/discussions/7988
I like the idea of the "building block" options for constructing profiles, but I have some questions/comments on the current option descriptions.
In regards to "multi-band raster" type, many imaging satellite data products may not fit this definition, because the pixel spacing ("resolution") and/or dtype differs between bands. For example, Landsat 9 has bands with 15m, 30m, and 100m pixels, and the bands have different data types (INT16, UINT16, UINT8). Could the data variable type dimension option be expanded to accommodate this, or would this require a satellite data product to be composed of multiple GeoZarr multi-band rasters?
Also I'm not sure about the distinction between multi-band and hyper-spectral... multi-spectral satellite data products often have bands that have specific wavelength ranges, and this information is important when trying to harmonize bands between different sensors (example: Landsat 9 and Sentinel-2).
I am trying to think where sparse/ragged data, would fall into the existing table
They currently don't. What I wrote above is focused on dense rasters. Could you clarify the specific use case you have in mind here (e.g. an example from an existing data product)?
many imaging satellite data products may not fit this definition, because the pixel spacing ("resolution") and/or dtype differs between bands
This is a good point Tyler. Zarr can't treat arrays of different shape or dtype as part of the same array. (Related perhaps to Brianna's comment about ragged arrays.) In this case, the different bands of different resolution would have to be stored as distinct arrays. In the CF coordinate model, they would also need distinct dimension coordinates. (Not sure how the GDAL raster coordinate model handles that case; is the affine transform the same?)
But in summary, yes, we would need to modify this categorization to allow for this scenario.
Also I'm not sure about the distinction between multi-band and hyper-spectral
I agree it's a fuzzy distinction. Is there an existing metadata convention that covers this somehow, e.g. in STAC? AFAIK CF does not.
(Not sure how the GDAL raster coordinate model handles that case; is the affine transform the same?)
GDAL calls these subdatasets, and that case (can't be stored on the same array) is exactly when a container format will present as subdatasets. Each one then has its own crs and transform (these could be grouped together but will or won't be depending on driver details, I think)
e.g. snipping out a few subdatasets from this file to show the range of array sizes (here unrolled as bands in GDAL classic mode for dims > yx), each "*_NAME=" here is a classic 2D raster with its own transform and crs
gdalinfo "ZARR:\"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip\""
...
Subdatasets:
SUBDATASET_1_NAME=ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/conditions/geometry/mean_viewing_incidence_angles
SUBDATASET_1_DESC=[13x2] /conditions/geometry/mean_viewing_incidence_angles (Float64)
SUBDATASET_2_NAME=ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/conditions/geometry/sun_angles
SUBDATASET_2_DESC=[2x23x23] /conditions/geometry/sun_angles (Float64)
...
SUBDATASET_3_NAME=ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/conditions/geometry/viewing_incidence_angles
SUBDATASET_3_DESC=[13x4x2x23x23] /conditions/geometry/viewing_incidence_angles (Float64)
...
SUBDATASET_7_NAME=ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/conditions/mask/detector_footprint/r10m/b08
SUBDATASET_7_DESC=[10980x10980] /conditions/mask/detector_footprint/r10m/b08 (Byte)
SUBDATASET_8_NAME=ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/conditions/mask/detector_footprint/r20m/b05
SUBDATASET_8_DESC=[5490x5490] /conditions/mask/detector_footprint/r20m/b05 (Byte)
that's classic mode, in multidimensional mode it's a lot more like zarr groups and arrays
@rabernat The specific dataset I was thinking of was an example was OCO-2. https://disc.gsfc.nasa.gov/datasets/OCO2_L2_Lite_FP_11.2r/summary?keywords=oco2
Any sounding type of dataset or level-2 product would be similar.