
Call for Prototype/Implementation Owners for Different GeoZarr Conformance Classes

Open brianna-corremonte opened this issue 8 months ago • 43 comments

In the April 2nd monthly meeting, @christophenoel gave a great presentation explaining abstract data models, file formats, and encodings. He gave great context explaining how HDF, CF, and GDAL work, and proposed a meta-model as a bridge to zarr. More importantly he identified that what we have struggled with most in GeoZarr is trying to resolve issues that stem from diverging abstract geospatial data models.

In this proposed unified data model, there would be specific Profiles (labeled in this slide as Feature Types, but the group agrees to move forward with the terminology of Profiles):

Image Image

The group that attended the call agreed with this characterization and approach. To move this conversation forward, we want to identify points of contact who will each own a specific type of Profile that should work with GeoZarr. These owners would be responsible for prototyping specific encodings in GeoZarr and supporting full round-trip translation from the existing data model implementation to GeoZarr and back.

Below is the working list of Profiles that we need to identify owners for - please add additional Profiles to this list and suggestions of the best people to engage with:

  • RGB Raster
  • Single Variable Raster
  • 3D Raster (XYT, XYZ)
  • Hyperspectral
| Profile | Proposed Owners |
|---|---|
| RGB Raster | |
| Single Variable Raster | |
| 3D Raster (XYT, XYZ) | @maxrjones ? @ethanrd ? @briannapagan @dcherian |
| Hyperspectral | |
| SAR SLC | @emmanuelmathot |
| DEM | @emmanuelmathot |

Once we have an agreed-upon list of Profiles and have identified potential owners, I suggest this focus group meet more frequently than monthly to coordinate. Of course, I am open to any suggestions and feedback!

brianna-corremonte avatar Apr 03 '25 14:04 brianna-corremonte

Here are some other potential profiles, that might be considered new items or could be folded into items in the list above:

  • Multi-spectral - more than a 3-band RGB, but less than hyperspectral. Examples: Landsat, MODIS, Sentinel-2, -3, and -5P products

  • Topography - 2-D raster with one or more height bands. Examples: SRTM DEM, DTMs, DSMs

  • SAR Single Look Complex (SLC) - Images containing both scalar and complex values. Examples: Sentinel-1 bursts

tylere avatar Apr 03 '25 17:04 tylere

I volunteer for the SAR SLC and DEM. I believe it is important to define the ideal profile for access patterns that align with various use cases: simple screening, terrain correction, interferometry using topsar bursting passing, ...

EDIT: and for multispectral, having different resolution groups must also be addressed.

emmanuelmathot avatar Apr 03 '25 20:04 emmanuelmathot

I would add the GDAL multidimensional model, and at least be clear that "GDAL" (as above) usually means "2D classic raster" (setting aside the warper API and the geolocation frameworks). I don't think that's been considered.

edit: I wrote a bit about it here, it's not much but I've only just got my teeth into it in recent weeks https://www.hypertidy.org/posts/2025-03-12-r-py-multidim/r-py-multidim

mdsumner avatar Apr 03 '25 20:04 mdsumner

oops, my apologies I see @christophenoel did cover this, very glad to see

(awesome having this video and transcript!)

mdsumner avatar Apr 04 '25 01:04 mdsumner

@christophenoel could you share the slides of this presentation?

felixcremer avatar Apr 04 '25 08:04 felixcremer

> @christophenoel could you share the slides of this presentation?

The link to the slides has been shared in the public GeoZarr Google group at https://groups.google.com/u/0/g/geozarr/c/9NbEa84BBSA and is https://drive.google.com/file/d/1zoIhQK-J4fSM3dsRdWXXXW9v57GrhjTi/view?usp=sharing

echarles avatar Apr 04 '25 19:04 echarles

I'd love to assist with the 3D raster feature.

dcherian avatar Apr 04 '25 21:04 dcherian

Thanks for sharing the link. I'm interested in the first four items, with a preference to start with the RGB case and a single-variable example.

For info, I created branch cnl-examples with an initial RGB raster profile example in both Zarr V2 and Zarr V3 formats. The examples are provided in a Jupyter Notebook intended for automatic launch via MyBinder.

You can test by creating the Jupyter environment simply by accessing the URL: binder

Image

christophenoel avatar Apr 07 '25 15:04 christophenoel

Note: the V3 was created using an old library. I will fix this.

christophenoel avatar Apr 07 '25 15:04 christophenoel

I'm up for RGB, single, and DEM, especially where they overlap with VRT or GTI (COP30, GEBCO, terrain RGB)

How about XYZT? THREDDS servers via fileServer vs dodsC, there are a few good examples on NCI here

don't know anything about hyperspectral 😀

mdsumner avatar Apr 07 '25 20:04 mdsumner

While drafting the raster profiles and their examples, it became clear that some profiles, such as time-series-raster, serve best as complementary extensions to core 2D raster profiles (e.g. scalar-raster, rgb-raster). They add requirements for specific dimensions (e.g. time) but do not redefine the overall structure.

To maintain interoperability and simplicity, the number of combinations must remain limited. Excessive flexibility would increase complexity for applications and hinder standardisation efforts.

📦 These initial drafts are available in a dedicated branch cnl-examples, along with working examples:

📓 You can explore them directly in a Jupyter environment using Binder: 👉 Launch examples notebook

christophenoel avatar Apr 08 '25 13:04 christophenoel

Thank you @christophenoel for these examples already! @emmanuelmathot, @maxrjones and I chatted yesterday about how to address this work, and I want to make sure folks volunteering aren't diverging too much in expectations. Can I propose that folks who have volunteered find a time to meet next week to discuss the goals of this exercise?

brianna-corremonte avatar Apr 08 '25 15:04 brianna-corremonte

Thank you, @christophenoel. In the meantime, could you turn the cnl-examples branch into a PR to allow commenting on your input?

emmanuelmathot avatar Apr 08 '25 17:04 emmanuelmathot

@emmanuelmathot I prefer to wait until the work is split into distinct tasks. This approach avoids dealing with a large PR that generates scattered discussions. I think a branch for each profile should be created. Additionally, the current branch contains only early drafts.

Note that I am preparing an example for Zarr v3.

The constraint concerning projected coordinates (projection_x_coordinate) seems overly restrictive and could be handled through an additional profile.

christophenoel avatar Apr 09 '25 08:04 christophenoel

I think rgb_raster is not necessary and band_raster can cover that role. This is also closer to the STAC band construct model and allows for better alignment in the future.

emmanuelmathot avatar Apr 09 '25 08:04 emmanuelmathot

I agree. But maybe rgb_raster can be a profile refining band_raster (which means: includes at least red, green, blue)
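The refinement idea can be pictured with a minimal sketch. The profile names come from this thread, but the metadata layout (a plain list of band names), the uniqueness rule, and both helper functions are hypothetical and not part of any GeoZarr draft.

```python
# Hypothetical sketch: rgb_raster modelled as a refinement of band_raster.
# The metadata layout (a plain list of band names) is an assumption.

REQUIRED_RGB_BANDS = {"red", "green", "blue"}


def is_band_raster(band_names):
    """Assumed rule: at least one band, and band names must be unique."""
    return len(band_names) > 0 and len(set(band_names)) == len(band_names)


def is_rgb_raster(band_names):
    """An rgb_raster is a band_raster that includes at least red, green, blue."""
    return is_band_raster(band_names) and REQUIRED_RGB_BANDS <= set(band_names)


print(is_rgb_raster(["red", "green", "blue", "nir"]))  # True
print(is_rgb_raster(["B01", "B02"]))                   # False
```

Because the RGB check only adds a requirement on top of the band check, any client that understands band_raster can still read an rgb_raster dataset.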

christophenoel avatar Apr 09 '25 09:04 christophenoel

Separate interpretation of sets of bands from their type

I don't think we have to model ambiguity of ZT from sets of types. It's a convention of sorts to model colour vs time vs depth vs any arbitrary coordinate space

TIFF can only specify grey, RGB, RGBa, multiband of any number of -type-

I wonder if we're mixing GDAL heuristics with actual tiff models

mdsumner avatar Apr 09 '25 11:04 mdsumner

Hi @mdsumner, I'm not sure I understand what exactly you're replying to. What is important to me is to give a client application the ability to detect that there are RGB colors that can be displayed. There are multiple possible approaches, of course, but such a standard profile would make sense to me.

christophenoel avatar Apr 09 '25 11:04 christophenoel

Bear with me, I think I'm so used to human detection of interpretation that I can't even imagine a standard for that

mdsumner avatar Apr 09 '25 12:04 mdsumner

To eliminate the ambiguity between data type and interpretation, the symbology extension (based on OGC symbology) proposed in the initial GeoZarr draft appears to offer a suitable solution (see https://github.com/zarr-developers/geozarr-spec/blob/main/geozarr-spec.md#portrayals-and-symbology).

(Edit: however, a lot of GeoTIFFs would match this RGB profile, which allows detecting a possible mapping/export to GeoTIFF.)

christophenoel avatar Apr 09 '25 13:04 christophenoel

@christophenoel @emmanuelmathot @mdsumner In the CNG #geozarr Slack channel I posted some times for next week to chat.

brianna-corremonte avatar Apr 09 '25 21:04 brianna-corremonte

@christophenoel @emmanuelmathot @mdsumner @rabernat

Defining a single profile to cover all kinds of rasters and datacubes is difficult. These datasets can include many different combinations—such as time, height, or wavelength—and can use either a projected or geographic coordinate system. In OGC, a profile is meant to tailor a standard for a specific use or community, not to describe every possible variation.

From my point of view, a better approach is to use OGC conformance classes (see conformance classes). These are clear, testable building blocks. Each dataset can declare which classes it follows, such as “has time”, “uses projected coordinates”, or “includes multiple bands”.

This makes it easier to describe what a dataset contains, and to check that it meets the expected rules. Instead of one big profile, each dataset is a combination of smaller, well-defined parts.
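As a rough illustration of "a combination of smaller, well-defined parts", the sketch below lets a dataset declare the classes it follows and tests each one with a small rule. The class names, the metadata keys (`conforms_to`, `dimensions`, `crs_type`, `bands`), and the function are all hypothetical and not taken from the GeoZarr draft.

```python
# Hypothetical sketch: a dataset declares the conformance classes it follows,
# and each class maps to one small, testable rule. Class names and metadata
# keys are illustrative assumptions, not taken from the GeoZarr draft.

CONFORMANCE_CHECKS = {
    "has-time": lambda meta: "time" in meta.get("dimensions", []),
    "projected-coordinates": lambda meta: meta.get("crs_type") == "projected",
    "multiple-bands": lambda meta: len(meta.get("bands", [])) > 1,
}


def failed_classes(meta):
    """Return the declared conformance classes whose rule the dataset fails."""
    failures = []
    for name in meta["conforms_to"]:
        check = CONFORMANCE_CHECKS.get(name)
        # An unknown class cannot be verified, so it is reported as a failure.
        if check is None or not check(meta):
            failures.append(name)
    return failures


dataset = {
    "conforms_to": ["has-time", "multiple-bands"],
    "dimensions": ["time", "band", "y", "x"],
    "bands": ["red", "green", "blue"],
}
print(failed_classes(dataset))  # []
```

Each rule is independent, so adding a new class means adding one entry rather than revising a monolithic profile.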

Image

christophenoel avatar Apr 14 '25 06:04 christophenoel

Note: regarding the "meta model" spec approach, see the PR: https://github.com/zarr-developers/geozarr-spec/pull/64

christophenoel avatar Apr 15 '25 11:04 christophenoel

The latest Editor's Draft version of the OGC GeoZarr Specification is found here in HTML or PDF

christophenoel avatar Apr 15 '25 13:04 christophenoel

Here's some text on the approach I proposed at the last meeting

GeoZarr Composable Conformance Classes

Defining a single profile to cover all kinds of rasters and datacubes is difficult. These datasets can include many different combinations—such as time, height, or wavelength—and can use either a projected or geographic coordinate system.

The Four Dimensions of Profiles

GeoZarr datasets may be classified within a multi-dimensional space of options. This option space includes:

  • Data variable types - This characterizes the data values themselves. Options include
    • Single-band raster (a single array)
    • Multi-band raster - multiple bands with the same dtype and resolution stored as an additional band dimension on an array, with named bands (e.g. B01, red)
    • Hyperspectral raster - similar to multi-band raster, but with more bands and an encoding of the band dimension as specific wavelength ranges
    • CF-style data variables. Following CF conventions, each variable is stored as a separate array with a standard_name attribute.
  • Horizontal geospatial coordinate type - This describes how the horizontal (x, y) coordinates of the data are specified. Broadly speaking, options include
    • GDAL-style projected raster coordinates. Here the data are treated as pixels on a rectangular grid, with georeferences provided by a GeoTransform and CRS.
    • CF-style explicit coordinates. Here the coordinates are encoded using NetCDF / CF conventions, with all of the possibilities allowed therein (e.g. independent latitude and longitude coordinates, or two-dimensional latitude and longitude coordinates)
    • Discrete Global Grid Systems (DGGS). Here the data are represented as cells within a specific DGGS. (Encoding is still TBD.)
  • Vertical coordinate type - This describes the vertical dimension of the data. Options include
    • None - no vertical dimension provided
    • CF-style vertical coordinate (ref)
  • Temporal coordinate type
    • None - no temporal dimension provided.
    • CF-style time coordinate (ref)
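For the GDAL-style option in the list above, the GeoTransform is a set of six affine coefficients that map a pixel position to georeferenced coordinates. A minimal sketch of that mapping follows; the coefficient order follows GDAL's documented convention, while the numeric values and the function name are made up for illustration.

```python
# Sketch of the GDAL-style affine GeoTransform. The six-coefficient order
# follows GDAL's documented convention; the numeric values are made up.

def pixel_to_coords(gt, col, row):
    """Map a pixel (col, row) to the georeferenced (x, y) of its top-left corner."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y


# (origin x, pixel width, row rotation, origin y, column rotation, pixel height)
# A north-up grid with 10-unit pixels has zero rotation and a negative height.
geotransform = (600000.0, 10.0, 0.0, 5000000.0, 0.0, -10.0)

print(pixel_to_coords(geotransform, 0, 0))     # (600000.0, 5000000.0)
print(pixel_to_coords(geotransform, 100, 50))  # (601000.0, 4999500.0)
```

In the CF-style option the same positions would instead be stored explicitly as coordinate arrays rather than derived from six numbers.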

Image

Examples

| dataset | variable type | horiz. coord. | vert. coord. | time coord. |
|---|---|---|---|---|
| Sentinel 2 Scene | Multiband Raster | GDAL | None | None |
| Harmonized Sentinel Datacube | Multiband Raster | GDAL | None | CF |
| CMIP6 Output | CF | CF | CF | CF |
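The option space and the three example rows above can be written down as a small data structure. This is only a sketch: the enum and class names are hypothetical, and the option lists simply mirror the bullets in this comment.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical encoding of the four-dimensional option space described above.
# Names are illustrative; the members mirror the options listed in the comment.
VariableType = Enum("VariableType", "SINGLE_BAND MULTI_BAND HYPERSPECTRAL CF")
HorizontalCoord = Enum("HorizontalCoord", "GDAL CF DGGS")
VerticalCoord = Enum("VerticalCoord", "NONE CF")
TemporalCoord = Enum("TemporalCoord", "NONE CF")


@dataclass(frozen=True)
class Classification:
    variable: VariableType
    horizontal: HorizontalCoord
    vertical: VerticalCoord
    temporal: TemporalCoord


EXAMPLES = {
    "Sentinel 2 Scene": Classification(
        VariableType.MULTI_BAND, HorizontalCoord.GDAL,
        VerticalCoord.NONE, TemporalCoord.NONE),
    "Harmonized Sentinel Datacube": Classification(
        VariableType.MULTI_BAND, HorizontalCoord.GDAL,
        VerticalCoord.NONE, TemporalCoord.CF),
    "CMIP6 Output": Classification(
        VariableType.CF, HorizontalCoord.CF,
        VerticalCoord.CF, TemporalCoord.CF),
}

# Number of distinct combinations the listed options allow.
option_count = (len(VariableType) * len(HorizontalCoord)
                * len(VerticalCoord) * len(TemporalCoord))
print(option_count)  # 48
```

Even with only the options listed so far, the space already has 48 combinations, which is one reason to treat each axis as an independently testable building block.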

rabernat avatar Apr 24 '25 13:04 rabernat

Thank you for kicking this off Ryan, this is a great foundation to build from! I am trying to think where sparse/ragged data would fall into the existing table. It looks like CF conventions cover this: https://www.ncei.noaa.gov/netcdf-ragged-array-format, so perhaps there is another column of just 'n coord' where variable and 'n-coord' would fall under CF. I'm also trying to follow: https://github.com/pydata/xarray/discussions/7988

brianna-corremonte avatar Apr 24 '25 15:04 brianna-corremonte

I like the idea of the "building block" options for constructing profiles, but I have some questions/comments on the current option descriptions.

With regard to the "multi-band raster" type, many imaging satellite data products may not fit this definition, because the pixel spacing ("resolution") and/or dtype differs between bands. For example, Landsat 9 has bands with 15m, 30m, and 100m pixels, and the bands have different data types (INT16, UINT16, UINT8). Could the data variable type dimension option be expanded to accommodate this, or would this require a satellite data product to be composed of multiple GeoZarr multi-band rasters?

Also I'm not sure about the distinction between multi-band and hyper-spectral... multi-spectral satellite data products often have bands that have specific wavelength ranges, and this information is important when trying to harmonize bands between different sensors (example: Landsat 9 and Sentinel-2).

tylere avatar Apr 24 '25 21:04 tylere

> I am trying to think where sparse/ragged data would fall into the existing table

They currently don't. What I wrote above is focused on dense rasters. Could you clarify the specific use case you have in mind here (e.g. an example from an existing data product)?

> many imaging satellite data products may not fit this definition, because the pixel spacing ("resolution") and/or dtype differs between bands

This is a good point Tyler. Zarr can't treat arrays of different shape or dtype as part of the same array. (Related perhaps to Brianna's comment about ragged arrays.) In this case, the different bands of different resolution would have to be stored as distinct arrays. In the CF coordinate model, they would also need distinct dimension coordinates. (Not sure how the GDAL raster coordinate model handles that case; is the affine transform the same?)

But in summary, yes, we would need to modify this categorization to allow for this scenario.
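To make the constraint concrete: bands can share one array only when both shape and dtype match, so a product with mixed resolutions splits into several arrays. The sketch below groups bands accordingly; the band names, shapes, and dtypes are illustrative values, and the function is hypothetical.

```python
from collections import defaultdict

# Hypothetical band descriptors: (name, shape, dtype). The values are
# illustrative and not taken from a real product definition.
bands = [
    ("b02", (10980, 10980), "uint16"),
    ("b03", (10980, 10980), "uint16"),
    ("b05", (5490, 5490), "uint16"),
    ("mask", (5490, 5490), "uint8"),
]


def group_stackable(bands):
    """Group band names by (shape, dtype).

    Only bands that share both can be stacked along a band dimension
    inside a single array; every other group needs its own array.
    """
    groups = defaultdict(list)
    for name, shape, dtype in bands:
        groups[(shape, dtype)].append(name)
    return dict(groups)


for (shape, dtype), names in group_stackable(bands).items():
    print(shape, dtype, names)
```

Here four bands end up in three arrays, and in the CF coordinate model each distinct shape would also need its own dimension coordinates.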

> Also I'm not sure about the distinction between multi-band and hyper-spectral

I agree it's a fuzzy distinction. Is there an existing metadata convention that covers this somehow, e.g. in STAC? AFAIK CF does not.

rabernat avatar Apr 24 '25 21:04 rabernat

> (Not sure how the GDAL raster coordinate model handles that case; is the affine transform the same?)

GDAL calls these subdatasets, and that case (can't be stored on the same array) is exactly when a container format will present as subdatasets. Each one then has its own crs and transform (these could be grouped together but will or won't be depending on driver details, I think)

e.g. snipping out a few subdatasets from this file to show the range of array sizes (here unrolled as bands in GDAL classic mode for dims > yx), each "*_NAME=" here is a classic 2D raster with its own transform and crs

gdalinfo   "ZARR:\"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip\""

...
Subdatasets:
  SUBDATASET_1_NAME=ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/conditions/geometry/mean_viewing_incidence_angles
  SUBDATASET_1_DESC=[13x2] /conditions/geometry/mean_viewing_incidence_angles (Float64)
  SUBDATASET_2_NAME=ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/conditions/geometry/sun_angles
  SUBDATASET_2_DESC=[2x23x23] /conditions/geometry/sun_angles (Float64)
...
  SUBDATASET_3_NAME=ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/conditions/geometry/viewing_incidence_angles
  SUBDATASET_3_DESC=[13x4x2x23x23] /conditions/geometry/viewing_incidence_angles (Float64)
...
  SUBDATASET_7_NAME=ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/conditions/mask/detector_footprint/r10m/b08
  SUBDATASET_7_DESC=[10980x10980] /conditions/mask/detector_footprint/r10m/b08 (Byte)
  SUBDATASET_8_NAME=ZARR:"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip":/conditions/mask/detector_footprint/r20m/b05
  SUBDATASET_8_DESC=[5490x5490] /conditions/mask/detector_footprint/r20m/b05 (Byte)

that's classic mode, in multidimensional mode it's a lot more like zarr groups and arrays
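The range of array sizes in the listing above can be summarised with a few lines of text parsing. This is a sketch that works on the printed gdalinfo output only and does not call any GDAL API; the pattern and function name are hypothetical, and the two input lines are copied from the listing.

```python
import re

# Two DESC lines copied from the gdalinfo listing above.
GDALINFO_SNIPPET = """
  SUBDATASET_2_DESC=[2x23x23] /conditions/geometry/sun_angles (Float64)
  SUBDATASET_7_DESC=[10980x10980] /conditions/mask/detector_footprint/r10m/b08 (Byte)
"""

# Matches: SUBDATASET_<n>_DESC=[<dims>] <path> (<dtype>)
DESC_PATTERN = re.compile(r"SUBDATASET_(\d+)_DESC=\[([\dx]+)\] (\S+) \((\w+)\)")


def parse_subdatasets(text):
    """Extract (index, shape, path, dtype) from SUBDATASET_n_DESC lines."""
    found = []
    for index, dims, path, dtype in DESC_PATTERN.findall(text):
        shape = tuple(int(n) for n in dims.split("x"))
        found.append((int(index), shape, path, dtype))
    return found


for entry in parse_subdatasets(GDALINFO_SNIPPET):
    print(entry)
```

Each parsed entry corresponds to one array that classic mode exposes as its own 2D raster (with extra dimensions unrolled as bands).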

mdsumner avatar Apr 25 '25 00:04 mdsumner

@rabernat The specific dataset I was thinking of as an example was OCO-2: https://disc.gsfc.nasa.gov/datasets/OCO2_L2_Lite_FP_11.2r/summary?keywords=oco2

Any sounding type of dataset or level-2 product would be similar.

Image

brianna-corremonte avatar Apr 25 '25 05:04 brianna-corremonte