
Multiscale hierarchy structure needs clarification

Open emmanuelmathot opened this issue 5 months ago • 46 comments

Summary

The specification describes multiscale encoding but doesn't clearly define the exact hierarchical structure and relationship between parent groups and zoom level children.

Current Problem

  • Section 9.7.1 mentions "hierarchical layout" but lacks precise structure definition
  • Unclear relationship between multiscale root group and zoom level groups
  • Missing guidance on metadata inheritance between levels
  • Ambiguous about where multiscales metadata should be located

Proposed Solution

  1. Define precise hierarchy structure with clear examples
  2. Clarify metadata placement and inheritance rules
  3. Specify group naming conventions for zoom levels
  4. Document parent-child relationships explicitly

Implementation Evidence

The EOPF implementation uses this hierarchy:

/measurements/r10m/          # Parent group with multiscales metadata
├── 0/                       # Native resolution (zoom level 0)
│   ├── band1
│   ├── band2
│   └── spatial_ref
├── 1/                       # First overview level
│   ├── band1
│   ├── band2
│   └── spatial_ref
└── 2/                       # Second overview level
    ├── band1
    ├── band2
    └── spatial_ref

With multiscales metadata at the parent level:

zarr_json_attributes["multiscales"] = {
    "tile_matrix_set": tile_matrix_set,
    "resampling_method": "average",
    "tile_matrix_limits": tile_matrix_limits,
}
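For illustration, a minimal sketch of how such metadata could be attached to the parent group with zarr-python. The store path and the tile matrix values below are placeholders, not taken from the EOPF implementation:

import zarr

# Hypothetical store path; "measurements/r10m" is the parent group above.
root = zarr.open_group("product.zarr", path="measurements/r10m", mode="a")

# Attach the multiscales metadata at the parent level. The values here are
# placeholders standing in for the real tile_matrix_set / tile_matrix_limits
# structures used by the EOPF implementation.
root.attrs["multiscales"] = {
    "tile_matrix_set": "WebMercatorQuad",
    "resampling_method": "average",
    "tile_matrix_limits": {},
}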

Specification Sections to Update

  • Section 9.7.1 (Hierarchical Layout)
  • Section 9.7.2 (Metadata Encoding)
  • Add clear structural diagrams and examples

cc @vincentsarago, @maxrjones, @d-v-b, @briannapagan

emmanuelmathot avatar Aug 14 '25 07:08 emmanuelmathot

It should clarify decimation requirements too; the OGC TMS spec isn't clear here. A decimation factor of 2 is pretty standard, but that isn't always the case with OGC TMS - https://github.com/developmentseed/morecantile/issues/147

geospatial-jeff avatar Aug 14 '25 15:08 geospatial-jeff

I agree that the current model of overviews described in clause 7 (Unified Data Model) and clause 9 (Zarr Encoding) is not yet fully clear or complete.

In particular:

  • A Dataset Group, which contains one or multiple data variables, may include or exclude overviews. I believe the Dataset Group itself would hold the native resolution directly (a dedicated 0/ zoom-level group would not exist), since overviews can be added later and may not even be used or supported by the client.
  • As illustrated in your example, a zoom-level group may also contain multiple variables, not just a single one.
  • Variables are not necessarily restricted to 2D. While COG only supports 2D, it is not obvious whether the TMS model can always apply to variables with dimensions such as X, Y, Band (i.e. more than two dimensions). This needs clarification.

christophenoel avatar Sep 03 '25 12:09 christophenoel

  • A Dataset Group, which contains one or multiple data variables, may include or exclude overviews. I believe the Dataset Group itself would hold the native resolution directly (a dedicated 0/ zoom-level group would not exist), since overviews can be added later and may not even be used or supported by the client.

So that I understand correctly: are you suggesting a layout like this would be permitted?

/measurements/r10m/          # Parent group with multiscales metadata
├── band1
├── band2
├── spatial_ref
├── 1/                       # First overview level
│   ├── band1
│   ├── band2
│   └── spatial_ref
└── 2/                       # Second overview level
    ├── band1
    ├── band2
    └── spatial_ref

d-v-b avatar Sep 03 '25 12:09 d-v-b

@d-v-b @christophenoel, xarray does not allow this layout since it enforces data alignment:

The data in different datatree nodes are not totally independent. In particular dimensions (and indexes) in child nodes must be exactly aligned with those in their parent nodes. Exact alignment means that shared dimensions must be the same length, and indexes along those dimensions must be equal.

from https://docs.xarray.dev/en/latest/user-guide/hierarchical-data.html#data-alignment

I don't know if this is a constraint in the Unified Data Model too...
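For illustration, a minimal sketch of the alignment constraint, using hypothetical data: a parent node holding native resolution and a child overview reusing the same dimension name at half the length. With recent xarray versions (where DataTree is built in), constructing the tree is expected to fail:

import numpy as np
import xarray as xr

# Native resolution in the parent node, a 2x-decimated overview in a child.
parent = xr.Dataset({"band1": ("x", np.zeros(100))}, coords={"x": np.arange(100)})
overview = xr.Dataset({"band1": ("x", np.zeros(50))}, coords={"x": np.arange(50)})

# DataTree requires shared dimensions/indexes in child nodes to align exactly
# with those in the parent, so this should raise an alignment error.
try:
    xr.DataTree.from_dict({"/": parent, "/1": overview})
except ValueError as err:
    print("alignment error:", err)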

emmanuelmathot avatar Sep 03 '25 12:09 emmanuelmathot

Indeed. I have illustrated in my review: PR #86

christophenoel avatar Sep 03 '25 12:09 christophenoel

@d-v-b @christophenoel, xarray does not allow this layout since it enforces data alignment

The note you quoted applies inside a single Dataset (or DataTree node): all data variables within that Dataset must share consistent dimensions and coordinate indexes. This is the same principle as in the Unified Data Model, where a Dataset is defined as a coherent collection of variables with aligned dimensions

It does not mean that a parent Dataset (native resolution) and a child Dataset (overview resolution) must have aligned spatial dimensions. Each zoom-level group is its own Dataset, and xarray will enforce alignment within each of those groups, not across them.

With both approaches (native data in the parent group or in a child group), each group is a Dataset and thus its variable nodes all share the same dimensions (this is part of the definition of a dataset in our model spec).

Note: xarray can indeed open a single dataset at a time (with potentially multiple variables within that dataset)

christophenoel avatar Sep 03 '25 12:09 christophenoel

The data in different datatree nodes are not totally independent

Why then does the xarray documentation specifically say "datatree" and not "Dataset", given the terminology?

On the other hand, I am pretty sure I have run into the xarray error when trying to open a dataset with such a layout. I will try to reproduce it.

emmanuelmathot avatar Sep 03 '25 12:09 emmanuelmathot

You're right about the datatree definition, but from what I read, DataTree is the new abstraction that xarray introduced to represent a hierarchy of datasets.

In xarray, you can:

  • Open a single NetCDF/Zarr group (root or any subgroup) directly as a flat xarray.Dataset.
  • Or, if you want to reflect the entire group hierarchy in one object, you use xarray.DataTree.

UDM does not enforce the DataTree constraint: of course, many NetCDF files include child groups that do not share dimensions with their parent groups.
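A short sketch of the two access modes, assuming a hypothetical store with overview subgroups under measurements/:

import xarray as xr

store = "product.zarr"  # hypothetical store path

# Open one group as a flat Dataset: no alignment is checked across groups.
ds = xr.open_zarr(store, group="measurements/1", consolidated=False)

# Open the whole hierarchy as a DataTree: parent/child alignment is
# enforced, so a pyramid whose levels reuse the same dimension names
# will fail here.
tree = xr.open_datatree(store, engine="zarr")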

christophenoel avatar Sep 03 '25 12:09 christophenoel

Why then does the xarray documentation specifically say "datatree" and not "Dataset", given the terminology?

On the other hand, I am pretty sure I have run into the xarray error when trying to open a dataset with such a layout. I will try to reproduce it.

As clarified by David yesterday:

In xarray, DataTree provides a tree-structured container of multiple Dataset objects arranged in a hierarchy, with dimensions exactly aligned with those in their parent nodes.

Dataset itself remains the standard container and is still used for reading groups with variables and coordinates that do not follow these constraints.

Note that the following simplified code creates multiscales in child groups:

import xarray as xr

# biomass is assumed to be the dataset holding the native-resolution
# variables, e.g. opened from the store's 'measurements' group:
biomass = xr.open_zarr(biomass_zarr_path, group="measurements", consolidated=False)

# Select the 'abs' variable (preserve band dimension)
abs_data = biomass["abs"]
# Define scales for downsampling
scales = [2, 4, 8, 16, 32]  # Example scale factors
for i, scale in enumerate(scales, start=1):
    # Coarsen ONLY on x and y (preserve band dimension)
    downscaled = abs_data.coarsen(y=scale, x=scale, boundary="trim").mean()
    # Write to the Zarr store under the subgroup path (to_zarr takes
    # group=, not pathgroup=; mode="a" appends the new group rather than
    # overwriting the store)
    downscaled.to_dataset(name="abs").to_zarr(
        biomass_zarr_path, group=f"measurements/{i}", mode="a"
    )

The following code open and get information about the downscales:

import zarr
store = zarr.open(biomass_zarr_path, mode="r")
measurements_group = store["measurements"]
# List child groups
child_groups = measurements_group.group_keys()
# Collect info about each group
group_info = []
for group in child_groups:
    grp = measurements_group[group]
    abs_array = grp["abs"]  # access the 'abs' array
    info = {
        "Group": group,
        "Shape": abs_array.shape,
        "Chunks": abs_array.chunks,
        "DType": abs_array.dtype,
        "Attributes": len(grp.attrs),
    }
    group_info.append(info)

The following code shows zoom level 0 and zoom level 5 comparison:

import xarray as xr
import matplotlib.pyplot as plt
# measurement_path is assumed to point at the 'measurements' group written above
# Load root dataset (zoom level 0)
biomass_root = xr.open_zarr(measurement_path, consolidated=False)
rgb_root = biomass_root["abs"].sel(band=[1, 2, 3])
# Load zoom level 5 dataset
zoom_level = 5
zoom_path = f"{measurement_path}/{zoom_level}"
biomass_zoom = xr.open_zarr(zoom_path, consolidated=False)
rgb_zoom = biomass_zoom["abs"].sel(band=[1, 2, 3])
# Plot side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
# Plot root
rgb_root.plot.imshow(ax=axes[0], rgb="band", robust=True)
axes[0].set_title("Zoom Level 0 (Full Resolution)")
# Plot the selected zoom level
rgb_zoom.plot.imshow(ax=axes[1], rgb="band", robust=True)
axes[1].set_title(f"Zoom Level {zoom_level}")
plt.tight_layout()
plt.show()

The notebook will be released soon, I will keep you informed.

christophenoel avatar Sep 04 '25 07:09 christophenoel

I would be curious to hear more about how option A would break existing readers. Is the name of the dataset of particular importance to readers, such that opening r10m/0 is a breaking change compared to opening r10m?

To clarify the context for client libraries:

  • There are already many missions (including those from the European Space Agency, which initiated GeoZarr) that expose petabytes of data encoded in Zarr and indexed in catalogues. These products are typically encoded with native data at the dataset level (metadata + child data variables/coordinates). Processing these archives into Zarr has been estimated to cost a lot of money. There is no realistic chance that data providers would adopt GeoZarr if adding overviews required rewriting existing products in this way.
  • In addition, there are many client applications developed around these existing products. These applications expect to continue working unchanged if overviews are introduced. Overviews are expected to be defined as a backward-compatible extension, not something that alters the dataset’s core structure.

christophenoel avatar Sep 04 '25 07:09 christophenoel

In addition, there are many client applications developed around these existing products. These applications expect to continue working unchanged if overviews are introduced. Overviews are expected to be defined as a backward-compatible extension, not something that alters the dataset’s core structure.

Thanks, this is very helpful context. If I understand this constraint correctly, it seems to imply that this spec should not introduce any features (e.g., overviews) that would require changes to existing applications. For example, the proposal to nest all geozarr attributes under a "geozarr" key seems to be discouraged for this reason.

If the goal of the spec is not to develop a new standard but rather to document specific existing practices, then I think this scope should be declared early on the spec, and the client applications that function as a hard constraint on spec development should be named as such.

d-v-b avatar Sep 04 '25 07:09 d-v-b

Let me nuance my previous comment. We are working in a democratic group, and while I defend the interests of my client (ESA), who initiated the GeoZarr effort, my voice carries no more weight than anyone else's.

Regarding the proposal to have a geozarr attribute: this would not affect existing applications, since they would simply ignore an unknown attribute and therefore not benefit from the additional functionality.

That said, I believe it is fairly well established that the specification should avoid changes that require restructuring existing data or create critical backward compatibility issues (such as reformatting existing dataset layouts). What was agreed is that the existing CF/CDM ecosystem should be extended with GIS capabilities (and STAC, and ...), rather than reinventing an entirely new data model or specification. The scope should rather focus on extensions through additional metadata or new data, without breaking existing products.

I also agree that if this principle is approved, it should be made explicit in the specification, which still largely reflects its initial draft.

christophenoel avatar Sep 04 '25 08:09 christophenoel

@christophenoel, your argumentation is a bit paradoxical. How can we avoid changes to a specification that doesn't actually exist, having never been approved or published by any organization?

I'm increasingly frustrated by the burden of CDM and CF conventions. While these should serve as foundations to avoid reinventing the wheel (beneficial IMO), they shouldn't become rigid constraints we must always align with. If our primary constraint is accommodating existing tool implementations of a near-or-far geozarr spec (which, again, doesn't exist), then this isn't a specification—it's at best a best practices engineering report.

  • True specification approach: Define clear standards that existing implementations may need to adapt to
  • Documentation approach: Codify current inconsistent practices as-is

You can't standardize by committee when that committee is constrained by every existing implementation quirk. That's backwards. Libraries should adapt to good specifications, not vice versa.

If an organization has petabytes of data they don't want to reprocess, that's a migration problem, not a specification design constraint. Backward compatibility matters, but it can't paralyze standardization efforts.

We need to decide: are we writing GeoZarr the specification, or GeoZarr the engineering report? Right now we're stuck in limbo, creating neither effectively.

emmanuelmathot avatar Sep 04 '25 08:09 emmanuelmathot

I don’t think it’s paradoxical to call for backward compatibility. A specification that deliberately breaks alignment with existing archives, without a clear benefit, risks becoming irrelevant to the very stakeholders it is supposed to serve.

However, that is only one of the arguments against option A. My main concerns remain:

  • The reprocessing and refactoring of the product is not only an issue for existing (legacy) products, but would also be required every time overviews are added. I don’t see this as acceptable.
  • I also don’t see clear reasons why the native data should become just another child group.

christophenoel avatar Sep 04 '25 09:09 christophenoel

I just want to add that creating a completely new standard has been proposed in the past (including by myself), but after two years of discussion the conclusion was that we don’t need a new standard (for good or bad reasons). What is really needed is basically to extend xarray/Zarr to support additional capabilities.

That was essentially the agreement and conclusion. Of course, we can always restart the discussion again and again forever. :)

christophenoel avatar Sep 04 '25 09:09 christophenoel

The reprocessing and refactoring of the product is not only an issue for existing (legacy) products, but would also be required every time overviews are added. I don’t see this as acceptable.

Wouldn't a writer producing data for geozarr know in advance that there will be overviews? If so, the writer should use the overview layout from the start (i.e., write the source images to data/scale_0). Adding a new overview level requires adding a peer group to the original data, e.g. data/scale_1.

In bioimaging applications, when we started using image pyramids, we designed our data collection processes around the assumption that every individual image is just one of many scale levels. This meant the API for accessing any individual scale level was totally uniform. It also meant we could add upsampled data later without a confusing data layout. But this required adapting our data collection processes in light of a new layout, which I'm hearing might not be possible here?

I also don’t see clear reasons why the native data should become just another child group.

A few reasons:

  • If all overviews are separate child groups, the API for accessing an arbitrary overview is uniform. Uniformity is good. A uniform API is simple to express: "every scale level is in a separate group" takes up less text than "every scale level is in a separate group except the original data, which is stored in the parent group".
  • An irregular layout means every client will have to contain logic for special-casing the process for retrieving the native data (see the sketch after this list).
  • An irregular layout introduces the possibility of collisions between variable names and overview dataset names
  • A regular layout simplifies the definition of the group that contains overviews. It only needs to contain the "multiscales" attribute, and no other attributes. This means its attributes can be defined separately from the attributes that are valid for a Dataset, which ultimately simplifies the job of parsers.
  • A regular layout keeps the definition of Dataset attributes narrow, by keeping the "multiscales" key out of Dataset attributes. This reduces the risk of a client like xarray failing to propagate the "multiscales" attribute during data processing, which might violate data integrity.
  • The irregular layout invites confusion between the original dataset and the dataset with the finest sampling. For most overviews these are the same dataset, but if you add supersampled overviews, then the original data does not have the finest sampling, and this will be a source of confusion. The regular layout prevents this scenario.
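To make the uniformity point concrete, here is a small sketch (hypothetical store path and variable names) of what reader logic looks like under each layout:

import zarr

root = zarr.open_group("product.zarr", path="measurements/r10m", mode="r")

# Option A: uniform, every level (including the native data) is a child group.
def read_level_uniform(level: int):
    return root[f"{level}/band1"]

# Option B: irregular, level 0 lives in the parent and needs a special case.
def read_level_irregular(level: int):
    return root["band1"] if level == 0 else root[f"{level}/band1"]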

d-v-b avatar Sep 04 '25 09:09 d-v-b

Wouldn't a writer producing data for geozarr know in advance that there will be overviews?

No.

I expect that in many cases the overviews are added afterward; even the "GeoZarr conventions" themselves are added afterwards (conventions was the initial name of the spec).

I expect such an approach in many contexts where the data is produced first and GeoZarr is applied later, similarly to how CF conventions can be added when disseminating the products from a provider, possibly by a different stakeholder.

christophenoel avatar Sep 04 '25 09:09 christophenoel

I would like to add meteorological use cases for multi-scale geoZarr.

Daily, Met Services typically run several models, which may or may not share attributes, at several scales (global, regional, national, local/mesoscale, and microscale for tornados etc). The resolutions of the different scales are not consistent, as they depend on IT capability. Any of the scales could be considered the "original" or preferred scale for the forecasting problems at hand, and almost everything is archived for long-term access. Adding new "overview" forecasts, aka case studies, is not unusual. Forecasting also produces ensembles of, say, fifty forecasts for the same area and time; none is preferred and all are, a priori, equally likely. So a tree structure that does not assume the "root" is the primary or highest resolution is strongly preferred. HTH.

chris-little avatar Sep 04 '25 09:09 chris-little

This use case is indeed common, but I would not consider it an "overview case".

It resembles the handling of Sentinel-2 data, where an independent dataset exists for each resolution within the hierarchy:

/S2/         
├── r10m/                       
│   ├── band1
│   ├── band2
│   └── spatial_ref
├── r20m/                       
│   ├── band1
│   ├── band2
│   └── spatial_ref

Each variable would have overviews to support browsing. (and these can later be merged and realigned into an L3 datacube if needed)

christophenoel avatar Sep 04 '25 10:09 christophenoel

For clarity, I am forwarding the more general "GeoZarr Specifications Working Group" discussion to #93, which actually treats this aspect.

emmanuelmathot avatar Sep 04 '25 10:09 emmanuelmathot

For multiscale support, I think we need a consensus that gives flexibility to data producers:

The multiscale group may include or exclude the native data, and the child zoom-level groups may likewise include or exclude the native level (0). This flexibility allows producers to handle different scenarios, such as adding overviews later to an existing archive.

christophenoel avatar Sep 04 '25 15:09 christophenoel

For multiscale support, I think we need a consensus that gives flexibility to data producers:

I think we should question this principle. Flexibility for data producers will result in different layouts that all comply with GeoZarr. These different layouts all convey the exact same thing (a collection of variables that have a multiscale relationship). Multiple ways of saying the exact same thing is expensive for readers, which will need to implement logic to handle every possible multiscale layout. When you allow this kind of variability into the spec, you increase the likelihood of implementation skew, e.g. a reader that fails to implement a valid multiscales layout.

For this reason IMO there should be just 1 way to express a multiscale collection of variables, and it should be the way that is conceptually the simplest.

d-v-b avatar Sep 04 '25 15:09 d-v-b

Right, when you define a new data format. Not when you try to reach consensus in the frame of an OGC initiative that tries to reconcile many stakeholders and use cases. You may add recommendations, as in any other OGC spec.

By the way, overviews are typically added in a separate process, so the simplest for me is adding the downscales in child groups. 😉

christophenoel avatar Sep 04 '25 17:09 christophenoel

Dear colleagues, all recent discussions help to put things into perspective and to better understand everyone’s expectations. But could we now focus on tangible elements for or against an option for the multiscales?

I think that breaking backward compatibility (with current versions of NCZarr, xarray, GDAL) by adding an intermediate group is a valid argument in favour of the natural structure (a dataset group whose direct children are variables and dimensions), i.e. option B. The need to modify the structure of the data when adding overviews also seems to me a critical and valid argument supporting this option.

However, could you please help clarify the advantages of option A? I am not sure I fully understand why it is considered more analytics-friendly and what the other arguments are.

I understand that a demonstration implementation has recently been presented based on option A conventions, but I believe it is important to be able to present a balanced view, and then obtain feedback from the OGC members. Our task is to be as factual as possible, and I make no assumptions about the possible feedback.

christophenoel avatar Sep 05 '25 06:09 christophenoel

@christophenoel, here are the advantages of Option A (uniform hierarchy with native data in child group):

Uniformity & API Consistency:

  • Every scale level uses identical access patterns: group/{level}/variable
  • No special-case logic needed for native vs overview data
  • Simpler specification language: "all scales in numbered groups" vs "all scales except native"

Modern Use Cases: The "native" dataset concept is increasingly outdated:

  • TMS Compliance: Child keys can directly map to zoom levels, enabling true web-optimized Zarr
  • AI Upsampling: Current AI can generate higher-resolution data from lower resolutions, making any resolution potentially "native"
  • Multi-model scenarios: @chris-little's meteorological example shows no inherent "primary" resolution

Technical Benefits:

  • Eliminates variable name/group collisions
  • Separates multiscales metadata from dataset attributes, reducing parser complexity
  • Prevents xarray from accidentally dropping multiscales attributes during processing
  • Supports supersampled overviews without conceptual confusion

Tooling Reality: Your backward compatibility argument actually favors Option A. Xarray DataTree cannot open Option B's full hierarchy because parent/child dimensions don't align—this violates DataTree's core requirement. Option A works with both individual dataset access and potential future DataTree support.

Specification Clarity: Option A provides one clear way to express multiscale data. Option B introduces flexibility that creates implementation complexity and potential incompatibilities—exactly what specifications should avoid.

The uniformity principle @d-v-b outlined is indeed the strongest specification argument: consistent patterns reduce implementation burden and improve interoperability.

emmanuelmathot avatar Sep 05 '25 14:09 emmanuelmathot

Thank you for the summary.

christophenoel avatar Sep 05 '25 14:09 christophenoel

Briefly chiming in here...

The "native" dataset concept is increasingly outdated

This may be true for Discrete Global Grid Systems (DGGS) as well.

The reprocessing and refactoring of the product is not only an issue for existing (legacy) products, but would also be required every time overviews are added. I don’t see this as acceptable.

If an organization has petabytes of data they don't want to reprocess, that's a migration problem, not a specification design constraint. Backward compatibility matters, but it can't paralyze standardization efforts.

I'm getting outside of the scope of this issue, but I'm wondering if something like https://github.com/zarr-developers/zarr-specs/issues/287 (manifest zarr arrays) wouldn't help dealing with migration and backward compatibility concerns? Could it discharge GeoZarr a bit from those concerns in order to push the specs forward and take the most from the Zarr format?

For example we could have multiple zarr stores where only (the original) one contains the data:

  • zarr store A: original store containing the actual data as regular zarr arrays
  • zarr store B: contains almost exclusively manifest zarr arrays pointing to regular zarr arrays in store A, although here groups and metadata are organized differently
  • zarr store C: like store B with another metadata layout
  • etc.

For a data provider, supporting alternative specs (e.g., migrate to a new GeoZarr version, provide a fully CF-compliant store, etc.) without breaking changes could "just" consist in adding a new zarr store with manifest zarr arrays and a bunch of metadata, with zero data copy.

Does that sound like a possible / reasonable / acceptable solution? I'm not very familiar with zarr features such as storage transformers and I don't really have any experience as a provider of petabytes of data, so I might totally overlook the cost of duplicating the metadata of an entire data catalog (both in terms of storage and processing) as well as the performance cost of the data access indirection of zarr manifest arrays.

benbovy avatar Sep 06 '25 20:09 benbovy

The discussion is moving towards a more neutral summary, but I believe it remains opinionated and not fully accurate in places.

First, the purpose of overviews should be clear. Overviews are downscaled derivatives that allow a primary data variable to render quickly in a web viewer.

When a provider creates data (including raw sensor images, orthorectified data, processed data, AI-generated data), a new product is released and overviews can be generated for each variable within that product.

  • Every scale level uses identical access patterns: group/{level}/variable

The advantage of the group/{level}/variable approach is that every scale level, including the primary data, follows the same access pattern. By contrast, option B treats the primary dataset differently.

However, this benefit must be weighed against the drawback that option A requires restructuring existing datasets when overviews are added later.

An Option C could define a consistent structure (group/{level}/variable) even when no overviews exist. In this case, the primary dataset would always be stored under 0/. The drawback is that all existing Zarr datasets are not valid against that new structure.

  • No special-case logic needed for native vs overview data

Both options involve special-case logic. Option B – for reading overviews, clients must handle native data differently from overviews (zoom level 0 in the parent vs children for downscales). Option A – clients handle all scales uniformly, but must distinguish between datasets with overviews and those without.

  • AI Upsampling: Current AI can generate higher-resolution data from lower resolutions, making any resolution potentially "native"

If an analysis or AI process produces upsampled data, that output constitutes a new product or a new variable. That new variable can also have its own overviews.

  • Multi-model scenarios: @chris-little's meteorological example shows no inherent "primary" resolution

I don't think this use case suits the purpose of overviews.

For products that expose multiple native resolutions (for example, Sentinel-2), the product provides separate variables per resolution. Therefore, overviews should be generated for each of those variables. Note also that different resolutions often expose different band sets.

Tooling Reality: Your backward compatibility argument actually favors Option A. Xarray DataTree cannot open Option B's full hierarchy because parent/child dimensions don't align—this violates DataTree's core requirement. Option A works with both individual dataset access and potential future DataTree support.

Option B can present challenges for Xarray DataTree, since it expects alignment between parent and child dimensions. However, DataTree support is optional, and individual groups can still be accessed as Datasets. By contrast, Option A avoids this issue and offers uniform support for scale levels, but it also implies that existing Zarr datasets would need structural changes when overviews are added, rather than just metadata updates.

The uniformity principle @d-v-b outlined is indeed the strongest specification argument: consistent patterns reduce implementation burden and improve interoperability.

The pattern in Option A is internally consistent once overviews exist, since every scale level follows the same structure. However, it becomes inconsistent across datasets, because products without overviews follow a different layout. Option C has been proposed as a third alternative.

Proposed summary

Option A – group/{level}/variable including primary data

Advantages

  • For dataset with overviews, uniform access pattern across all scale levels, including the primary data.
  • Simplifies client logic by avoiding special cases between native and overview data.
  • Compatible with Xarray DataTree expectations, since each level is its own aligned dataset.

Drawbacks

  • Requires restructuring existing Zarr datasets when adding overviews, as the primary data must be moved under 0/.
  • Creates inconsistency across datasets with and without overviews (native dataset in 0/ vs directly in the parent).

Option B – Primary data in parent, overviews as children

Advantages

  • Backward compatible with existing datasets; overviews can be added later without changing structure.
  • Reflects common practice in current archives and avoids costly migration.

Drawbacks

  • Clients must treat primary data and overviews differently (parent vs children).
  • Xarray DataTree struggles with parent/child dimension misalignment, though individual groups can still be accessed as Datasets.

Option C – Always use group/{level}/variable, even without overviews

Advantages

  • Full structural consistency across datasets with or without overviews.
  • No need for restructuring when overviews are added later.

Drawbacks

  • Invalidates all existing Zarr datasets that do not already use this pattern.
  • Adds unnecessary hierarchy for simple datasets without overviews.

Please feel free to share feedback if this summary does not fully reflect your perspective. My intention is not to provide a definitive answer but to help clarify the discussion in a neutral tone.

christophenoel avatar Sep 14 '25 08:09 christophenoel

Would an overall structure like below (example for sentinel-2) reconcile the different points of view?

/measurements/
├── r10m/
│   ├── band1
│   ├── band2
│   └── spatial_ref
└── overviews/
    └── r10m/ 
        ├── 0/ 
        |   ├── band1
        |   ├── band2
        |   └── spatial_ref
        └── 1/ 
            ├── band1
            ├── band2
            └── spatial_ref

To avoid redundancy, measurements/overviews/r10m/0/band1 data would redirect to measurements/r10m/band1 data using zarr "manifest" arrays (storage transformer).

It seems to me that it has the advantages of options A/B/C without having their drawbacks. The main drawback here is that there's no kind of "manifest" array in the zarr specs yet, but this may be a good case for pushing https://github.com/zarr-developers/zarr-specs/issues/287?

benbovy avatar Sep 14 '25 10:09 benbovy

@benbovy I think that's an interesting layout, and I'm wondering if the plurality of different multiscale layouts might be a feature rather than a bug here. As long as the multiscales metadata completely describes the layout of the overviews, then many different layouts are compatible with the same API, and data producers could make their own decisions about which particular layout is right for their data / tooling.

That means the multiscales metadata should contain an explicit index or manifest of the different scale levels, to support simple access for a consumer. However, I don't think the current multiscales metadata requires a complete index of the actual layout. So maybe that's a high-priority clarification?

I don't know TMS metadata very well so I will use a very reduced schematic metadata to demonstrate how all of these layout proposals can be expressed:

A / C (group/{level}/variable including primary data)

{
    "multiscales": [{"path": "./0/variable", ...}, {"path": "./1/variable", ...}, ...]
}

B (Primary data in parent, overviews as children)

{
    "multiscales": [{"path": ".", ...}, {"path": "./1/variable", ...}, ...]
}

Benoit's proposal (Primary data in parent, overviews as children in a special overviews group)

Since the multiscales attribute is the primary entry point for the overviews, we can simply declare the source data by reference in the manifest:

{
    "multiscales": [{"path": ".", ...}, {"path": "./overviews/1/variable", ...}, ...]
}

If we lean more on the expressiveness of the multiscales metadata, then I think we can treat these discussions of particular layouts as an implementation detail. Does this seem like a good direction?
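As a sketch of that direction, a reader could resolve scale levels purely from the schematic multiscales entries above, without assuming any particular on-disk layout. Note the attribute structure here is the reduced form used in this comment, not the current spec:

import zarr

def resolve_scale_levels(group_path: str):
    """Return the zarr node for each scale level declared in the (schematic)
    multiscales metadata, whatever the on-disk layout is."""
    root = zarr.open_group(group_path, mode="r")
    nodes = []
    for entry in root.attrs.get("multiscales", []):
        path = entry["path"]
        if path in (".", ""):
            # Primary data stored in the group itself (option B style).
            nodes.append(root)
        else:
            # Child group/array, e.g. "./1/variable" or "./overviews/1/variable".
            nodes.append(root[path.lstrip("./")])
    return nodes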

d-v-b avatar Sep 14 '25 11:09 d-v-b