
Multiscale metadata

d-v-b opened this issue • 9 comments

(Following on discussion from https://github.com/ome/ngff/pull/85)

As schematized by @constantinpape here, the current (0.4) version of the spec results in a hierarchy like this:

image-group/
  .zgroup  <- this contains the zarr group metadata (just version atm)
  .zattrs <- this contains the group level metadata, which currently contains the ngff metadata
  scale-level0/  <- the array data for scale 0
    .zarray <- contains the zarr array metadata
    .zattrs <- contains additional metadata for the array, currently not used for ngff metadata
...

I have a few concerns with this arrangement, specifically regarding the relationship between image-group/.zattrs:multiscales and the absence of spatial metadata in image-group/scale-level0/.zattrs.

Array metadata

There is no spatial metadata (axes and coordinateTransformations) stored in scale-level0/.zattrs. This is undesirable, first from a semantic-purity standpoint: the spatial metadata for scale-level0 is a property of scale-level0, and as a general principle metadata should be located as close as possible to the thing it describes. From a practical standpoint, clients opening the array directly have no access to spatial metadata via the array's own .zattrs. Instead, clients must first parse image-group/.zattrs:multiscales to figure out the spatial embedding of an array. Not only is this indirect and inefficient, it's brittle: copying scale-level0 to a different group won't preserve the spatial metadata, which will inevitably lead to confusion and errors, at least as long as arrays are what clients actually access to get data.
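
For illustration, a minimal sketch of that indirection using zarr-python (v2-style API; the store location and paths are hypothetical):

```python
import zarr

# Hypothetical on-disk layout matching the hierarchy sketched above.
store = zarr.DirectoryStore("image-group")

# Opening the array directly: under the 0.4 layout its own .zattrs carry no
# axes / coordinateTransformations, so nothing spatial comes back here.
arr = zarr.open_array(store=store, path="scale-level0", mode="r")
print(dict(arr.attrs))

# To recover the spatial embedding, the client must instead open the parent
# group and search multiscales.datasets for the entry whose path matches.
group = zarr.open_group(store=store, mode="r")
multiscale = group.attrs["multiscales"][0]
dataset = next(d for d in multiscale["datasets"] if d["path"] == "scale-level0")
print(dataset["coordinateTransformations"])
```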

The multiscales.coordinateTransformations attribute contributes to the brittleness. I am seriously skeptical about this attribute, and I would appreciate it if someone could explain why it is necessary (as opposed to per-array coordinateTransformations, which seems much simpler and more robust). In https://github.com/ome/ngff/pull/85 @jbms suggested (and @constantinpape agreed with) a model with 3 different coordinate spaces:

* world (e.g. what is displayed in a viewer)
* multiscale
* dataset

I thought the role of coordinateTransformations was to map dataset coordinates to world coordinates. The existence of an additional coordinate space implies that coordinateTransformations are incomplete unless they are explicitly associated with a target coordinate space; this is currently not represented in the spec, but maybe it's on the roadmap? In any case, I think it's much simpler (and consistent with the actual semantics of data acquisition) to stipulate that there are only 2 coordinate spaces in scope: dataset coordinates (i.e., array indices) and world coordinates (physical units).

"multiscale" as a coordinate space is confusing... I don't see how "multiscale" could be considered a coordinate space with any meaningful difference from "world". multiscales.axes has physical (world) units, so I must be missing something here.

Metadata duplication

In the original zarr multiscales issue, the final proposal was extremely simple: multiscales was just a list of references to arrays, with no array metadata, plus some metadata about itself (e.g., version). Clients interested in displaying a multiscale image ultimately need to know the scaling and offset of each array; to make IO a bit more efficient for these clients, several voices supported duplicating array metadata (specifically spatial metadata) and putting it in multiscales. The logic was that clients would only need to perform one IO operation (fetching image-group/.zattrs) to get all the information needed about the multiscale collection, but the cost is duplicated metadata. I wonder how bad it would be for IO if we didn't duplicate any metadata, and clients had to first query image-group/.zattrs, parse multiscales.datasets (which is just a list of paths), and then access the metadata of each array listed in multiscales.datasets. For a typical multiscale collection with 5-8 scale levels, this means 5-8 additional fetches of JSON metadata. How bad is this for latency? If the fetches are launched concurrently, I suspect the impact would be minimal, and it will certainly be dwarfed by the time required to ultimately load chunk data. I think we should seriously consider this.
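
As a rough sketch of what that non-duplicated read path could look like (zarr-python, hypothetical store; fetch_array_attrs is an illustrative helper), the per-array metadata reads can be launched concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

import zarr

store = zarr.DirectoryStore("image-group")  # hypothetical location
group = zarr.open_group(store=store, mode="r")
paths = [d["path"] for d in group.attrs["multiscales"][0]["datasets"]]

def fetch_array_attrs(path):
    # One extra metadata read per scale level (the array's own .zattrs).
    return path, dict(zarr.open_array(store=store, path=path, mode="r").attrs)

# Launch the 5-8 per-array fetches concurrently; the added latency is then
# roughly one round trip rather than one round trip per scale level.
with ThreadPoolExecutor() as pool:
    per_array_attrs = dict(pool.map(fetch_array_attrs, paths))
```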

Suggestions

  1. Require that spatial metadata (axes and coordinateTransformations) for arrays reside primarily in array metadata, e.g. scale-level0/.zattrs. When an array's spatial metadata needs to be duplicated, e.g. in image-group/.zattrs:multiscales.datasets, it should be understood that such duplication is only for convenience / performance.
  2. If we do 1, seriously consider removing duplicated array metadata from multiscales, i.e. making multiscales.datasets just a list of references to other arrays with no array-specific metadata (sketched below). Clients have to do more work to compose the multiscale, but the metadata story is much cleaner.
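
To make suggestion 2 concrete, here is a hypothetical (non-normative) sketch of what the layout could look like, with the spatial metadata living in each array's own .zattrs and multiscales.datasets reduced to bare path references; all names and values are placeholders:

```python
# Hypothetical array-level metadata, e.g. scale-level1/.zattrs
# (placeholder values for a 2x in-plane downsampling by windowed averaging).
array_attrs = {
    "axes": [
        {"name": "z", "type": "space", "unit": "micrometer"},
        {"name": "y", "type": "space", "unit": "micrometer"},
        {"name": "x", "type": "space", "unit": "micrometer"},
    ],
    "coordinateTransformations": [
        {"type": "scale", "scale": [1.0, 2.0, 2.0]},
        {"type": "translation", "translation": [0.0, 0.5, 0.5]},
    ],
}

# Hypothetical group-level metadata, e.g. image-group/.zattrs:
# datasets become bare references, with no duplicated spatial metadata.
group_attrs = {
    "multiscales": [
        {
            "name": "image",
            "datasets": [{"path": "scale-level0"}, {"path": "scale-level1"}],
        }
    ]
}
```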

cc @bogovicj

d-v-b · Feb 18 '22

Regarding my description of separate world and multiscale coordinate spaces, Neuroglancer uses a similar model and I can explain the role of these different coordinate spaces in the context of Neuroglancer:

  • In Neuroglancer, coordinate spaces have dimension names and units as they do here, but units are allowed to have arbitrary coefficients, not just powers of ten or a thousand. Effectively, the unit coefficients correspond to an implicit additional scale transform applied to the input and an inverse scale transform applied to the output of the transform. These unit coefficients are used for various UI and navigation purposes.
  • For a volumetric dataset with (nominal) xyz resolution of 8x8x40nm, for example, we would typically define a multiscale coordinate space with dimensions x, y, z with units 8e-9m, 8e-9m, 40e-9m.
  • As I mentioned before, the dataset -> multiscale transforms are affine transforms that normally just specify the downsample factors and offsets. The first scale will usually have an identity or translation-only transform.
  • In Neuroglancer, data sources don't directly specify a multiscale -> world transformation; instead they specify an initial value for the multiscale -> layer transformation. The layer coordinate spaces are merged to produce a global coordinate space. In almost all cases the data source specifies an identity transform as the initial value for the multiscale -> layer transformation, because most of the existing formats don't have such a transformation. The exception is NIFTI, which can specify an affine transformation from "data space" to a canonical space. Neuroglancer automatically handles the unit coefficients, so there is no need to explicitly specify a transform to convert from e.g. 8nm to 1nm.

The dataset -> multiscale transformations are never exposed to the user directly, but the multiscale -> layer coordinate transformation is shown directly to the user (as an affine transform), and the user can edit it directly, reset it to an identity transform, etc. (Editing it only modifies the Neuroglancer state, not the original data.)

The user can also directly modify the units and coefficients of the multiscale coordinate space. This is technically redundant with just modifying the multiscale -> layer affine transformation, but when logically the goal is to "correct" the units of the data, it is much clearer to do it directly rather than modifying the affine transform.

For intermediate coordinate spaces like multiscale, while there are physical units like "meters" specified, these units are basically informational only --- they are displayed to the user under the data source properties and when editing the multiscale -> layer transform, but are otherwise discarded, since only the units of the layer coordinate space are ultimately relevant for displaying scale bars, etc.

@d-v-b mentioned that specifying the multiscale -> world transformation (or, in the case of Neuroglancer, the multiscale -> layer coordinate transformation) directly would be more explicit. But I think that depends on the scenario and on your mental model: e.g. if you have acquired each scale independently with some sort of optical zoom, and then somehow independently calibrated what the resolution of each scale is, then it would indeed be most explicit to specify these scale factors directly for each scale, rather than specifying every scale factor relative to the first scale factor. But the case that I deal with more commonly is one where the microscope only acquires data at a single resolution, and then we digitally downsample it. In that case we know 100% for certain the downsample factors --- that is the "original"/"real" metadata. Then we may also have an estimate of the resolution of the data produced by the microscope, or we may just have a placeholder value. In this case it is more natural to store the downsample factors and the base resolution separately rather than always pre-multiplying them.
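
As a small numeric illustration of that last point (assumed values, following the 8x8x40nm example above): keeping the base resolution and the exact downsample factors separate preserves the integer factors, while pre-multiplying bakes them into nominal physical scales that all need rewriting if the base estimate is later corrected.

```python
# Assumed values: nominal base resolution in nanometers (x, y, z) and exact
# digital downsample factors per scale level.
base_resolution_nm = (8, 8, 40)
downsample_factors = [(1, 1, 1), (2, 2, 1), (4, 4, 2), (8, 8, 4)]

# Pre-multiplied form: the "real" integer factors are no longer explicit, and a
# later correction of the estimated base resolution has to touch every level.
premultiplied_nm = [
    tuple(f * r for f, r in zip(factors, base_resolution_nm))
    for factors in downsample_factors
]
print(premultiplied_nm)  # [(8, 8, 40), (16, 16, 40), (32, 32, 80), (64, 64, 160)]
```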

In general I would say there is not necessarily any single world coordinate space --- e.g. there may be multiple alignments used for different purposes, or as better quality alignments are produced. Therefore you may not wish to store any transformation as a property of the array itself.

Another possibly relevant example: you may have acquired 3-d imagery (e.g. using xray ct) at multiple "optical" resolutions --- e.g. a large overview at low resolution and multiple smaller, possibly overlapping cutouts at higher resolutions. You may then wish to produce multiple digital downsample levels for each 3-d volume, view each multiscale volume individually, and also view the combined volume as a single multiscale volume that includes all of the scales from the individual volumes. In this case you need an additional coordinate transformation from each individual multiscale volume to the combined multiscale volume.

jbms · Feb 18 '22

But the case that I deal with more commonly is one where the microscope only acquires data at a single resolution, and then we digitally downsample it. In that case we know 100% for certain the downsample factors --- that is the "original"/"real" metadata. Then we may also have an estimate of the resolution of the data produced by the microscope, or we may just have a placeholder value. In this case it is more natural to store the downsample factors and the base resolution separately rather than always pre-multiplying them.

Even for digital downsampling, each scale level typically has a different translation applied to it, and that translation depends on the type of resampling procedure applied during downsampling. This translation should be specified explicitly in metadata. And if the translation is specified explicitly, we really should specify the scale as well. Scale and translation together completely specify a downsampled grid; failing to record both requires baking assumptions about downsampling routines into the format, and this should be avoided.

I understand that 99% of the time we might be doing 2x windowed averaging, and most data viewers automatically assume the offsets generated by this procedure, but this spec should not make such assumptions when being explicit is so easy. If someone wants to be weird and generate a multiscale collection by first downsampling by 2, then by 3, then by 4, while always resampling on a grid starting at 0, this should be consistent with the spec (and data viewers should be able to handle this).
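
To spell those offsets out, here is a minimal sketch (assuming unit grid spacing at the base scale, pixel centers at integer coordinates, and the 2-then-3-then-4 chain above) of the per-level scale/translation pairs under two downsampling conventions:

```python
from fractions import Fraction

# Cumulative downsample factors for a chain of 2x, then 3x, then 4x downsampling,
# expressed relative to the base grid.
cumulative_factors = [1, 2, 6, 24]

for k in cumulative_factors:
    scale = Fraction(k)
    # Windowed averaging: the first sample at cumulative factor k averages base
    # samples 0..k-1, so its center sits at (k - 1) / 2 in base-grid coordinates.
    translation_window_avg = Fraction(k - 1, 2)
    # Resampling on a grid starting at 0 (e.g. strided subsampling): no offset.
    translation_grid_at_zero = Fraction(0)
    print(k, scale, translation_window_avg, translation_grid_at_zero)
```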

In general I would say there is not necessarily any single world coordinate space --- e.g. there may be multiple alignments used for different purposes, or as better quality alignments are produced. Therefore you may not wish to store any transformation as a property of the array itself.

First, you are describing multiple transformations from dataset -> world, not multiple world coordinate spaces. I'm not sure what conditions would result in multiple world coordinate spaces. I've never worked with data with ambiguous axis semantics; maybe there are some examples?

Second, I don't think ome-ngff metadata should support multiple alignments of the same dataset. That seems way too complex. Instead, I would stipulate that there are just 2 coordinate spaces in scope for the format: dataset and world. This makes everything (in particular, transformations) much simpler, and it's very close to how most people think about their data when it comes off an instrument. This of course is just my opinion, so I would be curious to hear from other people.

d-v-b · Feb 18 '22

@d-v-b I 100% agree that we should be very explicit about the downsampling factors and translation due to the downsampling method. To me it seems most natural to represent that by specifying the downsample factors and offsets relative to the base resolution, i.e. in terms of base resolution voxels rather than some physical coordinate space, but I can see advantages to both approaches.

I think of the world coordinate space as just whatever relevant coordinate space you wish to work with for a particular task.

I would agree that in the simple case, where you have a single image and you are just adding in the nominal voxel size from the microscope (so your transformation is really just specifying your units), it may be reasonable to always work in the same physical coordinate space, and then you only need a single world coordinate space. But once we start talking about affine transforms, displacement fields, etc., we almost surely have multiple relevant coordinate spaces that we want to be able to reference. For example, say we have several types of microscopy data of the same sample:

  • EM volume
  • Confocal data (multiple channels)
  • Lightsheet data (multiple channels)

Each channel of each of these volumes may be represented by a bunch of 2-d images, each image with its own nominal coordinate spaces from the microscope parameters. In some cases you may wish to view the 2-d images directly.

Additionally we may produce individual alignments of each channel of each volume, and then align one volume to another.

Then there may also be a "reference" coordinate system for the organism, and we may wish to align the data to this reference coordinate system.

In total we may have:

  • original voxel coordinate space of each image
  • nominal physical coordinate space of each image
  • voxel coordinate space of aligned individual volumes
  • physical coordinate space of aligned individual volumes
  • Depending on how the EM data is acquired, there may be separate intermediate alignments of portions of the data, such as individual 2-d sections, FIBSEM hot knife tabs, thick sections imaged by GCIB
  • There may be multiple versions of alignments
  • reference coordinate space for organism

Depending on the stage of processing, I can imagine that all of these coordinate spaces may be relevant for visualization and/or processing tasks.

It seems like the "spaces" proposal by @bogovicj (#94) would address all of this, though. In general it just seems more natural to me to attach coordinate transformations to a named "view" rather than to the array itself: we can have arbitrarily many views, whereas we could attach just a single coordinate transformation to the array itself, and then we would also have the potential ambiguity of whether we want to refer to the "raw" array or the transformed array.

I suppose the intention may be that a view would be layered on top of a multiscale, rather than underneath it, though?

jbms · Feb 19 '22

Thanks for branching this out into a separate issue, @d-v-b. I didn't have time to read everything you and @jbms wrote carefully, but I'll leave a few comments:

I think this issue is bringing up three related but distinct points:

1 Meaning of coordinateTransformation and spaces

In #85 @jbms suggested (and @constantinpape agreed with) a model with 3 different coordinate spaces:

The point that I wanted to make (and that @jbms also made again here) is that there can be multiple "world" coordinate spaces (e.g. different registrations, different alignments etc.) and that coordinateTransformations are not exclusively meant to go from data space to (a single) world coordinate system.

  • The existence of an additional coordinate space implies that coordinateTransformations are incomplete unless they are explicitly associated with a target coordinate space; this is currently not represented in the spec, but maybe it's on the roadmap?

Indeed, it's not complete yet, but is on the immediate roadmap and discussed in #94. I think the only actionable thing to do here is to work on #94, #101 and follow ups to extend the definition of spaces and transformations and make sure that they are clear.

2 Where do we define array-specific metadata (axes and transformations)

I think this summarises the potential changes to metadata very well:

  1. Require that spatial metadata (axes and coordinateTransformations) for arrays reside primarily in array metadata, e.g. scale-level0/.zattrs. When an array's spatial metadata needs to be duplicated, e.g. in image-group/.zattrs:multiscales.datasets, it should be understood that such duplication is only for convenience / performance.

  2. If we do 1, seriously consider removing duplicated array metadata from multiscales, i.e. making multiscales.datasets just a list of references to other arrays with no array-specific metadata. Clients have to do more work to compose the multiscale, but the metadata story is much cleaner.

There are advantages / disadvantages to both solutions. I don't have a very strong opinion on this a priori, except for the sunk cost that all tools currently supporting ome.zarr are built with "consolidated" metadata at the group level in mind, and that this is a rather large architectural change.

3 How do we specify downscaling

Raised by @jbms:

be very explicit about the downsampling factors and translation due to the downsampling method. To me it seems most natural to represent that by specifying the downsample factors and offsets relative to the base resolution, i.e. in terms of base resolution voxels rather than some physical coordinate space, but I can see advantages to both approaches.

This point was quite extensively discussed already (sorry, I can't find the exact Issue/PR for it right now) and there was strong support for using transformations for each of the scale levels instead of downsampling factors. So I would be very much in favor of not opening this discussion again.

In summary

  1. I think the points raised here should inform the discussion in #94 and #101
  2. This would be a rather large change to the spec; others should weigh in on this, cc @joshmoore @sbesson. One strong suggestion from my side: I would not intertwine this discussion too much with 1; from experience, discussing separate (even if related) issues/proposals together makes the whole process much more complicated (this is why the current v0.4 release ended up with a much simpler definition of transformations than originally intended)
  3. I am strongly in favor of not opening the discussion about how downscaling is specified again.

constantinpape · Feb 19 '22

@d-v-b, does your :+1: on @constantinpape's summary mean that we can see this issue primarily as added discussion for #94 & #101? If so, do we need to keep it open or find it again when the time comes?

A few additions from my side "inline":

https://github.com/ome/ngff/issues/102#issue-1143730060 it's brittle: copying scale-level0 to a different group won't preserve the spatial metadata, which will inevitably lead to confusion and errors, at least as long as arrays are what clients actually access to get data.

The ability to move arrays is something that has come up a few times in various contexts. At some point we should probably talk through what type of requirement this is (MUST, SHOULD, etc). It will impact various other parts of the spec like naming conventions.

https://github.com/ome/ngff/issues/102#issue-1143730060 For a typical multiscale collection with 5-8 scale levels, this means 5-8 additional fetches of JSON metadata. How bad is this for latency? If the fetches are launched concurrently, I suspect the impact would be minimal, and it will certainly be dwarfed by the time required to ultimately load chunk data

I don't have any concrete numbers but I know that the xarray community is quite convinced of the savings of zarr-level consolidated metadata.

https://github.com/ome/ngff/issues/102#issuecomment-1045233714 I understand that 99% of the time we might be doing 2x windowed averaging,

Reading through the discussion above, I do wonder whether we shouldn't consider multiscales just a short-hand, moving forward, for the more complete model that's being discussed. It might be that when the transforms are in, we will be faced with deciding whether or not that short-hand has a place. Options I can imagine:

  • keep both as separate mechanisms
  • upgrade all multiscales to transforms
  • have multiscales and a more advanced construct extend a common base

Probably the question is if there are any MUST fix issues here. If not, perhaps we can note design guidelines/lessons that we can apply as the spec evolves.

Suggestions

  1. Require that spatial metadata (axes and coordinateTransformations) for arrays reside primarily in array metadata, e.g. scale-level0/.zattrs. When an array's spatial metadata needs to be duplicated, e.g. in image-group/.zattrs:multiscales.datasets, it should be understood that such duplication is only for convenience / performance.
  2. If we do 1, seriously consider removing duplicated array metadata from multiscales, i.e. making multiscales.datasets just a list of references to other arrays with no array-specific metadata. Clients have to do more work to compose the multiscale, but the metadata story is much cleaner.

In general, from my side :+1: for avoiding or at least finding strategies for duplication. As for the group metadata, I'd add to CP's point:

https://github.com/ome/ngff/issues/102#issuecomment-1046003894 (2) There are advantages / disadvantages to both solutions. I don't have a very strong opinion on this a priori, except for the sunk cost that all tools currently supporting ome.zarr are built with "consolidated" metadata at the group level in mind, and that this is a rather large architectural change.

that there will inevitably be refactorings over the next milestones which we can use to re-evaluate these layouts, but that if there are no complete blockers, focusing on adding user value to bring in more applications is probably a better bang for our buck, which is maybe just another way of saying:

https://github.com/ome/ngff/issues/102#issuecomment-1046003894 summary#2 This would be a rather large change to the spec;... I would not intertwine this discussion too much with 1; from experience, discussing separate (even if related) issues/proposals together makes the whole process much more complicated (this is why the current v0.4 release ended up with a much simpler definition of transformations than originally intended)

joshmoore · Feb 21 '22

@d-v-b, does your 👍 on @constantinpape's summary mean that we can see this issue primarily as added discussion for https://github.com/ome/ngff/issues/94 & https://github.com/ome/ngff/issues/101? If so, do we need to keep it open or find it again when the time comes?

Yes, as long as one (or both) of those issues takes up the question of multiscale metadata. But I'm not sure the discussion of consolidated metadata fits in either of those issues (and sorry for putting so much stuff in this issue)...

The ability to move arrays is something that has come up a few times in various contexts. At some point we should probably talk through what type of requirement this is (MUST, SHOULD, etc). It will impact various other parts of the spec like naming conventions.

Regardless of what this spec says, anyone with a standard zarr library can open an array directly, copy it, etc. So the spec should probably treat that access pattern as a given and design around it, which in my mind entails putting array metadata with arrays. That being said, it's possible to imagine wrapping array access in a higher-level API that doesn't foreground direct array access (e.g., xarray, which serializes a dataarray to a zarr group + collection of zarr arrays). But for xarray, this is necessary because the xarray data model involves multiple arrays (data and coordinates).

I don't have any concrete numbers but I know that the xarray community is quite convinced of the savings of zarr-level consolidated metadata.

Agreed, it can be a big performance win. If this is an attractive angle, the right way to approach this is to consider formally supporting metadata consolidation as a transformation of unconsolidated metadata, instead of baking consolidation into the semantics of the metadata. This might mean defining an "ome-ngff data model", which could be serialized in multiple ways (maybe just two ways: consolidated and unconsolidated metadata), as opposed to a concrete specification of how exactly the metadata in a zarr container should look. Maybe this should be yet another issue...
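
For what it's worth, zarr-python already models consolidation roughly this way: .zmetadata is derived from the per-node metadata and can be regenerated, which matches the "consolidation as a transformation of unconsolidated metadata" idea (a minimal sketch; the store path is hypothetical):

```python
import zarr

store = zarr.DirectoryStore("image-group")  # hypothetical location

# Derive .zmetadata from the per-node .zgroup/.zarray/.zattrs; the unconsolidated
# metadata remains the source of truth and this step can simply be re-run.
zarr.consolidate_metadata(store)

# Readers that want the single-fetch behaviour opt in explicitly.
group = zarr.open_consolidated(store, mode="r")
```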

Reading through the discussion above, I do wonder whether we shouldn't consider multiscales just a short-hand, moving forward, for the more complete model that's being discussed. It might be that when the transforms are in, we will be faced with deciding whether or not that short-hand has a place. Options I can imagine:

  • keep both as separate mechanisms
  • upgrade all multiscales to transforms
  • have multiscales and a more advanced construct extend a common base

Probably the question is if there are any MUST fix issues here. If not, perhaps we can note design guidelines/lessons that we can apply as the spec evolves.

I was kind of hoping that the multiscale spec would have little or no interaction with the specification of spaces and transforms. This was (I thought) the conclusion of https://github.com/zarr-developers/zarr-specs/issues/50. I don't think the underlying semantics of a multiscale collection of images is at all complicated -- it's a list of images, each with spatial metadata, with a convention for ordering (increasing grid spacing), and even this could be relaxed to a SHOULD. Spatial metadata for each image composes with this "mere list of images" idea. And I'm happy to discuss this further in the spaces / transformations issues, even if just to say "multiscales should compose with this".

All that being said, I don't think there are any MUST fix issues.

d-v-b · Feb 21 '22

@d-v-b wrote:

I was kind of hoping that the multiscale spec would have little or no interaction with the specification of spaces and transforms. This was (I thought) the conclusion of https://github.com/zarr-developers/zarr-specs/issues/50. I don't think the underlying semantics of a multiscale collection of images is at all complicated -- it's a list of images, each with spatial metadata, with a convention for ordering (increasing grid spacing), and even this could be relaxed to a SHOULD. Spatial metadata for each image composes with this "mere list of images" idea. And I'm happy to discuss this further in the spaces / transformations issues, even if just to say "multiscales should compose with this".

I think there are potentially two forms of multiscale array:

Type 1 (discrete coordinate space): The transforms between scales are strictly translation and scale-only, and furthermore these translation and scale factors are all rational numbers (with small denominators), and often powers of two. You can do useful discrete/integer indexing with this type of multiscale volume. This is by far the most common form of multiscale array in my experience, and in particular is what you normally get by digitally downsampling a single base scale.

Type 2 (continuous coordinate space): The transforms between scales are arbitrary, may involve affine transforms or even displacement fields. Integer indexing is not useful for this type of volume --- you will almost surely do everything via continuous coordinates and interpolation. This is what you might get from imaging at multiple optical zoom levels. This is strictly a generalization of type 1. (This type of multiscale volume is similar to the internal representation used by Neuroglancer.)

The current proposal, in that it allows arbitrary transformations between scales, seems to be geared towards type 2.

In my mind, it seems very natural that OME-zarr concern itself primarily with continuous coordinate space stuff, and there could be a separate zarr-multiscale standard for handling type 1. I would say though that since type 1 is by far more common, it may make sense to focus on standardizing type 1 multiscale arrays first.

I think this distinction also relates to the previous discussion of whether to represent scales in terms of downsampling factors or in terms of "absolute" physical units. For type 1 I think there is a clear case to represent the scales via rational number downsampling factors (since that preserves the ability to do discrete indexing), while for type 2 it is less clear.

jbms · Feb 21 '22

@jbms I'm not sure I follow the logic here. In all cases arrays are defined over a finite set of coordinates (the array indices). When these coordinates are mapped into world coordinates via an affine transform (or any other bijective transform), the cardinality of the coordinates is unchanged. A visualization tool may introduce a continuous coordinate space by rendering data at coordinates between true coordinate values via interpolation, but this is purely a concern of that tool.

d-v-b · Feb 21 '22

@d-v-b One example I have in mind is applying a neural network model that takes input patches at multiple scale levels that are supposed to be aligned to each other in a certain way, e.g. a common center position and certain relative scales, e.g. 1x1x1, 2x2x2, 4x4x4; each successive scale may have the same voxel dimensions but cover a larger physical area.

If we have a type 1 (discrete coordinate space) multiscale array, we can just check that, for each scale level required by the model, there is a scale in the multiscale array with exactly the desired downsample factors. We can then just read from these arrays without any interpolation and feed the data into the model.
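
A sketch of that check under the type 1 assumption, where downsample factors are stored exactly (the per-scale metadata here is hypothetical), so selecting the inputs for the model is exact integer/rational matching rather than floating-point comparison:

```python
from fractions import Fraction

# Hypothetical per-scale metadata: exact downsample factors relative to the base scale.
available = {
    "s0": (Fraction(1), Fraction(1), Fraction(1)),
    "s1": (Fraction(2), Fraction(2), Fraction(2)),
    "s2": (Fraction(4), Fraction(4), Fraction(4)),
}

# Factors required by the model, e.g. aligned 1x1x1, 2x2x2, 4x4x4 patches.
required = [(1, 1, 1), (2, 2, 2), (4, 4, 4)]

# Exact comparison: either a level with precisely these factors exists or it
# doesn't; no rounding tolerance is needed.
selected = {
    want: next(path for path, have in available.items() if have == want)
    for want in required
}
print(selected)  # {(1, 1, 1): 's0', (2, 2, 2): 's1', (4, 4, 4): 's2'}
```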

If we have a type 2 (continuous) multiscale array, then it seems it would be much more difficult to apply the neural network model. We have to somehow decide which of the scale levels we want to read from (and that is not necessarily at all obvious if they are not simple scale-and-translation-only), and then we have to interpolate to get the resolution expected by the model if it does not exactly match. Furthermore, even if it is just a scale-and-translation-only transform, to decide if the resolution exactly matches what is expected by the model, we need to use floating-point arithmetic which is subject to rounding and loss of precision.

jbms · Feb 21 '22