
Propose resolution _groups_ for xarray support

Open joshmoore opened this issue 2 years ago • 34 comments

In discussing with the xarray community, the one change to the NGFF specification that needs to occur to prevent errors being raised when opening a multiscale is for each resolution array to live in a separate group. This has already been tested by thewtex in https://github.com/spatial-image/spatial-image-multiscale and the current spec is permissive enough to allow it. The proposal here would enforce the subdirectories moving forward.

The conflict in xarray stems from the fact that each of our subresolutions has the same dimension names ("x", "y", etc.) but different sizes. This is not allowed in the xarray (nor NetCDF) model. An added benefit of this change is that other arrays with the same resolution levels and the same dimensions (e.g. labels!) could be stored together:

    ├── resolution-N/.zgroup
    │   ├── image/.zarray
    │   └── label/.zarray
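
The conflict is easy to reproduce. Below is a minimal sketch (array sizes are illustrative) of two resolution levels with shared dimension names failing in a single xarray.Dataset, versus one Dataset per resolution group:

```python
import numpy as np
import xarray as xr

# Two resolution levels share dimension names ("y", "x") but differ in size.
level0 = xr.DataArray(np.zeros((512, 512)), dims=("y", "x"))
level1 = xr.DataArray(np.zeros((256, 256)), dims=("y", "x"))

# One Dataset (i.e. one netCDF group) cannot hold both:
try:
    xr.Dataset({"0": level0, "1": level1})
except ValueError as err:
    print("conflict:", err)

# One group per resolution sidesteps the conflict, and leaves room for a
# same-grid label array alongside the image:
ds0 = xr.Dataset({"image": level0})
ds1 = xr.Dataset({"image": level1})
```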

cc: @thewtex @aurghs @malmans2 see: #48

joshmoore avatar Apr 21 '22 09:04 joshmoore

An added benefit of this change is that other arrays with the same resolution levels and the same dimensions (e.g. labels!) could be stored together

What is the advantage of this? A downside is that it couples the downsampling process for raw data to the downsampling process for labels (or any other image in the collection). Imagine if I want raw data downsampled by 2x2x2, but labels downsampled by 4x4x4, then the proposed layout becomes tricky to parse. I think it's conceptually cleaner to group by data type (raw, labels, etc) than grouping by resolution.

d-v-b avatar Apr 21 '22 13:04 d-v-b

Intensity images, labels, and masks are often sampled on the same voxel grid. It is common to use them together, and this grouping makes the association easy to identify and use. This pattern is what led to the development of the xarray Dataset, which this change enables.

There is no constraint that every intensity image has to have a label image, or that every label image has to have an intensity image, at the downsampled resolutions.

thewtex avatar Apr 22 '22 11:04 thewtex

Intensity images, labels, and masks are often sampled on the same voxel grid. It is common to use them together, and this grouping makes the association easy to identify and use. This pattern is what led to the development of the xarray Dataset, which this change enables.

I'm all for using xarray.Dataset, but I have some datasets with over 80 different label images. I would not want the default behavior to be treating the set of all label images as a single xarray.Dataset. Instead, I would prefer that each label image is self-contained, and I combine them into xarray.Dataset instances based on the needs of a specific application.

d-v-b avatar Apr 22 '22 12:04 d-v-b

If you prefer that each label image is self-contained, and to combine them into xarray.Dataset instances based on the needs of a specific application, then you can do that. I hear you. But that should not block people who want to store a label image together with its intensity image.

thewtex avatar Apr 25 '22 19:04 thewtex

Ah, apologies. Late to the conversation. Thanks, both. I'm interpreting @d-v-b's last :+1: to mean that he is not proposing a MAY NOT for re-use of the pyramid group but would like to keep it as a MAY (and not SHOULD). Does that sound right? If so, I'll try to clarify the language.

joshmoore avatar Apr 26 '22 07:04 joshmoore

I'm still not sure what to think here... the :+1: was to signify that in some situations someone might want to pack scale levels together. But for a specific format, like OME-NGFF, I don't see the appeal of a) packing scale levels together, and b) supporting multiple ways of representing the same thing.

On the contrary, I think the format should specify just one way to organize images, unless there's a really powerful (e.g., "representation X is impossible on storage backend Y") argument for polymorphism here. And if we are specifying just one way to organize images, I would strongly advocate an organization scheme that keeps separate multiscale images in separate folders / prefixes. This facilitates an access pattern where multiscale images are read / written independently, which I think is pretty common. Grouping by scale levels, on the other hand, facilitates an access pattern where the same scale for all images is read / written at once, which I think is pretty uncommon (and not very scalable).

d-v-b avatar Apr 26 '22 14:04 d-v-b

b) supporting multiple ways of representing the same thing.

Guess in my mind this isn't really a new way of storing things, since it's already possible at the moment. The metadata of the datasets array is definitive on where to find the related arrays.

I would strongly advocate an organization scheme that keeps separate multiscale images in separate folders / prefixes.

(At the risk of contradicting myself) this I can see even from the metadata level. I would like to make the necessary changes to the spec so that there would only be one image (i.e. multiscale) in a zgroup.

I see @d-v-b's dilemma if I try to combine those two thoughts, since the only way would be to reference some common space outside of the current group by a "../" style reference.

joshmoore avatar Apr 26 '22 15:04 joshmoore

The metadata of the datasets array is definitive on where to find the related arrays.

IIRC, the purpose of including the path metadata in datasets was to allow a single instance of multiscale metadata to describe multiple multiscale collections within the same prefix / folder (e.g., a gaussian pyramid stored alongside a laplacian-of-gaussian pyramid), and to allow some flexibility in the names given to the different scale levels. Technically someone could use this metadata to denote a multiscale pyramid that's stored in a totally different zarr container, or a different format entirely, but this is not the intended purpose of that metadata (as far as I understand it), and so we shouldn't be bound to support that usage.
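
For concreteness, here is a simplified sketch (as a Python dict) of how a single group's multiscales metadata could use the path field to describe two pyramids side by side. The coordinateTransformations entries required by the 0.4 spec are omitted for brevity, and the "gaussian"/"log" names are illustrative, not prescribed by the spec:

```python
# Two multiscale entries in one group, distinguished via "path".
multiscales = [
    {
        "name": "gaussian",
        "axes": [{"name": "y", "type": "space"}, {"name": "x", "type": "space"}],
        "datasets": [{"path": "gaussian/s0"}, {"path": "gaussian/s1"}],
    },
    {
        "name": "laplacian-of-gaussian",
        "axes": [{"name": "y", "type": "space"}, {"name": "x", "type": "space"}],
        "datasets": [{"path": "log/s0"}, {"path": "log/s1"}],
    },
]
```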

d-v-b avatar Apr 26 '22 15:04 d-v-b

which I think is pretty uncommon (and, not very scalable).

In practice, storing related images (different features, different modalities, label images, masks) that are sampled on the same sampling grid is extremely common. This is what motivated xarray Datasets, and it does not make sense to bother with Datasets without this organization. The Dataset organization enables simple and direct identification of volumes whose pixels correspond. It has proven to be very scalable in the geospatial community.

thewtex avatar Apr 28 '22 01:04 thewtex

The Dataset organization enables simply and direct identification of volumes where pixels correspond.

Yes, this storage layout is surely convenient where sampling grids match, but I can only see this working if you constrain sampling grids to match for all images.

Here's a realistic example of images I work with:

  • raw EM data: 6nm (12nm, 24nm, ...)
  • Semantic predictions from ML network: 4nm (8nm, 16nm, ...)
  • Light microscopy: 50nm (100nm, 200nm, ...)

None of them are on the same sampling grid. How would you store them in the scheme you are proposing?

d-v-b avatar Apr 28 '22 14:04 d-v-b

For volumes that are not on the same sampling grid, they would not carry the useful indication that they are sampled on the same grid -- they would be stored in different groups, just like they are now. There is no additional constraint that prevents them from being stored as they are now.

thewtex avatar May 02 '22 20:05 thewtex

it does not [make] sense to bother with Datasets without this organization.

I think this statement is too strong, @thewtex. I fall very much on @d-v-b's side of things here, I think it absolutely makes sense to group together datasets with different pixel spacings or even grid orientations.

Now, whether the engineering constraints on the xarray side are steadfast, or can be remedied upstream, I don't know, but my personal instinct would be to push back on that constraint a bit, rather than harden the spec on the ome-ngff side. The only place where I find this layout compelling is for multiple channels — which is probably where the geosciences applications come from?

jni avatar Jul 10 '22 11:07 jni

I think it absolutely makes sense to group together datasets with different pixel spacings or even grid orientations.

We are in agreement! It absolutely does make sense to group together datasets with different spacings or grid orientations. However, that is not a reason to push back on the documentation update in this PR. Datasets with different spacings and grid orientations can and should be able to be grouped together; that can be done independently of this update, and this change does not constrain the creation of that type of dataset association.

The proposal actually prevents a potential over-constraint that the pixel arrays live in the same group. This is unlikely to require any changes to existing code, because accessing a group or a nested group is done the same way. Currently, the NGFF standard does not explicitly say whether the pixel arrays are in the same group or in separate groups; this would explicitly say that they can be in different groups.

In practice, this means that NGFF can be compatible with Xarray and NetCDF, and I think we can all agree that it is in the interest of both standards to make them compatible, if possible. The nested group in Xarray/NetCDF is a reasonable approach to a need that makes sense (store the image pixel coordinates alongside the pixel data). And the use case is similar in the geosciences as in bioimaging, medical imaging, and microscopy: work with multi-dimensional images as numpy arrays with the same shape: multiple frequencies, multiple sensors, derived feature images, label images. Even if someone does not use this functionality, I do not think we should unnecessarily over-constrain NGFF in a way that makes it incompatible with Xarray/NetCDF.

thewtex avatar Jul 10 '22 18:07 thewtex

Ah, thanks @thewtex, I should have looked at the spec rather than the short summary and the ensuing discussion. As I see it, the essence of this PR is to put different resolution levels in different groups rather than different arrays within one group. (?) If one wants the groups to be singletons, that's entirely fine. (?)

Also, groups are hierarchical, (?) meaning it's totally fine to have groups of groups, ie groups of multiscale data. (?)

Given all this, I'm ok with this PR. 😅

jni avatar Jul 19 '22 11:07 jni

Also, groups are hierarchical, (?) meaning it's totally fine to have groups of groups, ie groups of multiscale data. (?)

@jni yes, that's it, sorry if the explanation was not clear. There can be additional associations of data through grouping. As we continue to make progress on the spec, we can add associations to meet needs.

thewtex avatar Jul 21 '22 14:07 thewtex

See discussion in https://github.com/Unidata/netcdf-c/issues/2474 which suggests as part of this effort (and ASAP e.g. v0.5 if not retroactively for the previous versions) _ARRAY_DIMENSIONS should be moved into the individual xarray-compatible groups or stripped entirely.

joshmoore avatar Sep 09 '22 07:09 joshmoore

See discussion in Unidata/netcdf-c#2474 which suggests as part of this effort (and ASAP e.g. v0.5 if not retroactively for the previous versions) _ARRAY_DIMENSIONS should be moved into the individual xarray-compatible groups or stripped entirely.

How about making a patch release (0.4.1) for this?

constantinpape avatar Sep 09 '22 07:09 constantinpape

How about making a patch release (0.4.1) for this?

So far, I see this proposal includes breaking changes in terms of the data layout, so I don't think a patch release is a viable option in its current form.

Semi-related: is the proposal to exclusively support the new layout, i.e. to have OME-NGFF 0.x fully compatible with the netcdf/xarray model? Or would we have a period of transition where both layouts would be supported? One way or another, this decision will have implications for implementations, both readers and writers.

sbesson avatar Sep 09 '22 08:09 sbesson

Thanks for the clarification @sbesson; I went through the whole discussion in more detail now and here are my thoughts:

_ARRAY_DIMENSIONS should be [...] stripped entirely.

This could be done as a patch release (and is what I was referring to), but it does not help w.r.t. compatibility with xarray.

_ARRAY_DIMENSIONS should be moved into the individual xarray-compatible groups [...]

Indeed, this is a breaking change and should not be done as a patch release (and for sure not retroactively, this would invalidate the v0.4 data that is out there!).

For the changes here: I guess we have two options:

  • adapt the layout changes proposed here for xarray compatibility and release them as v0.5, since this will be a breaking change (we still need to figure out what exactly to do about multiple arrays in one resolution group)
  • not go ahead with this, which would mean either not supporting xarray, or asking for changes upstream so that xarray supports the ome format (which is unlikely to happen soon, from what I recall of previous conversations)

I am in favor of option 1 since I do believe xarray support is important and this is the only feasible way to get there. (Although it will need a bit of refactoring in readers and writers...)

Semi-related, is the proposal to exclusively support the new layout i.e. have OME-NGFF 0.x fully compatible with the netcdf/xarray model. Or would we have a period of transition where both layouts would be supported?

I think having a transition period would make things complicated. If we decide to go with this change, we should stick to the versioning and require 0.5 to have the new format.

constantinpape avatar Sep 09 '22 09:09 constantinpape

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-ngff-community-call-transforms-and-tables/71792/5

imagesc-bot avatar Sep 20 '22 15:09 imagesc-bot

Hey all, thought I should leave some comments here after the spec call this week.

tldr

  • I don't think this is the right model for xarray compatibility, as this makes interaction with xarray lossy
    • I'm broadly in agreement with @d-v-b on other points here.
  • I think we can get much richer xarray compatibility via xarray BackendEntrypoints (POC at bottom). This does not require _ARRAY_DIMENSIONS or the change proposed here.

This PR is more about xr.open_zarr compatibility than xarray compatibility

I'm completely onboard with xarray compatibility, but I'm not convinced allowing xr.open_zarr instead of xr.open_dataarray(..., engine="ome") or even ome.read_image(...) is valuable. I would strongly prefer the "obvious way" to do IO with OME arrays used OME dimensional information, which xr.open_zarr will not. I would suspect xarray devs would agree, as they're working towards this specific use-case:

I would note that if we define an OME backend, we may not even have to specify the engine due to the guess_can_open API for the backend interface.
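
As a hedged sketch of that idea, here is the kind of heuristic such a backend's guess_can_open could apply; the suffix check is an assumption for illustration, not part of any spec:

```python
# Hypothetical predicate a registered OME BackendEntrypoint could use so
# xarray picks the engine automatically instead of requiring engine="ome".
def guess_can_open(filename_or_obj) -> bool:
    try:
        path = str(filename_or_obj)
    except Exception:
        return False
    # Assumed naming convention for OME-Zarr stores.
    return path.endswith((".zarr", ".ome.zarr"))
```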

netcdf compatibility

@joshmoore has brought up the topic of netcdf compatibility.

The netcdf-c library does not seem to be able to open a zarr store which isn't formatted with the netcdf or xarray schemas (Unidata/netcdf-c#2474). I don't think this incompatibility is worth changing the spec for.

  • A reader which can only read netcdf compliant stores won't be able to read the table format
  • I think it's reasonable to say a zarr implementation must (at minimum) be able to pass the test suite at zarr-developers/zarr-implementations to read an OME store.

Splitting the zarr library out of netcdf-c would resolve this.

We should control IO

I think there's a lot of value in maintaining control of IO. I actually think we can do a better job at "xarray compatibility" if we do.

  • Instead of needing explicit arrays for labeling dimensions, we could use implicit ones with the new xarray index types
  • The tables group could (eventually) be loaded as part of an xarray.DataTree (without having to update the schema)
  • We should be doing validation at IO time, especially O

If anything, I'd say the above makes us MORE compatible with xarray, since we'd be able to provide deeper integration with its features.

You could say that we can have our own controlled IO and also be compatible with xarray's zarr schema, but I still think that'd be a bad option. Why go out of our way to allow reading and writing that uses an orthogonal coordinate system? Why make it easy to read the file wrong?

Quick demo

As a quick demo of what this could look like, with some very hacky code:

OMEBackend class definition

Workaround from: https://github.com/aurghs/ome-datatree/blob/a8cb7729156b0ec7b09e909cb0d4e43ddfc200f3/ome_datatree/ome_datatree.py#L23-L36

import zarr, xarray as xr
from xarray.backends import ZarrStore, BackendEntrypoint

from collections import namedtuple


DummyStore = namedtuple("DummyStore", ("zarr_group",))

def open_ome_array(zarr_array: zarr.Array):
    from xarray.core import indexing
    from xarray import Variable
    from xarray.backends.zarr import ZarrArrayWrapper
    
    parent_pth, name = zarr_array.path.rsplit(sep="/", maxsplit=1)
    parent = zarr.Group(zarr_array.store)[parent_pth]
    store = DummyStore(parent)

    data = indexing.LazilyIndexedArray(ZarrArrayWrapper(name, store))

    # TODO: do a better job of grabbing metadata
    dimensions = [dim["name"] for dim in parent.attrs["multiscales"][0]["axes"]]
    attributes = dict(zarr_array.attrs)
    attributes.update(dict(parent.attrs))

    encoding = {
        "chunks": zarr_array.chunks,
        "preferred_chunks": dict(zip(dimensions, zarr_array.chunks)),
        "compressor": zarr_array.compressor,
        "filters": zarr_array.filters,
    }
    # _FillValue needs to be in attributes, not encoding, so it will get
    # picked up by decode_cf
    if getattr(zarr_array, "fill_value") is not None:
        attributes["_FillValue"] = zarr_array.fill_value

    return xr.DataArray(Variable(dimensions, data, attributes, encoding), name=name)


class OMEBackend(BackendEntrypoint):
    def open_dataset(
        self,
        filename_or_obj,
        *,
        drop_variables=None,
    ):
        assert isinstance(filename_or_obj, zarr.Array)
        assert drop_variables is None

        data_array = open_ome_array(filename_or_obj)

        return xr.Dataset(
            {data_array.name: data_array}
        )

z_remote = zarr.open(
    "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0076A/10501752.zarr",
    mode="r"
)
da = xr.open_dataarray(z_remote["labels/0/0"], engine=OMEBackend)
da

Which opens a backed DataArray that uses the OME metadata for its dimensions and attrs:


I think this can be made quite powerful with the new coordinate systems, and will look into extending it once there are examples.

ivirshup avatar Oct 08 '22 21:10 ivirshup

Hey @ivirshup,

Thanks for sharing your thoughts and code.

This PR is more about xr.open_zarr compatibility than xarray compatibility

This PR is about compatibility with netCDF. xarray and xr.open_zarr compatibility come for free.

xarray Datasets are based on netCDF groups. And OME multiscale images can be mapped to the proposed higher-order xarray.DataTree in a natural way.

An OME xarray.backend is a good idea; great job on a draft implementation. And, this standard clarification avoids unnecessary complexity in that implementation.

Yes, OME-NGFF and netCDF are different standards that do not overlap 100%. However, we should strive for compatibility when possible. We will not get 100% of NGFF functionality this way, but that does not mean the functionality that results is not valuable. Few, if any, single pieces of software implement 100% of the functionality of even the current, relatively minimal OME-NGFF standard: high-content screening, axes, bioformats2raw.layout, coordinateTransformations, multiscales, omero, labels, image-label, plate, well. That does not mean the current ecosystem of software striving for OME-NGFF support does not have value.

The value of standards means that we do not need to control all the related software. Indeed, this is an extremely important quality because it allows the ecosystem to flourish. And everyone benefits as a result.

Beyond xarray, a sampling of other software tools supporting the NetCDF standard:

ANDX (ARM NetCDF Data eXtract) and ANAX (ARM NetCDF ASCII eXtract)
ANTS (ARM NetCDF Tool Suite)
ARGOS (interActive thRee-dimensional Graphics ObServatory)
CDAT (Climate Data Analysis Tool)
CDFconvert (Convert netCDF to RPN and GEMPAK Grids)
cdfsync (network synchronization of netCDF files)
CDO (Climate Data Operators)
CIDS Tools
CSIRO MATLAB/netCDF interface
EPIC
Excel Use
EzGet
FAN (File Array Notation)
FERRET
FIMEX (File Interpolation, Manipulation, and EXtraction)
FWTools (GIS Binary Kit for Windows and Linux)
GDAL (Geospatial Data Abstraction Library)
GDL (GNU Data Language)
Gfdnavi (Geophysical fluid data navigator)
Gliderscope
GMT (Generic Mapping Tools)
Grace
GrADS (Grid Analysis and Display System)
Gri
GXSM - Gnome X Scanning Microscopy project
HDF (Hierarchical Data Format) interface
HDF-EOS to netCDF converter
HIPHOP (Handy IDL-Program for HDF-Output Plotting)
HOPS (Hyperslab OPerator Suite)
iCDF (imports chromatographic netCDF data into MATLAB)
IDV (Integrated Data Viewer)
Ingrid
Intel Array Visualizer
IVE (Interactive Visualization Environment)
JSON format with the ncdump-json utility
Java interface
Kst (2D plotting tool)
Labview interface
MBDyn (MultiBody Dynamics)
Max_diff_nc
MeteoExplorer
MeteoInfo
MexEPS (MATLAB interface)
MEXNC and SNCTOOLS (a MATLAB interface)
Mirone (Windows MATLAB-based display)
ncBrowse (netCDF File Browser)
nccmp (netCDF compare)
ncdx (netCDF for OpenDX)
ncensemble (command line utility to do ensemble statistics)
NCL (NCAR Command Language)
NcML-Java Binding
NCO (NetCDF Operators)
ncregrid
nctoolbox (a MATLAB common data model interface)
NCSTAT
ncview
ncvtk
NetCDF Ninja
netcdf tools
netcdf4excel (add-in for MS Excel)
NetCDF95 alternative Fortran API
Objective-C interface
Octave interface
Octave interface (Barth)
OPeNDAP (formerly DODS)
OpenDX (formerly IBM Data Explorer)
Panoply
PnetCDF
Paraview and vtkCSCSNetCDF
Perl interfaces
PolyPaint+
Pomegranate
Pupynere (PUre PYthon NEtcdf REader)
PyNGL and PyNIO
Python interfaces
QGIS (Quantum GIS)
R interface
Ruby interface
Scientific DataSet (SDS) Library
Apache Spatial Information System (SIS)
Tcl/Tk interfaces
Tcl-nap (N-dimensional array processor)
Visual Basic and VB.net
VisAD
Weather and Climate Toolkit (WCT)
WebWinds
xdfv (A slick NetCDF/HDF4/HDF5 contents viewer with developers in mind)
xray (Python N-D labelled arrays)
Zebra

Wouldn't it be cool if we could open OME-NGFF images, even in a basic way, without having to fork and hack and control each one? And wouldn't it be cool if the community of researchers using these tools could use software tools from the OME-NGFF community, even in just a basic way?

I strongly think we should avoid unnecessarily resisting compatibility with other standards, software tools, and research communities.

thewtex avatar Oct 20 '22 20:10 thewtex

This PR is about compatibility with netCDF.

I disagree with this. The title, branch name, and description are pretty specific to xarray – as is most of the discussion. The referenced issue is titled: "Compatibility with xarray".


Wouldn't it be cool if we could open OME-NGFF images,

I agree compatibility with existing tools is useful, however:

  1. I think how useful is a fair question: what specific features are we getting here? xarray (listed as xray above) is a great example. Compatibility with xarray would be extremely useful, and worth considering, but we don't need netCDF compat for that.

  2. I also have strong suspicions about whether many of the tools listed actually work with a zarr netcdf store.

  3. My current understanding of plans for the ome-ngff spec:

  • You can't store different images in the same multiscales group
  • Label masks won't be stored alongside images

My understanding is that netcdf uses arrays being stored in the same groups to indicate that they should be used together (e.g. in an xarray.Dataset). I think this limits how useful having a netcdf compatible tool read directly from an ome-zarr store can be.


The value of standards means that we do not need to control all the related software. Indeed, this is an extremely important quality because it allows the ecosystem to flourish. And everyone benefits as a result.

I agree, but I think a lot of the potential here is actually realized from building on a standard like zarr, rather than going to a standard on top of zarr.


Alternative vision for netcdf compat

I think it would be quite easy to create a view of an ome-zarr store that was compatible with netcdf usage. This could be done with references (e.g. symlinks) and metadata transformations. Not so different from:

 > [I] prefer that each label image is self-contained, then combine them into xarray.Dataset instances based on the needs of a specific application

But with the added benefit that the format itself can keep the "one way to store an image", while having broader compatibility with netcdf.
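
A hedged sketch of such a view, assuming a directory-backed store: symlink the array's files unchanged and rewrite only the attributes to add _ARRAY_DIMENSIONS. The helper and paths are illustrative, and a real view would merge rather than replace existing attributes:

```python
import json
import os

# Hypothetical helper: build an xarray/netCDF-compatible view of one
# zarr array directory via symlinks plus rewritten .zattrs metadata.
def make_xarray_view(ome_array_dir: str, view_dir: str, dims: list) -> None:
    os.makedirs(view_dir, exist_ok=True)
    for name in os.listdir(ome_array_dir):
        if name == ".zattrs":
            continue  # metadata is rewritten below, not linked
        os.symlink(
            os.path.join(os.path.abspath(ome_array_dir), name),
            os.path.join(view_dir, name),
        )
    # Simplified: a real view would merge the original attributes in too.
    with open(os.path.join(view_dir, ".zattrs"), "w") as f:
        json.dump({"_ARRAY_DIMENSIONS": dims}, f)
```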

Alternative alternative vision for netcdf compat

Another vision would be to go full on netCDF, and layer all of OME-NGFF on top of it. I would assume this conversation has happened before here.

ivirshup avatar Oct 22 '22 14:10 ivirshup

The changes proposed in this PR (storing scale levels in separate groups) opens up two possibilities:

  • We can put multiple images that happen to have the same coordinates in one directory.

As I noted earlier in this issue, I don't love this idea. I stand by the principle that we should have just 1 way of organizing multiscale images, and it should be a way that isolates different multiscale images from each other.

  • We can store coordinate arrays alongside image data, like xarray's zarr encoding.

I actually quite like this idea, but it would be a radical departure from the ongoing conversations about transformation metadata (where the assumption is that transformations, and thus coordinates, are defined in JSON metadata). It seems premature to change the spec to support coordinate arrays before there's a concrete proposal to actually use coordinate arrays for OME-NGFF. I should note that we only get xarray compatibility "for free" if we use exactly their zarr encoding, and the real blocker for that is the absence of coordinate arrays in OME-NGFF. I would love free compatibility with xarray, but this PR doesn't actually bring us closer to it unless we have xarray-compatible coordinates.

@thewtex is there anything I'm missing here? Specifically, is there any way to get xarray compatibility without using coordinate arrays in OME-NGFF?

d-v-b avatar Oct 23 '22 20:10 d-v-b

As I noted earlier in this issue, I don't love this idea. I stand by the principle that we should have just 1 way of organizing multiscale images, and it should be a way that isolates different multiscale images from each other.

I would strongly agree with this point.

I actually quite like this idea, but it would be a radical departure from the ongoing conversations about transformation metadata (where the assumption is that transformations, and thus coordinates, are defined in JSON metadata)... and the real blocker for that is the absence of coordinate arrays in OME-NGFF. I would love free compatibility with xarray but this PR doesn't actually bring us closer to it unless we have xarray-compatible coordinates.

I believe the coordinates and displacements transformations kinda allow this, but it isn't strictly compatible with netcdf. The coordinates can live anywhere (the metadata just needs to be a path), and you don't need to use coordinates.

ivirshup avatar Oct 27 '22 14:10 ivirshup

The title, branch name, and description are pretty specific to xarray

I think how useful is a fair question.

There are many mentions of xarray. It is worth inspecting why:

  1. The proposal is to place images in groups.
  2. What is the motivation?

Regarding 1), by placing the image pixel array and metadata in a common group, we gain compatibility with the netCDF groups. And xarray is based on the netCDF data model:

Xarray provides two primary data structures, the xarray.DataArray and the xarray.Dataset

Xarray’s highest-level object is currently an xarray.Dataset, whose data model echoes that of a single netCDF group.

and extensions to Xarray's data model (labeled arrays without coordinates, and hierarchical data via the xarray DataTree) are intentionally compatible with the netCDF data model,

WIP implementation of a tree-like hierarchical data structure for xarray.

This aims to create the data structure discussed in xarray issue #4118, and therefore extend xarray's data model to be able to handle arbitrarily nested netCDF4 groups.

The approach used here is based on benbovy's DatasetNode example - the basic idea is that each tree node wraps up to a single xarray.Dataset.

Note that nodes of the tree are xarray.Dataset's and not xarray.DataArray's.

Deviation from the netCDF data model means deviation from the standard model used by the geospatial research community. This model has been around for decades, is used by other software, and does its job well. It is a standard with a community of existing data and software that support it. There are other Python libraries supporting netCDF, and other software built on the netCDF C and Java libraries.

Regarding 2), maybe some folks are only interested in the possible use of xarray as another Python library. Speaking for myself at least, I am interested in broader compatibility between open data and open source software, xarray and beyond, and between the geospatial and scientific imaging research communities. I would like the ability to use the same software developed for climate research in cancer research, and vice versa. Many of the algorithms developed do not care whether the pixels come from clouds or cells.

This means compatibility between the OME-NGFF and netCDF data models. We get a lot of value from being able to load pixel data and dimension names. This is why the labelled array without coordinates is mentioned and considered. Many times, just the pixel data array goes a long way.

By placing the image in a group, we gain compatibility of pixel data array between OME-NGFF and netCDF.

create a view of an ome-zarr store that was compatible with netcdf usage. This could be done with references (e.g. symlinks) and metadata transformations.

Yes, this alternative is worth considering, but it adds unnecessary complexity for just accessing pixel data. And implementing it across all the software that supports the netCDF data model is not scalable or sustainable.

Another vision would be to go full on netCDF, and layer all of OME-NGFF on top of it. I would assume this conversation has happened before here.

We do not want to shoehorn all of OME-NGFF into netCDF, but that is not what is proposed.


I also have strong suspicions about whether many of the tools listed actually work with a zarr netcdf store.

Many tools are based on the Unidata netCDF C library, which is getting zarr support, as previously mentioned, and that will trickle down.


I agree, but I think a lot of the potential here is actually realized from building on a standard like zarr, rather than going to a standard on top of zarr.

This approach was taken many times in the TIFF ecosystem -- parties came along, built on the TIFF format, and created their own data models that did not share common tags. Sure, transformations could be implemented. But this causes unnecessary pain when common information is desired. Wikipedia's characterization of TIFF:

> TIFF is a complex format, defining many tags of which typically only a few are used in each file. This led to implementations supporting many varying subsets of the format, a situation that gave rise to the joke that TIFF stands for Thousands of Incompatible File Formats.

We should seek compatibility when possible and appropriate. Loading pixel arrays is important.


> it would be a radical departure from the ongoing conversations about transformation metadata (where the assumption is that transformations, and thus coordinates, are defined in JSON metadata). It seems premature to change the spec to support coordinate arrays before there's a concrete proposal to actually use coordinate arrays for OME-NGFF.

@d-v-b I agree with you. We should not bring xarray's support for netCDF data coordinates into OME-NGFF unless it is appropriate.

And this proposal does not add coordinates to OME-NGFF.

We can store coordinate arrays alongside the image data, as xarray's zarr encoding does.

Note that xarray can save and load Datasets with and without coordinate arrays. They are not required.


> Specifically, is there any way to get xarray compatibility without using coordinate arrays in OME-NGFF?

Yes!

  1. Merge this PR. :-). For pixel array data and dimension labels.

  2. @ivirshup has the excellent idea to create an xarray OME backend, which could populate coordinate arrays based on the OME spatial transformation metadata.
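As a rough sketch of what such a backend could derive (the function name here is hypothetical, and the half-pixel translation is just an illustrative value): given an NGFF-style per-axis scale and translation, the physical coordinate of sample i along an axis is translation + i * scale.

```python
# Hypothetical helper sketching idea 2: derive explicit coordinate arrays
# from NGFF-style "scale" and "translation" transformation metadata.
def coords_from_transforms(shape, scale, translation):
    """Coordinate of sample i along each axis: translation + i * scale."""
    return [
        [t + i * s for i in range(n)]
        for n, s, t in zip(shape, scale, translation)
    ]

# A level downsampled by 2 with a half-pixel offset, so that downsampled
# pixel centers sit between the original ones (values are illustrative):
coords = coords_from_transforms(shape=(4, 4), scale=(2.0, 2.0), translation=(0.5, 0.5))
print(coords[0])  # [0.5, 2.5, 4.5, 6.5]
```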

thewtex avatar Oct 27 '22 22:10 thewtex

> Note that xarray can save and load Datasets with and without coordinate arrays. They are not required.

Coordinate arrays are required if you want xarray to know about the coordinates of your data. Loading OME-NGFF data into xarray without the coordinates (specified implicitly via coordinateTransformations) would not be a good use of xarray.

> Specifically, is there any way to get xarray compatibility without using coordinate arrays in OME-NGFF?
>
> Yes!
>
> 1. Merge this PR. :-). For pixel array data and dimension labels.
>
> 2. @ivirshup has the excellent idea to create an xarray OME backend, which could populate coordinate arrays based on the OME spatial transformation metadata.

I believe that only option 2 is needed, because the hypothetical xarray OME backend would be free to create xarray.Dataset instances from collections of zarr arrays, so putting images in separate groups would not be needed.
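For background on the constraint being discussed, here is a short sketch (assuming xarray and numpy are installed; variable names are illustrative) of why resolution levels map to separate Datasets in the xarray model:

```python
# Arrays on the same voxel grid (e.g. image + label) can share one
# Dataset, but the same dimension names with different sizes cannot
# coexist in a single Dataset.
import numpy as np
import xarray as xr

# One scale level: image and label share the grid.
level0 = xr.Dataset({
    "image": (("y", "x"), np.zeros((8, 8))),
    "label": (("y", "x"), np.zeros((8, 8), dtype="uint8")),
})

# A downsampled level must be its own Dataset.
level1 = xr.Dataset({"image": (("y", "x"), np.zeros((4, 4)))})

# Mixing sizes under one set of dimension names is rejected by xarray:
try:
    xr.Dataset({
        "s0": (("y", "x"), np.zeros((8, 8))),
        "s1": (("y", "x"), np.zeros((4, 4))),
    })
except ValueError as err:
    print("conflict:", err)
```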

In fact, I think we should aim for a situation where OME-NGFF images are either read correctly by xarray (i.e., with coordinates) or not at all, with no middle ground. Without either a) xarray-compatible coordinate arrays or b) the assurance of an OME-NGFF backend for xarray, this PR enables lossy deserialization of OME-NGFF images by xarray, which could lead to massive confusion. For example, after this PR, someone might load an OME-NGFF collection into xarray, generate some coordinates (because the OME-NGFF transformations were ignored by xarray), and then use Dataset.to_zarr(), without noticing that this generates an invalid OME-NGFF container. Until the xarray support is complete and lossless, we shouldn't encourage people to use it with OME-NGFF data.

d-v-b avatar Oct 27 '22 23:10 d-v-b

This proposal supports correct use of xarray, including coordinates. It is not necessary to support loading into xarray only through overly complex transformations that work just in specific implementations and intentionally diverge from the conventions of the geospatial community.

Also, coordinate arrays are not required for all xarray use cases. The fact is, you can write and read xarray Datasets without coordinates. This is what motivates the labeled array without coordinates proposed in the medical imaging community (nibabel) as a simplified version of the xarray model. We should look to support this use case.

> Without including either a) xarray-compatible coordinate arrays, or b) the assurance of an OME-NGFF backend for xarray, this PR enables lossy deserialization of OME-NGFF images by xarray which could lead to massive confusion

This is not correct. There is not going to be massive confusion if coords are not present. There could be confusion if coords were loaded with the wrong values, but that is not the case here.

> use Dataset.to_zarr(), without noting that this generates an invalid OME-NGFF container.

Dataset.to_zarr is not going to automatically generate a valid OME-NGFF with or without this proposal.

thewtex avatar Oct 31 '22 18:10 thewtex

> Also, coordinate arrays are not required by all use cases in Xarray.

As a frequent xarray user, I'm a little skeptical of this claim. Yes, technically you can have dimensions without coordinates, but in my experience coordinates are the key feature of xarray, and I simply wouldn't use the library if I didn't want them. And I find it very hard to imagine doing anything useful with multiscale images in xarray without coordinates, because there would be no way to relate the different scale levels to one another. So, speaking for myself, if xarray could natively load OME-NGFF multiscale images but not generate coordinates, that would not be terribly useful.
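To illustrate what coordinates buy you across scale levels (all values here are illustrative; assumes xarray and numpy): with physical coordinates attached, the same location can be looked up in any level, which bare pixel indices cannot express.

```python
# Two scale levels of a 1D signal with physical x coordinates attached;
# the downsampled level has a half-pixel offset. Values are illustrative.
import numpy as np
import xarray as xr

s0 = xr.DataArray(np.arange(8.0), dims="x",
                  coords={"x": np.arange(8) * 1.0})        # full resolution
s1 = xr.DataArray(np.arange(4.0), dims="x",
                  coords={"x": np.arange(4) * 2.0 + 0.5})  # downsampled by 2

# Sample both levels at the same physical position:
print(float(s0.sel(x=4.6, method="nearest")))  # 5.0 (nearest coord x=5.0)
print(float(s1.sel(x=4.6, method="nearest")))  # 2.0 (nearest coord x=4.5)
```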

> This proposal supports use of Xarray, including coordinates, correctly.

Can you explain how this proposal handles coordinates correctly? As I understand it, there are only two ways to get xarray-compatible coordinates for OME-NGFF: explicit coordinate arrays, or an OME-NGFF backend for xarray.

d-v-b avatar Nov 01 '22 01:11 d-v-b