Compatibility with xarray

Open thewtex opened this issue 3 years ago • 22 comments

With the aspiration for OME-Zarr to be The One Imaging Format to Rule them All 💍 , I would like to propose compatibility with xarray. For the most part, the needs of the:

  • bioimaging
  • geospatial imaging
  • medical imaging
  • many other scientific imaging domains

overlap. A common, well-supported standard will facilitate integration and cross-pollination across communities, and avoid those I/O headaches 🤯 .

In summary, we could extend the current OME-Zarr spec to be compatible with the result of xarray.Dataset.to_zarr. This would add spatial metadata, addressing #28 and #12, through the xarray encoded coords. It would also use the scientific imaging dimensions standard in OME-Zarr (x, y, z, c, t) as the xarray array dimensions, making their name and order explicit (#35).

Resulting consolidated metadata from idr0094
{
    "metadata": {
        ".zattrs": {
            "multiscales": [
                {
                    "datasets": [
                        {
                            "path": "0/idr0094"
                        },
                        {
                            "path": "1/idr0094"
                        },
                        {
                            "path": "2/idr0094"
                        },
                        {
                            "path": "3/idr0094"
                        },
                        {
                            "path": "4/idr0094"
                        },
                        {
                            "path": "5/idr0094"
                        }
                    ],
                    "name": "idr0094",
                    "version": "0.1"
                }
            ]
        },
        ".zgroup": {
            "zarr_format": 2
        },
        "0/.zattrs": {},
        "0/.zgroup": {
            "zarr_format": 2
        },
        "0/c/.zarray": {
            "chunks": [
                3
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<u4",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                3
            ],
            "zarr_format": 2
        },
        "0/c/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "c"
            ]
        },
        "0/idr0094/.zarray": {
            "chunks": [
                270,
                540,
                2
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                1080,
                1080,
                3
            ],
            "zarr_format": 2
        },
        "0/idr0094/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y",
                "x",
                "c"
            ],
            "direction": [
                [
                    1.0,
                    0.0
                ],
                [
                    0.0,
                    1.0
                ]
            ],
            "ranges": [
                [
                    0.0,
                    255.0
                ],
                [
                    0.0,
                    255.0
                ],
                [
                    0.0,
                    255.0
                ]
            ]
        },
        "0/x/.zarray": {
            "chunks": [
                1080
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                1080
            ],
            "zarr_format": 2
        },
        "0/x/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "x"
            ]
        },
        "0/y/.zarray": {
            "chunks": [
                1080
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                1080
            ],
            "zarr_format": 2
        },
        "0/y/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y"
            ]
        },
        "1/.zattrs": {},
        "1/.zgroup": {
            "zarr_format": 2
        },
        "1/c/.zarray": {
            "chunks": [
                3
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<u4",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                3
            ],
            "zarr_format": 2
        },
        "1/c/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "c"
            ]
        },
        "1/idr0094/.zarray": {
            "chunks": [
                64,
                64,
                64
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                540,
                540,
                3
            ],
            "zarr_format": 2
        },
        "1/idr0094/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y",
                "x",
                "c"
            ],
            "direction": [
                [
                    1.0,
                    0.0
                ],
                [
                    0.0,
                    1.0
                ]
            ],
            "ranges": [
                [
                    0.0,
                    255.0
                ],
                [
                    0.0,
                    255.0
                ],
                [
                    0.0,
                    255.0
                ]
            ]
        },
        "1/x/.zarray": {
            "chunks": [
                540
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                540
            ],
            "zarr_format": 2
        },
        "1/x/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "x"
            ]
        },
        "1/y/.zarray": {
            "chunks": [
                540
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                540
            ],
            "zarr_format": 2
        },
        "1/y/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y"
            ]
        },
        "2/.zattrs": {},
        "2/.zgroup": {
            "zarr_format": 2
        },
        "2/c/.zarray": {
            "chunks": [
                3
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<u4",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                3
            ],
            "zarr_format": 2
        },
        "2/c/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "c"
            ]
        },
        "2/idr0094/.zarray": {
            "chunks": [
                64,
                64,
                64
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                270,
                270,
                3
            ],
            "zarr_format": 2
        },
        "2/idr0094/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y",
                "x",
                "c"
            ],
            "direction": [
                [
                    1.0,
                    0.0
                ],
                [
                    0.0,
                    1.0
                ]
            ],
            "ranges": [
                [
                    0.0,
                    255.0
                ],
                [
                    0.0,
                    255.0
                ],
                [
                    0.0,
                    255.0
                ]
            ]
        },
        "2/x/.zarray": {
            "chunks": [
                270
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                270
            ],
            "zarr_format": 2
        },
        "2/x/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "x"
            ]
        },
        "2/y/.zarray": {
            "chunks": [
                270
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                270
            ],
            "zarr_format": 2
        },
        "2/y/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y"
            ]
        },
        "3/.zattrs": {},
        "3/.zgroup": {
            "zarr_format": 2
        },
        "3/c/.zarray": {
            "chunks": [
                3
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<u4",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                3
            ],
            "zarr_format": 2
        },
        "3/c/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "c"
            ]
        },
        "3/idr0094/.zarray": {
            "chunks": [
                64,
                64,
                64
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                135,
                135,
                3
            ],
            "zarr_format": 2
        },
        "3/idr0094/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y",
                "x",
                "c"
            ],
            "direction": [
                [
                    1.0,
                    0.0
                ],
                [
                    0.0,
                    1.0
                ]
            ],
            "ranges": [
                [
                    0.0,
                    252.0
                ],
                [
                    0.0,
                    252.0
                ],
                [
                    0.0,
                    252.0
                ]
            ]
        },
        "3/x/.zarray": {
            "chunks": [
                135
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                135
            ],
            "zarr_format": 2
        },
        "3/x/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "x"
            ]
        },
        "3/y/.zarray": {
            "chunks": [
                135
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                135
            ],
            "zarr_format": 2
        },
        "3/y/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y"
            ]
        },
        "4/.zattrs": {},
        "4/.zgroup": {
            "zarr_format": 2
        },
        "4/c/.zarray": {
            "chunks": [
                3
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<u4",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                3
            ],
            "zarr_format": 2
        },
        "4/c/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "c"
            ]
        },
        "4/idr0094/.zarray": {
            "chunks": [
                64,
                64,
                64
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                67,
                67,
                3
            ],
            "zarr_format": 2
        },
        "4/idr0094/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y",
                "x",
                "c"
            ],
            "direction": [
                [
                    1.0,
                    0.0
                ],
                [
                    0.0,
                    1.0
                ]
            ],
            "ranges": [
                [
                    0.0,
                    182.0
                ],
                [
                    0.0,
                    182.0
                ],
                [
                    0.0,
                    182.0
                ]
            ]
        },
        "4/x/.zarray": {
            "chunks": [
                67
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                67
            ],
            "zarr_format": 2
        },
        "4/x/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "x"
            ]
        },
        "4/y/.zarray": {
            "chunks": [
                67
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                67
            ],
            "zarr_format": 2
        },
        "4/y/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y"
            ]
        },
        "5/.zattrs": {},
        "5/.zgroup": {
            "zarr_format": 2
        },
        "5/c/.zarray": {
            "chunks": [
                3
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<u4",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                3
            ],
            "zarr_format": 2
        },
        "5/c/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "c"
            ]
        },
        "5/idr0094/.zarray": {
            "chunks": [
                64,
                64,
                64
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                33,
                33,
                3
            ],
            "zarr_format": 2
        },
        "5/idr0094/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y",
                "x",
                "c"
            ],
            "direction": [
                [
                    1.0,
                    0.0
                ],
                [
                    0.0,
                    1.0
                ]
            ],
            "ranges": [
                [
                    0.0,
                    116.0
                ],
                [
                    0.0,
                    116.0
                ],
                [
                    0.0,
                    116.0
                ]
            ]
        },
        "5/x/.zarray": {
            "chunks": [
                33
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                33
            ],
            "zarr_format": 2
        },
        "5/x/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "x"
            ]
        },
        "5/y/.zarray": {
            "chunks": [
                33
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                33
            ],
            "zarr_format": 2
        },
        "5/y/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y"
            ]
        }
    },
    "zarr_consolidated_format": 1
}

Created with this script.

In this example, the array dimensions are y, x, c, i.e. not all 5 dimensions in the current standard, and in a different order. But, these differences could be removed.
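To check the dimension names and shapes stored for each scale, the consolidated metadata can be read with the Python standard library alone. The following is a minimal sketch: the `summarize_scales` helper is hypothetical, and the inline dictionary is a trimmed-down stand-in for the metadata listed above (two scales, most keys omitted).

```python
import json

# Trimmed-down stand-in for the consolidated metadata above (assumption:
# only the keys needed for this sketch are kept, and only two scales).
consolidated_text = json.dumps({
    "metadata": {
        ".zattrs": {
            "multiscales": [{
                "datasets": [{"path": "0/idr0094"}, {"path": "1/idr0094"}],
                "name": "idr0094",
                "version": "0.1",
            }]
        },
        "0/idr0094/.zarray": {"shape": [1080, 1080, 3]},
        "0/idr0094/.zattrs": {"_ARRAY_DIMENSIONS": ["y", "x", "c"]},
        "1/idr0094/.zarray": {"shape": [540, 540, 3]},
        "1/idr0094/.zattrs": {"_ARRAY_DIMENSIONS": ["y", "x", "c"]},
    },
    "zarr_consolidated_format": 1,
})


def summarize_scales(text):
    """Hypothetical helper: map each multiscales dataset path to {dim: size}."""
    meta = json.loads(text)["metadata"]
    summary = {}
    for multiscale in meta[".zattrs"]["multiscales"]:
        for dataset in multiscale["datasets"]:
            path = dataset["path"]
            dims = meta[f"{path}/.zattrs"]["_ARRAY_DIMENSIONS"]
            shape = meta[f"{path}/.zarray"]["shape"]
            # Pair each dimension name with the matching axis length.
            summary[path] = dict(zip(dims, shape))
    return summary


scales = summarize_scales(consolidated_text)
for path, sizes in scales.items():
    print(path, sizes)
```

Because every scale carries its own `_ARRAY_DIMENSIONS`, the name and order of the axes can be read per array without assuming a fixed 5D layout.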

After attempting a few variations on this and putting it into practice, this seems to work well.

Each scale can be used independently. Initially, I tried to avoid the use of coords and use the more concise spatial-dimension rank spacing / scale, origin / translation. However, I found that in an array-based computing environment like scientific Python, where slicing is a bread-and-butter operation, the natural validity of 1D coords that can be sliced is helpful. And, in the development of visualization tools, this is quite handy and avoids on-demand generation. The logic for transforming the spatial metadata is here.
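As a rough illustration of why sliceable 1D coords are convenient, the sketch below builds coords from a spacing and origin, slices them alongside the pixel data, and recovers the spacing and origin of the cropped region. The function names and the numeric values are made up for the example and are not taken from the linked script.

```python
def make_coords(size, spacing, origin):
    """Build 1D coords for a uniformly sampled axis from spacing and origin."""
    return [origin + index * spacing for index in range(size)]


def spacing_and_origin(coords):
    """Recover (spacing, origin) from uniformly spaced 1D coords."""
    origin = coords[0]
    # A single-sample axis has no measurable spacing; 1.0 is an arbitrary default.
    spacing = coords[1] - coords[0] if len(coords) > 1 else 1.0
    return spacing, origin


# Illustrative values only: 8 samples, 0.5 units apart, starting at 10.0.
x = make_coords(size=8, spacing=0.5, origin=10.0)

# Cropping the image along x is the same slice applied to the coords,
# so the spatial metadata of the crop comes along for free.
x_crop = x[2:6]
crop_spacing, crop_origin = spacing_and_origin(x_crop)
print(x_crop, crop_spacing, crop_origin)
```

With the more concise spacing / origin form, the origin of the crop would have to be recomputed by hand after every slice.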

Additionally, there is a growing xarray community, and compatibility helps everyone. Added as an attr is a direction / orientation matrix, which is important in medical imaging.

I am interested in everyone's thoughts. I am grossly behind on GitHub notifications, but I will check in with the discussion on this issue every day or two.

CC @joshmoore @lassoan @rabernat @constantinpape @danielballan @forman

thewtex avatar May 05 '21 02:05 thewtex

I would like to propose compatibility with xarray

Another taker! See https://github.com/ome/ngff/pull/39#issuecomment-802622758 before it became gh-46.

cc: @constantinpape

joshmoore avatar May 05 '21 07:05 joshmoore

Thanks for raising this issue @thewtex. I fully agree that we should strive to be compatible with xarray.

To this end, if I understand it correctly, your first proposal would be to add coords to the multiscales metadata, corresponding to the Coordinate definition in the xarray terminology. In addition you would drop the requirement that image arrays must be 5d.

This is very similar to the changes proposed in #46, except that we call the coordinate labels axes in that proposal. Your input on that PR would be very much appreciated.

Additionally, there is a growing xarray community, and compatibility helps everyone. Added as an attr is a direction / orientation matrix, which is important in medical imaging.

I am not quite sure what you mean by attr here; would this correspond to additional fields in the multiscale metadata? Note that there is a related discussion about how to specify image transformations and orientations in #28.

constantinpape avatar May 05 '21 13:05 constantinpape

To this end, if I understand it correctly, your first proposal would be to add coords to the multiscales metadata, corresponding to the Coordinate definition in the xarray terminology. In addition you would drop the requirement that image arrays must be 5d.

Yes. These are approaches that would facilitate compatibility with a single-scale xarray image dataset sampled on a uniform grid, i.e. the scientific images. We could integrate them piece-by-piece or as a whole. Not listed is how to ensure compatibility with a multi-scale xarray dataset. But, that has not been defined per https://github.com/pydata/xarray/issues/4118 . Ideally, the Xarray community would also be open to compatibility with the NGFF spec with regards to multi-scales.

This is very similar to the changes proposed in #46, except that we call the coordinate labels axes in that proposal. Your input on that PR would be very much appreciated.

Nice work! I added a few thoughts.

I am not quite sure what you mean by attr here; would this correspond to additional fields in the multiscale metadata?

This was referring to xarray.DataArray.attrs which are encoded as zarr attrs. The orientation of an image does not change across scales, so we do not necessarily need an orientation matrix per-scale. But, it is present so each scale can be treated independently. That said, we may want to store it in an array for precision reasons as noted in the transformation discussion.

Note that there is a related discussion about how to specify image transformations and orientations in #28.

Thanks for the note, I added a few comments there. The xarray coords encode pixel spacing and origin / offset. This could also be viewed as a scale and translation.

thewtex avatar May 06 '21 03:05 thewtex

Ideally, the Xarray community would also be open to compatibility with the NGFF spec with regards to multi-scales.

Multi-scale data is definitely a goal for Xarray (see https://github.com/pydata/xarray/issues/4118). In fact, the latest CZI EOSS proposal we are planning to submit to the current funding opportunity focuses precisely on this feature. Your feedback on that issue would be welcome and helpful. Particularly interested in how you define "compatibility with the NGFF spec". What would that mean specifically for xarray, besides just supporting generic image pyramids? Obviously Xarray will be reluctant to add very domain-specific features to the core package. However, it is easy to extend xarray.

rabernat avatar May 06 '21 11:05 rabernat

In fact, the latest CZI EOSS proposal we are planning to submit to the current funding opportunity focuses precisely on this feature.

Awesome! BTW, @danielballan may have a lot to contribute to this based on his experiences.

Your feedback on that issue would be welcome and helpful. Particularly interested in how you define "compatibility with the NGFF spec". What would that mean specifically for xarray, besides just supporting generic image pyramids?

I made a note on that issue.

Obviously Xarray will be reluctant to add very domain-specific features to the core package.

Yes, we want to avoid domain-specific features. On that topic, there is a lot of related discussion in https://github.com/pydata/xarray/issues/1092 regarding netCDF organization. I am not familiar with netCDF, but, of course from the imaging side of things we want to avoid support for unrelated domain-specific features there.

However, it is easy to extend xarray.

Cool! I can see how this could be used to provide a better xarray-based API for imaging.

thewtex avatar May 06 '21 13:05 thewtex

0.3 has added _ARRAY_DIMENSIONS to the metadata. I am closing this, let us know if anything else is necessary for xarray compatibility, @thewtex.

constantinpape avatar Sep 01 '21 07:09 constantinpape

It turns out that we are currently not compatible with xarray, because it cannot deal with having different shapes for the arrays in a group:

group.zarr/
  array1 (shape=(100, 100, 100))
  array2 (shape=(50, 50, 50))

can thus not be opened by xarray; but this is exactly the use-case we need to support multi-scale image pyramids. We have decided to remove _ARRAY_DIMENSIONS for now, since it implies xarray compatibility that is not there. See also https://github.com/ome/ome-zarr-py/issues/166.

@thewtex could you maybe follow up with the xarray folks on this? If we want to be compatible, supporting different shapes is a must.

constantinpape avatar Feb 07 '22 12:02 constantinpape

cc: @aurghs

joshmoore avatar Feb 07 '22 13:02 joshmoore

it cannot deal with having different shapes for the arrays in a group:

Xarray has no problem with as many different shapes of arrays in a group as you want. The only catch is that shared dimensions must always be identical, i.e. you cannot have two arrays foo(y, x) and bar(y, x) where y and x are different. So for now, the easiest way to make these compatible with xarray is to give the two arrays different dimension names, i.e. foo(y1, x1), bar(y2, x2).
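The rule can be sketched in plain Python. This is not xarray's implementation, only an illustration of the constraint described above; `find_dimension_conflicts` is a hypothetical helper, and the shapes are taken from the pyramid example earlier in the thread.

```python
def find_dimension_conflicts(arrays):
    """Return {dim: sorted sizes} for every dimension name used with more than one size.

    `arrays` maps an array name to a (dims, shape) pair.
    """
    sizes_by_dim = {}
    for dims, shape in arrays.values():
        for dim, size in zip(dims, shape):
            sizes_by_dim.setdefault(dim, set()).add(size)
    return {dim: sorted(sizes) for dim, sizes in sizes_by_dim.items() if len(sizes) > 1}


# Two pyramid levels in one group, sharing the names y and x: conflict.
shared_names = {
    "foo": (("y", "x"), (100, 100)),
    "bar": (("y", "x"), (50, 50)),
}

# The same arrays with per-array dimension names: no conflict.
renamed = {
    "foo": (("y1", "x1"), (100, 100)),
    "bar": (("y2", "x2"), (50, 50)),
}

conflicts_shared = find_dimension_conflicts(shared_names)
conflicts_renamed = find_dimension_conflicts(renamed)
print(conflicts_shared)
print(conflicts_renamed)
```

Arrays of different shapes are fine in themselves; only a dimension name that appears with two different sizes in the same group is reported.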

Going forward, the ongoing work on Xarray Datatree should make this more possible. (I still think the arrays would have to live in different groups.)

rabernat avatar Feb 07 '22 13:02 rabernat

you cannot have two arrays foo(y, x) and bar(y, x) where y and x are different

I see.

So for now, the easiest way to make these compatible with xarray is to give the two arrays different dimension names, i.e. foo(y1, x1), bar(y2, x2).

I don't think that this makes sense for our use-case. These axes are identical and arbitrarily renaming them is unnatural.

constantinpape avatar Feb 07 '22 13:02 constantinpape

These axes are identical

Not to be pedantic, but two intervals sampled at different resolution cannot really be considered "identical". i.e. 0, 1, 2, 3 is not identical to 0, 0.5, 1, 1.5, 2, 2.5, 3.5. 😉

I think what you really want is native support for multi-scale data in Xarray. You're correct that that doesn't exist (yet).

rabernat avatar Feb 07 '22 13:02 rabernat

Not to be pedantic, but two intervals sampled at different resolution cannot really be considered "identical". i.e. 0, 1, 2, 3 is not identical to 0, 0.5, 1, 1.5, 2, 2.5, 3.5. 😉

In our case these axes don't describe the data input space, but describe the (common) physical output space. So yes, they are identical (but of course this distinction needs to be specified.)

I think what you really want is native support for multi-scale data in Xarray. You're correct that that doesn't exist (yet).

Yes, without this, xarray support does not make much sense in ngff.

constantinpape avatar Feb 07 '22 13:02 constantinpape

The physical space is indeed shared among all resolutions, but we are rarely interested in such abstract concepts. We just need to define a relationship between these shared global axes (such as world space or global time) and one of the coordinate systems and that's it. All the rest is about transformation between coordinate systems. We care a lot about coordinate systems and their axes, because we need those to interpret the coordinate values, i.e., the values that we store.

In the NGFF file format discussion there already seems to be a consensus that we need unique names for coordinate systems. As a result, we already have a unique name for each coordinate system axis. Therefore, xarray and NGFF seem to be very nicely compatible and capable of specifying multi-resolution images. All you need to do is to include the unique coordinate system names in the axis names. For example:

imgfull(image_x, image_y, image_z)
imghalf(image2x_x, image2x_y, image2x_z)
imgquarter(image4x_x, image4x_y, image4x_z)

Transforms can be defined using the same unique coordinate system names:

transforms=[
  ["image", "image2x", "affine", [0.5,0,0,0, 0,0.5,0,0, 0,0,0.5,0, 0,0,0,1]], 
  ["image", "image4x", "affine", [0.25,0,0,0, 0,0.25,0,0, 0,0,0.25,0, 0,0,0,1]]
]
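To make the transform entries concrete, here is a minimal sketch that applies them to a single point. It assumes the 16 numbers are a row-major 4x4 homogeneous matrix acting on a column vector (x, y, z, 1), and that each entry maps coordinates from the first named system to the second; neither convention is stated above, and `apply_affine` is a hypothetical helper.

```python
def apply_affine(flat16, point):
    """Apply a row-major 4x4 homogeneous matrix (flat list of 16) to a 3D point."""
    rows = [flat16[row * 4:(row + 1) * 4] for row in range(4)]
    homogeneous = (point[0], point[1], point[2], 1.0)
    out = [sum(a * b for a, b in zip(row, homogeneous)) for row in rows]
    w = out[3]  # stays 1.0 for an affine matrix whose last row is 0, 0, 0, 1
    return tuple(value / w for value in out[:3])


transforms = [
    ["image", "image2x", "affine", [0.5, 0, 0, 0, 0, 0.5, 0, 0, 0, 0, 0.5, 0, 0, 0, 0, 1]],
    ["image", "image4x", "affine", [0.25, 0, 0, 0, 0, 0.25, 0, 0, 0, 0, 0.25, 0, 0, 0, 0, 1]],
]

# Illustrative point expressed in the full-resolution "image" coordinate system.
point = (10.0, 20.0, 30.0)
mapped = {target: apply_affine(matrix, point) for _, target, _, matrix in transforms}
print(mapped)
```

Under these assumptions a point in the full-resolution system lands at half and quarter of its coordinate values in the two downsampled systems.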

lassoan avatar Feb 07 '22 14:02 lassoan

In the NGFF file format discussion there already seems to be a consensus that we need unique names for coordinate systems. As a result, we already have a unique name for each coordinate system axis.

This is not quite the state of the current proposal for transformations, please see https://github.com/ome/ngff/issues/94 for details. Just very briefly, the current proposal suggests that each multiscales (= collection of the arrays for different resolutions in the MIP) defines the physical output space and has implicit axis names for the data space.

constantinpape avatar Feb 07 '22 14:02 constantinpape

It turns out that we are currently not compatible with xarray, because it cannot deal with having different shapes for the arrays in a group:

Xarray has no problem with as many different shapes of arrays in a group as you want. The only catch is that shared dimensions must always be identical, i.e. you cannot have two arrays foo(y, x) and bar(y, x) where y and x are different.

In my test xarray ngff implementation, each scale lived in its own nested group. This satisfies xarray's shape/dimension constraints. For ngff, this results in an additional path component:

            "multiscales": [
                {
                    "datasets": [
                        {
                            "path": "0/idr0094"
                        },
                        {
                            "path": "1/idr0094"
                        },
                        {
                            "path": "2/idr0094"
                        },

Note 0/idr0094 vs 0.

We have decided to remove _ARRAY_DIMENSIONS for now, since it implies xarray compatibility that is not there.

This seems reasonable to me. There are additional constraints that need to be satisfied for an ngff dataset to be readable in xarray, and there is duplicated state related to the transforms. I like how the transforms in ngff are moving towards limited duplicated state -- a global transform that applies to all scales, for example.

We can make it possible to re-use the bulk pixel array data in a dataset zarr store for both an ngff metadata model and an xarray metadata model, but the metadata for these models will be different.

thewtex avatar Feb 07 '22 16:02 thewtex

Thanks for the input @thewtex. As far as I understand @rabernat, there is no default way to specify multiscale data in xarray at this point, so I would hope that it will be possible to make that compatible with ngff.

constantinpape avatar Feb 07 '22 17:02 constantinpape

I've just posted an update to https://github.com/zarr-developers/zarr-specs/issues/125 that may be of interest to everyone following this issue. The prototype mentioned there is in https://github.com/aurghs/ome-datatree (@aurghs can perhaps also link to the related notebook that she demo'd today). If the outlined solution seems to make sense, I'd suggest we make this issue a proposal of the form:

Each scale should be stored in a subdirectory.

So that v0.1-v0.4 data of the form:

a.ome.zarr/0/.zarray

becomes

a.ome.zarr/0/.zgroup
a.ome.zarr/0/.zattrs        # with _ARRAY_DIMENSIONS
a.ome.zarr/0/image/.zarray  # "image" is up for discussion ("data"?)

and the multiscales metadata would contain:

  "datasets": [
    {
      "path": "0/image", ...
    }

Older data can continue to be opened (or even upgraded) with specialized backends like https://github.com/aurghs/ome-datatree.
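As a rough sketch of what the metadata side of such an upgrade might look like, the function below rewrites the dataset paths of a multiscales list from the old form ("0") to the proposed one ("0/image"). The function name `upgrade_multiscales` is hypothetical, and treating any path without a "/" as old-style is an assumption made only for this illustration. It touches metadata only; the arrays themselves would still have to be moved into the new subdirectories.

```python
import copy
from typing import Any, Dict, List


def upgrade_multiscales(
    multiscales: List[Dict[str, Any]], array_name: str = "image"
) -> List[Dict[str, Any]]:
    """Return a copy of the metadata with each old-style path nested one level."""
    upgraded = copy.deepcopy(multiscales)  # leave the caller's metadata untouched
    for multiscale in upgraded:
        for dataset in multiscale["datasets"]:
            # Assumption: a path without "/" points directly at an array.
            if "/" not in dataset["path"]:
                dataset["path"] = f"{dataset['path']}/{array_name}"
    return upgraded


old = [{"datasets": [{"path": "0"}, {"path": "1"}], "version": "0.4"}]
new = upgrade_multiscales(old)
print([d["path"] for d in new[0]["datasets"]])
```

The array name is a parameter because, as noted above, "image" is still up for discussion.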

joshmoore avatar Mar 10 '22 00:03 joshmoore

So each dataset path would include 0/image? If so, then the image doesn't have to be specified elsewhere. It becomes a convention rather than a spec?

will-moore avatar Mar 10 '22 08:03 will-moore

Exactly.

joshmoore avatar Mar 10 '22 14:03 joshmoore

The prototype mentioned there is in https://github.com/aurghs/ome-datatree (@aurghs can perhaps also link to the related notebook that she demo'd today).

I have pushed the notebook to the repository, in the notebook folder.

aurghs avatar Mar 11 '22 09:03 aurghs

A good first step would be for xarray to provide a way of opening chunked arrays with user-provided coordinates. Then we can turn NGFF metadata into coordinate DataArrays (i.e. https://github.com/JaneliaSciComp/xarray-ome-ngff/blob/ebcce4876dd9c0ecb3b7a635cea781007e9b24ce/src/xarray_ome_ngff/latest/multiscales.py#L242 ) and strap that onto something like the result of xarray.open_dataarray.

EDIT: This is possibly already available? I'm not sure whether a dask.Array should wrap an xarray.DataArray or vice versa.

clbarnes avatar Jul 05 '23 12:07 clbarnes

EDIT: This is possibly already available? I'm not sure whether a dask.Array should wrap an xarray.DataArray or vice versa.

xarray.DataArray wraps a dask.Array. This is already very well supported in Xarray.

A good first step would be for xarray to provide a way of opening chunked arrays with user-provided coordinates.

It sounds like this could be a good fit for a custom Xarray backend: https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html

A custom backend just returns an xarray.Dataset, which can be constructed however you like. As a starting point for prototyping, I would just write a function that produces an Xarray dataset with the correct coordinates / metadata. I'd recommend reviewing the docs on creating a DataArray and creating a Dataset.
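A possible first piece of such a function is computing the coordinate values themselves. The sketch below assumes the NGFF metadata has already been reduced to a per-axis scale and translation for one resolution level; the function name `axis_coordinates` and its arguments are hypothetical, and it deliberately avoids importing xarray so that it runs on its own.

```python
from typing import Dict, List, Optional, Sequence


def axis_coordinates(
    axes: Sequence[str],
    shape: Sequence[int],
    scale: Sequence[float],
    translation: Optional[Sequence[float]] = None,
) -> Dict[str, List[float]]:
    """Map each axis name to its coordinate values: translation + scale * index."""
    if translation is None:
        translation = [0.0] * len(axes)
    if not (len(axes) == len(shape) == len(scale) == len(translation)):
        raise ValueError("axes, shape, scale and translation must have equal length")
    return {
        name: [offset + step * index for index in range(size)]
        for name, size, step, offset in zip(axes, shape, scale, translation)
    }


coords = axis_coordinates(["y", "x"], [2, 3], [0.5, 0.25], [10.0, 0.0])
print(coords)
```

A dictionary like this could then be passed as `coords`, together with `dims=axes`, when constructing an `xarray.DataArray` around the chunked array.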

Let us know how we can help!

rabernat avatar Jul 05 '23 13:07 rabernat