Collections Specification
What is an image collection?
A collection of images is a semantic grouping of two or more associated ome-ngff images and/or image-labels.
This definition could include
- Images which do not share a physical coordinate space e.g. training dataset of images containing bees
- Images which share a physical coordinate space and whose storage specification must support sufficient metadata to determine this positioning e.g. high-content screening plates and wells
- A hierarchy of image groups of arbitrary depth which may or may not share physical coordinates
- Other things…?
What workflows should it support?
The specification should support implementations being able to traverse the image collection and, where relevant, map the associated metadata to the physical coordinate space for loading these images.
Ideally, the specification should provide sufficient information at each level of a hierarchical grouping to allow for the loading of both the entire collection, and the loading of an arbitrary level of the hierarchy. This can be important when wanting to share/view partial datasets or update only small parts of the entire collection.
Where labels or other related data are provided (e.g. meshes, points…), the specification should support associating any member of the image collection with its labels, regardless of the level in the hierarchy.
The OME-NGFF spec is close to supporting this functionality with the HCS specification which allows the positioning of wells into rows and plates. The main drawbacks of this specification are
- It is too specific to be easily used for images which ARE physically associated but are not HCS acquisitions
- It may be difficult to understand for researchers who are not working with HCS images but nevertheless wish to store their collection in OME-NGFF format
- It does not support an arbitrary depth of groupings
- It does not support collections which are not physically associated
What should it be called?
- Dataset - this term is already used in various places so may not be the best choice
- Collection - a general enough term which is currently mostly unused
- Hierarchical definition - there is a case for this specification being a hierarchy of specifications, with each one defining a more tightly bound collection e.g.
  - Bag - associated images with no metadata
  - Stack - associated images which overlap in physical space
  - Panorama - associated images which stitch together in physical space
Ideally, the names used in the base specification would be general enough to support a broad variety of use cases and tailored use cases could be demonstrated using examples in the documentation.
Reference specifications
- BDV XML Files
- SVG
- TrakEM2
- Napari Plugin for image-label collections
- mobie grid view of many sources
Related
- Image.SC discussion on collections
- Live notes from latest community call
- HCS Specification
What next?
I think we should first decide on whether we want to support arbitrary levels in the hierarchy and whether we want a general spec which we can “inherit” from for more detailed specs, or whether we want one spec to rule them all.
My vote is that we define the most generic collection (a “bag” of images) which works with arbitrary levels of grouping (it’s collections all the way down), and then work to add to it for more complex collections. I will be working on this over the coming week and will post here once I have something working, but of course would love to hear what everyone’s thoughts are on the best way forward.
This issue has been mentioned on Image.sc Forum. There might be relevant details there:
https://forum.image.sc/t/next-call-on-next-gen-bioimaging-data-tools-feb-23/48386/9
@DragaDoncila Thank you very much for the detailed post! It makes a lot of sense and I am looking forward to whatever you come up with! Incidentally we (cc @constantinpape ) were also working on this topic during the past few days.
I also ping @d-v-b
I would like to add a notion and would be curious to hear opinions:
Currently I think that I would prefer storing images within a zarr container without any hierarchy, i.e. just as a flat list. The main reason is simplicity for the reader and writer libraries (the current HCS specification does not follow this). Anything that imposes a hierarchy would be handled by the collections specification, which I think could be seen as metadata that specifies how to display and lay out several images together.
Currently I think that I would prefer storing images within a zarr container without any hierarchy, i.e. just as a flat list.
This means you would like to store images as 2d arrays and volumes as 3d arrays, correct? I created #35 to discuss this.
This means you would like to store images as 2d arrays and volumes as 3d arrays, correct?
I was not gonna enter the 3D vs 5D discussion here, but just wanted to say that I feel that structuring the zarr like this: https://ngff.openmicroscopy.org/latest/#hcs-layout feels overly complex to me.
I was not gonna enter the 3D vs 5D discussion here, but just wanted to say that I feel that structuring the zarr like this: https://ngff.openmicroscopy.org/latest/#hcs-layout feels overly complex to me.
Ok, so your point is to rather have a flat hierarchy of images in the zarr container:
image1/
image2/
image3/
image4/
...
and then define the potential hierarchies in the collections metadata (just a mock-up):
{
"well1": ["image1", "image2"],
"well2": ["image3", "image4"]
}
Yes, exactly.
The way I see it conceptually is that a multi-well plate is a specific layout of a bag of images and, as such, should be covered by our collections specification, which I would currently see as metadata that exists independent of the way we store the raw image data. What do you think?
One feature that the current HCS layout gives us is a URL to a specific Well. So I can open a specific Well like: https://hms-dbmi.github.io/vizarr/v0.1?source=https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/plates/2551.zarr/A/1 Demo movie at https://twitter.com/will_j_moore/status/1322187662762090497
I guess you could try to use a URL ?query or a #fragment to refer to a Well or other subgroup. E.g. path/to/plate.zarr/#A1
a URL to a specific Well
I see that this is cool, but I am afraid that (i) these hierarchies make it harder to parse an ome.zarr and (ii) it is not flexible; for example, I guess I cannot produce a single URL to show me all the images that were subjected to the same biological treatment (which may be several wells).
The way I see it conceptually is that a multi-well plate is a specific layout of a bag of images and, as such, should be covered by our collections specification, which I would currently see as metadata that exists independent of the way we store the raw image data.
I completely agree that the metadata and storage should be independent, because I think this also provides the opportunity to support a wider range of custom metadata. For example this:
{
"well1": ["image1", "image2"],
"well2": ["image3", "image4"]
}
could easily be this (for some geographical feature learning model):
{
"lakes": ["image1", "image2"],
"mountains": ["image3", "image4"]
}
I guess that's what I was thinking of when I said
tailored use cases could be demonstrated using examples in the documentation.
I like the idea of a flat set of images with the hierarchy determined entirely by the metadata. That certainly seems the easiest way to support an arbitrary level of hierarchy without ending up with a very complex storage structure.
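As a minimal sketch of that idea (not part of any spec), the Python below keeps the images in a flat list and resolves a group purely from metadata. The function name images_in_group and the in-memory lists are assumptions made for illustration only.

```python
# Flat container: every image sits at the top level (names are illustrative).
image_paths = ["image1", "image2", "image3", "image4"]

# The grouping lives purely in metadata, as in the mock-ups above.
groups = {
    "lakes": ["image1", "image2"],
    "mountains": ["image3", "image4"],
}


def images_in_group(groups, name, available):
    """Return the image paths of one group, checking each against the flat list."""
    members = groups[name]
    missing = [path for path in members if path not in available]
    if missing:
        raise KeyError(f"group {name!r} refers to unknown images: {missing}")
    return list(members)


print(images_in_group(groups, "lakes", image_paths))
```

Swapping "lakes"/"mountains" for "well1"/"well2" changes nothing in the reader code, which is the appeal of keeping storage and grouping separate.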
https://github.com/ome/ngff/issues/31#issuecomment-787820475 Currently I think that I would prefer storing images within a zarr container without any hierarchy, i.e. just as a flat list.
Is this a MAY or a MUST? And what happens when/if someone does make use of the folder structure available in Zarr/N5/HDF5?
Re @tischi "flexibility and biological treatment" - I'm wondering whether there must be a single 'hierarchy' in the container, or whether we can have multiple. E.g.
{
"well1": ["image1", "image2"],
"well2": ["image3", "image4"]
}
And:
{
"acquisition1": ["image1", "image3"],
"acquisition2": ["image2", "image4"]
}
or
{
"drug1": ["image1", "image3"],
"drug2": ["image2", "image4"]
}
Those are all different ways of grouping the images. But if you have:
{
"well1": ["image1", "image2"],
"well2": ["image3", "image4"],
"drug1": ["image1", "image3"],
"drug2": ["image2", "image4"]
}
how do you know which groups are mutually exclusive, e.g. which ones are Wells vs Treatments?
Having multiple hierarchies might provide more flexibility, but this makes it harder to understand how to view the data.
Instead, it might make more sense to only have a single hierarchy (like a file-system) and then add other metadata in other ways?
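To make the ambiguity concrete, here is a small Python sketch (the helper name overlapping_pairs is hypothetical) that finds which groups in the mixed dictionary share images. It can detect the overlap, but nothing in the data says which keys are Wells and which are Treatments.

```python
from itertools import combinations

# The mixed grouping from the comment above.
groups = {
    "well1": ["image1", "image2"],
    "well2": ["image3", "image4"],
    "drug1": ["image1", "image3"],
    "drug2": ["image2", "image4"],
}


def overlapping_pairs(groups):
    """Return sorted pairs of group names that share at least one image."""
    return sorted(
        (a, b)
        for a, b in combinations(sorted(groups), 2)
        if set(groups[a]) & set(groups[b])
    )


for pair in overlapping_pairs(groups):
    print(pair)
```

Every well overlaps every drug, while the wells do not overlap each other and neither do the drugs. A reader can therefore infer two independent partitions, but only by computing it, which supports keeping each way of grouping in its own named structure.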
@will-moore
My current idea would be to have no hierarchy on the data storage level, but provide the possibility to specify different "views" on the data on the metadata level. Something along these lines:
views:
{
"well_based": {...},
"treatment_based":{...}
}
default_view: "well_based"
Does that make sense to you?
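A reader for such metadata might look like the Python sketch below. The contents of each view are elided in the mock-up above, so the group dictionaries used here are invented for illustration, and select_view is a hypothetical helper rather than an existing API.

```python
# Hypothetical metadata following the "views" mock-up; the view contents
# are assumptions, since the mock-up leaves them as {...}.
metadata = {
    "views": {
        "well_based": {
            "well1": ["image1", "image2"],
            "well2": ["image3", "image4"],
        },
        "treatment_based": {
            "drug1": ["image1", "image3"],
            "drug2": ["image2", "image4"],
        },
    },
    "default_view": "well_based",
}


def select_view(metadata, name=None):
    """Return the named view, falling back to default_view when name is None."""
    views = metadata["views"]
    chosen = name if name is not None else metadata.get("default_view")
    if chosen not in views:
        raise KeyError(f"unknown view: {chosen!r}")
    return views[chosen]


print(sorted(select_view(metadata)))
print(sorted(select_view(metadata, "treatment_based")))
```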
Is this a MAY or a MUST? And what happens when/if someone does make use of the folder structure available in Zarr/N5/HDF5?
Personally, I'd be for a MUST, i.e. not support hierarchies and then ignore anything stored at deeper levels. But, obviously, that's just my personal opinion. Very curious to hear other opinions!
Personally, I'd be for a MUST, i.e. not support hierarchies and then ignore anything stored at deeper levels. But, obviously, that's just my personal opinion. Very curious to hear other opinions!
I think I am not such a big fan of the MUST here. There are some use cases where hierarchies make a lot of sense to keep the data ordered. As a simple example: I have segmentations computed with two different algorithms and several hyperparameter sets for each algorithm, and I want to store them in the same container to compare them with some viewer that can ingest it. For this use case having
algorithm1/
    parameter_set1
    parameter_set2
    parameter_set3
algorithm2/
    parameter_set1
    parameter_set2
is a more natural (and easier to navigate) way of storing this than
algorithm1_parameter_set1
algorithm1_parameter_set2
algorithm1_parameter_set3
algorithm2_parameter_set1
algorithm2_parameter_set2
There are some use cases where hierarchies make a lot of sense to keep the data ordered
OK, fair enough :)
I guess the question is whether, in practice, one would navigate the data via the "views" or via the folder structure. If one changes one's mind at some point about the folder structure, this could be quite expensive in terms of reordering all the data (at least that's how I understood how the object stores work), while it would be very cheap to just replace the views, wouldn't it?
I guess the question is whether, in practice, one would navigate the data via the "views" or via the folder structure. If one changes one's mind at some point about the folder structure, this could be quite expensive in terms of reordering all the data (at least that's how I understood how the object stores work), while it would be very cheap to just replace the views
Sure, reordering the folder structure is not such a good idea but also not necessary because we can have multiple views for the same data. But having a hierarchical folder structure does not change anything about the views except that there will be some \ in the data names.
But having a hierarchical folder structure does not change anything about the views except that there will be some \ in the data names.
Yes, that is true. I guess I'd be fine with a MAY, but should we then maybe "strongly encourage" that there is a default_view specified that one could go to in order to efficiently find out what's in the dataset, without having to go through the whole "folder structure"? (I am also thinking about our experience that things like cd and ls sometimes are super slow on object stores).
I'm also a little hazy on object stores, but my impression is that all the 'paths' within a bucket are really just 'keys'. So I imagine they could be changed without moving the data on disk.
So, as @constantinpape said, I'm not sure there's really much difference between algorithm1/parameter_set1 and algorithm1_parameter_set1. I don't think you can browse to algorithm1/ on an object store.
This is why you need to specify all the paths to child objects in the group metadata.
However, since you won't always be working with object stores, allowing algorithm1/parameter_set1 could let you browse the data via algorithm1/ elsewhere, so I think this could be helpful. It is also conceptually helpful to tokenise the path in this way. So I think we should allow / in the path names.
E.g. this could be valid:
{
"lakes": ["day1/image1", "image2"],
"mountains": ["image3", "day2/image4"]
}
Any reason not to allow this?
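The point about keys can be sketched in Python with a plain list standing in for an object store (an assumption for illustration; no real object-store client is used, and the key names are made up). A "directory listing" is emulated by scanning every key for a prefix, which is why listing child paths explicitly in the group metadata is cheaper than browsing.

```python
# Flat keys as an object store would hold them; "/" is just a character in the key.
keys = [
    "day1/image1/.zattrs",
    "image2/.zattrs",
    "image3/.zattrs",
    "day2/image4/.zattrs",
]


def children(keys, prefix=""):
    """Emulate a directory listing by taking the next path segment after prefix."""
    found = set()
    for key in keys:  # note: touches every key in the store
        if key.startswith(prefix):
            rest = key[len(prefix):]
            found.add(rest.split("/", 1)[0])
    return sorted(found)


print(children(keys))
print(children(keys, "day1/"))
```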
@will-moore I think that's fine and a very good point. On a file system there are some benefits to this and on an object store there are no disadvantages.
OK, so it looks like there's enough consensus here to start on something a bit more concrete. Aiming for something that is just a list of images in its simplest form, but can include more metadata without a breaking change.
Option 1
A path/to/collection/ directory would include a .zattrs file that defines a "collection" because it MUST include the collection key, which MUST contain an images list:
{
  "collection": {
    "images": [
      {"path": "image1"},
      {"path": "dir_1/image2"}
    ]
  }
}
Each path is a path/to/directory containing an OME-Zarr image.
Each item in the images list MUST have a path, but MAY also have other attributes (TBD: e.g. id, name, timestamp, etc. Maybe even 'row': 0, 'column': 1, for a grid layout). We should probably not allow any user-chosen key-value data here, since that could lead to breaking changes if we add keys to the spec. So maybe a properties: {} for user-defined metadata.
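The two MUST rules of Option 1 could be checked with something like the Python sketch below. It operates on an already-parsed .zattrs dictionary; validate_collection and its messages are hypothetical and only cover the rules stated above.

```python
def validate_collection(attrs):
    """Return a list of problems found in Option 1 style metadata (empty if none)."""
    collection = attrs.get("collection")
    if not isinstance(collection, dict):
        return ["missing 'collection' key"]
    images = collection.get("images")
    if not isinstance(images, list):
        return ["'collection' must contain an 'images' list"]
    problems = []
    for index, item in enumerate(images):
        # Each entry MUST carry a path; other keys are left unchecked here.
        path = item.get("path") if isinstance(item, dict) else None
        if not isinstance(path, str):
            problems.append(f"images[{index}] has no 'path'")
    return problems


attrs = {
    "collection": {
        "images": [
            {"path": "image1"},
            {"path": "dir_1/image2"},
        ]
    }
}
print(validate_collection(attrs))
```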
Option 2
An alternative is to use the "path" as the ID/key of each image. Any reason not to do this? (For labels-metadata we decided not to use an ID as key because the ID was a number, which is not a valid key in JSON.) This also guards against two identical 'path' values, which would be possible in Option 1.
{
  "collection": {
    "images": {
      "image1": {}, # empty if we don't have any other info
      "dir_1/image2": {"row": 0, "column": 1},
      "dir_2/image3": {"properties": {"rating": 5}}
    }
  }
}
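For comparison, an Option 1 list can be turned into the Option 2 dictionary mechanically. The Python sketch below (to_keyed_images is a hypothetical name) also shows where a duplicated path surfaces during the conversion.

```python
def to_keyed_images(images_list):
    """Convert Option 1 entries ({"path": ..., ...}) into an Option 2 dict keyed by path."""
    keyed = {}
    for item in images_list:
        path = item["path"]
        if path in keyed:
            # Option 1 allows this silently; a dict keyed by path cannot represent it.
            raise ValueError(f"duplicate path: {path!r}")
        keyed[path] = {key: value for key, value in item.items() if key != "path"}
    return keyed


option_1 = [
    {"path": "image1"},
    {"path": "dir_1/image2", "row": 0, "column": 1},
]
print(to_keyed_images(option_1))
```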
Other optional metadata
Within the collection, alongside images, we could imagine other metadata such as layout:
"layout": {
"type": "grid", # or 'auto-grid'
"rows": [
{"name": "A"}, {"name": "B"}, {"name": "C"}
],
"columns": [
{"name": "1"}, {"name": "2"}, {"name": "3"}
],
},
and groupings. I guess we could use the path as the identifier of each image.
"groups": {
"lakes": ["image1", "dir_1/image2"],
"mountains": ["dir_2/image3"]
}
Which could mean that the "images" list/dict above is not needed (if we don't have any other metadata, and every image is in a group)? BUT it simplifies the spec to say that images MUST exist, and it's not hard to always add it.
So, is everyone happy with Option 1 or Option 2? Or would like to suggest improvements to whichever is their favourite? The other metadata can be decided later, but any suggestions welcome.
https://github.com/ome/ngff/issues/31#issuecomment-791273590 So I imagine they could be changed without moving the data on disk.
No. To move objects around in object storage is always a copy/delete operation.
https://github.com/ome/ngff/issues/31#issuecomment-791033216 I guess the question is whether, in practice, one would navigate the data via the "views" or via the folder structure.
It sounds like we're struggling with the semantics of having one "hard-coded" hierarchy beside the additional collections. In the ome-zarr-py implementation (and we could work to formalize this), there's a generator pattern. You start at the group you're given and then ask for what it "points to" and then process that. You will always start from a single group, so perhaps we're saying that you will only use the metadata of the given group for the objects that are generated.
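The generator idea can be illustrated with a toy Python sketch. This is not the ome-zarr-py API: the store is a plain dictionary mapping group paths to their .zattrs content, and walk is a hypothetical function that starts at one group and follows whatever its collection metadata points to.

```python
# Toy store: group path -> parsed .zattrs (an assumption for illustration).
store = {
    "": {"collection": {"images": {"image1": {}, "dir_1/image2": {}}}},
    "image1": {"multiscales": []},
    "dir_1/image2": {"multiscales": []},
}


def walk(store, path=""):
    """Yield (path, attrs) for a group, then for everything its collection points to."""
    attrs = store[path]
    yield path, attrs
    for child in attrs.get("collection", {}).get("images", {}):
        child_path = f"{path}/{child}" if path else child
        if child_path in store:
            # A child may itself be a collection, so recurse.
            yield from walk(store, child_path)


for group_path, _ in walk(store):
    print(repr(group_path))
```

Only the metadata of the starting group decides what is generated next, which matches the suggestion that a reader always begins from a single group.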
It sounds like we're struggling with the semantics of having one "hard-coded" hierarchy beside the additional collections.
@joshmoore I agree. Maybe, for simplicity, we could restrict this issue to discussion of the additional collections, and make an extra issue about the "hard-coded" hierarchy?
A path/to/collection/ directory would include a .zattrs file
@will-moore Do I get it right that currently one such .zattrs file would contain only one collection? Meaning that to specify multiple additional collections we would need several path/to/collection/.zattrs? I guess that's fine, but then I guess somewhere there should be information on how to find them?
So, is everyone happy with Option 1 or Option 2?
Option 2 looks more concise, so maybe slight preference for that one.
In terms of the layout, instead of specifying row and column, I think specifying a translation in physical coordinates may also be an option.
{
  "collection": {
    "images": {
      "image1": {"translate": [0,0,0], "name": "A"},
      "dir_1/image2": {"translate": [10,0,0], "name": "B"},
      "dir_2/image3": {"translate": [20,0,0], "name": "C"}
    }
  }
}
I also prefer option2. And as @tischi brought up I think it's important to think about how to map different collections for the same data (or subsets of it), either in the same .zattrs or distributed into different ones in some defined pattern.
I was thinking the multiple groups above were different groupings of images in a collection. But I guess that's not enough, e.g. if each image has a different translate or other property in each collection.
So the simplest way to support multiple collections is to make the dict -> list, and plural:
# .zattrs
{
  "collections": [
    {
      "name": "first collection",
      "images": {
        "image1": {"translate": [0,0,0], "name": "A"},
        "dir_1/image2": {"translate": [10,0,0], "name": "B"},
        "dir_2/image3": {"translate": [20,0,0], "name": "C"}
      }
    },
    {
      "images": {
        "dir_2/image3": {},
        "dir_2/image4": {}
      }
    }
  ]
}
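With a list of collections, the same image can appear more than once with different properties. The Python sketch below (collections_containing is a hypothetical helper, and the fallback label for unnamed collections is an assumption) looks up every collection that includes a given image path.

```python
attrs = {
    "collections": [
        {
            "name": "first collection",
            "images": {
                "image1": {"translate": [0, 0, 0], "name": "A"},
                "dir_1/image2": {"translate": [10, 0, 0], "name": "B"},
                "dir_2/image3": {"translate": [20, 0, 0], "name": "C"},
            },
        },
        {
            "images": {
                "dir_2/image3": {},
                "dir_2/image4": {},
            }
        },
    ]
}


def collections_containing(attrs, image_path):
    """Return a label for each collection that lists image_path."""
    return [
        collection.get("name", f"collection {index}")  # unnamed ones get an index label
        for index, collection in enumerate(attrs["collections"])
        if image_path in collection["images"]
    ]


print(collections_containing(attrs, "dir_2/image3"))
```

The second collection has no name, so some other identifier would be needed if collections are to be referenced reliably.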
Do I get it right that currently one such .zattrs file would contain only one collection? Meaning that to specify multiple additional collection we would need several path/to/collection/ .zattrs? I guess that's fine, but then I guess somewhere there should be information how to find them?
I think if we want to easily support images being opened both as part of their collection and on their own then it would make sense to have each image as its own well-formed ome-zarr, including a .zattrs file? It would mean either duplication of some metadata, or a top level .zattrs which only contains the necessary information for traversing the collection i.e. the snippet @will-moore posted just above
Yes, in the examples I've posted, there would be a full OME-Zarr in each of the images paths. E.g. dir_1/image2/ would contain .zattrs etc.
Just to clarify one issue that will become more important with Zarr V3, each of those paths contains an OME-Zarr image. In the future, there will be a root file which will define the entire OME-Zarr fileset.
I'm going to try and see how we might migrate the metadata in the current HCS spec into this collections spec.
This is based on the current HCS spec: https://ngff.openmicroscopy.org/latest/#plate-md
and also includes proposed changes from https://github.com/ome/ngff/pull/24/ (adds row_index and column_index).
Here is a 6-well Plate, with 2 acquisitions (only 1 image in the 2nd acquisition).
The only custom info that is not generic to the collections spec is the acquisition starttime which is therefore in a user-defined properties dictionary. We should probably specify that any custom attributes should go in a properties dictionary to avoid clashing with future spec key-words.
cc @melissalinkert
Anyone else want to try using this spec proposal with a sample of their current data needs, to see if it works for you and suggest any other key-words that we might need?
# .zattrs
{
"collections": [
{
"name": "HCS plate 01",
"images": {
"2020-10-10/A/1": {
"row_index": 0,
"column_index": 0
},
"2020-10-10_run2/A/1": {
"row_index": 0,
"column_index": 0
},
"2020-10-10/A/2": {
"row_index": 0,
"column_index": 1
},
"2020-10-10/A/3": {
"row_index": 0,
"column_index": 2
},
"2020-10-10/B/1": {
"row_index": 1,
"column_index": 0
},
"2020-10-10/B/2": {
"row_index": 1,
"column_index": 1
},
"2020-10-10/B/3": {
"row_index": 1,
"column_index": 2
}
},
"layout": {
"type": "grid",
"rows": [
{"name": "A"},
{"name": "B"},
{"name": "C"}
],
"columns": [
{"name": "1"},
{"name": "2"},
{"name": "3"}
]
},
"groups": {
"acquisition_1": {
"name": "Meas_01(2012-07-31_10-41-12)",
"properties": {
// custom user data can go here
},
"id": 1,
"starttime": 1343731272000,
"maximumfieldcount": 1,
"images": [
"2020-10-10/A/1",
"2020-10-10/A/2",
"2020-10-10/A/3",
"2020-10-10/B/1",
"2020-10-10/B/2",
"2020-10-10/B/3"
]
},
"acquisition_2": {
"name": "Meas_02(201207-31_11-56-41)",
"properties": {},
"id": 2,
"starttime": 1343735801000,
"maximumfieldcount": 1,
"images": [
"2020-10-10_run2/A/1"
]
}
}
}
]
}
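As a rough check that a viewer could rebuild the plate from this proposal, the Python sketch below places each image into a grid cell using row_index, column_index and the layout names. The dictionary is a trimmed copy of the example above, and build_grid is a hypothetical helper, not part of ome-zarr-py or vizarr.

```python
collection = {
    "name": "HCS plate 01",
    "images": {
        "2020-10-10/A/1": {"row_index": 0, "column_index": 0},
        "2020-10-10_run2/A/1": {"row_index": 0, "column_index": 0},
        "2020-10-10/B/3": {"row_index": 1, "column_index": 2},
    },
    "layout": {
        "type": "grid",
        "rows": [{"name": "A"}, {"name": "B"}, {"name": "C"}],
        "columns": [{"name": "1"}, {"name": "2"}, {"name": "3"}],
    },
}


def build_grid(collection):
    """Map (row name, column name) to the list of image paths placed in that cell."""
    rows = [row["name"] for row in collection["layout"]["rows"]]
    columns = [column["name"] for column in collection["layout"]["columns"]]
    grid = {(row, column): [] for row in rows for column in columns}
    for path, meta in collection["images"].items():
        cell = (rows[meta["row_index"]], columns[meta["column_index"]])
        grid[cell].append(path)
    return grid


grid = build_grid(collection)
print(grid[("A", "1")])
```

Cell A/1 ends up holding one image per acquisition, so a viewer would still need the groups metadata to decide which acquisition to show.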
As suggested by @sbesson I checked for any info that can be stored in the current HCS spec but wasn't in the example above. I only found id and maximumfieldcount for the acquisitions. Initially I felt these could go under each group in the user-defined properties. But that's really just a place for stuff that's not part of the spec, and since these are part of the current spec, I moved them up to be attributes of each group object. But, if it's felt we want to reduce the scope of the collections spec then they could go under properties.
Since there haven't been any objections to the latest proposal, can I assume that everyone is happy with it?
If so, my next steps would be to convert an existing HCS OME-Zarr to the format above (which should be possible without moving any images on disk) and then to look at updating the viewing of that data in ome-zarr-py (for napari) or in vizarr (cc @manzt) to see if there are any blockers to the current features.
Sorry if I missed it @will-moore: how are sites (aka well sub-positions) treated with your proposal?