ome-zarr-py
Add class annotations and/or other metadata properties to labels
Currently the labels spec supports the declaration of a label-value and its associated color.
Commonly, label values have other associated information including the most obvious, the class name. napari also supports display of label properties, so this would be a nice additional feature for the reader plugin.
I think the critical requirements for these properties should be:
- Supporting an easy mapping between a given property and the label-value/s it is associated with
- Enforcing as few rules as possible on what kinds of properties can be accepted
- Supporting an arbitrary number of properties
There are three ways I can see the spec supporting these additional properties:
- Arbitrary number of lists of max length n for a label image containing n label values, each corresponding to a property. The index in the list corresponds to the integer label-value, e.g.
"image-label": {
"version": "0.1",
"colors": [
{
"label-value": 1,
"rgba": [
255,
100,
100,
255
]
},
{
"label-value": 2,
"rgba": [
0,
40,
200,
255
]
},
{
"label-value": 3,
"rgba": [
148,
50,
165,
255
]
}
],
"properties": [
{
"class": [
"Urban",
"Water",
"Agriculture"
],
"area_m2": [
"400",
"1532",
"590"
]
}
]
}
I think this is the least explicit option, and less intuitive than the next two approaches.
- Declare another group similar to colors, where each label-value has its own associated properties:
{
"multiscales": [
{
"datasets": [
{
"path": "0"
},
{
"path": "1"
},
{
"path": "2"
},
{
"path": "3"
}
],
"version": "0.1"
}
],
"image-label": {
"version": "0.1",
"colors": [
...
],
"properties": [
{
"label-value": 1,
"class": "Urban",
"area_m2": "400"
},
{
"label-value": 2,
"class": "Water",
"area_m2": "1532"
},
{
"label-value": 3,
"class": "Agriculture",
"area_m2": "590"
}
]
}
}
This is explicit, but has the disadvantage of duplicating the label-value definitions.
- Make color another property e.g.
"properties": [
{
"label-value": 1,
"rgba": [
255,
100,
100,
255
],
"class": "Urban",
"area_m2": "400"
},
{
"label-value": 2,
"rgba": [
0,
40,
200,
255
],
"class": "Water",
"area_m2": "1532"
},
{
"label-value": 3,
"rgba": [
148,
50,
165,
255
],
"class": "Agriculture",
"area_m2": "590"
}
]
This doesn't duplicate label-values, and has the benefit of keeping all properties associated with a particular label-value in one spot.
On the implementation side, I think the differences in parsing the properties are negligible.
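As a rough illustration of that point, each of the three layouts can be reduced to the same label-value to properties mapping in a few lines of Python. This is only a sketch: the function names are made up for this example, and for Option 1 it assumes that position i in each list describes label-value i + 1 (the proposal does not say whether index 0 is reserved for background).

```python
# Hypothetical helpers; none of these names come from ome-zarr-py or the spec.

def parse_option1(image_label, first_label=1):
    """Parallel lists: entry i of each list is ASSUMED to describe label first_label + i."""
    result = {}
    for block in image_label.get("properties", []):
        for name, values in block.items():
            for i, value in enumerate(values):
                result.setdefault(first_label + i, {})[name] = value
    return result


def parse_records(image_label, skip=("label-value",)):
    """Options 2 and 3: one record per label; keys listed in `skip` are dropped."""
    return {
        entry["label-value"]: {k: v for k, v in entry.items() if k not in skip}
        for entry in image_label.get("properties", [])
    }


option1 = {"properties": [{"class": ["Urban", "Water"], "area_m2": ["400", "1532"]}]}
option2 = {"properties": [
    {"label-value": 1, "class": "Urban", "area_m2": "400"},
    {"label-value": 2, "class": "Water", "area_m2": "1532"},
]}
option3 = {"properties": [
    {"label-value": 1, "rgba": [255, 100, 100, 255], "class": "Urban", "area_m2": "400"},
    {"label-value": 2, "rgba": [0, 40, 200, 255], "class": "Water", "area_m2": "1532"},
]}

from_1 = parse_option1(option1)
from_2 = parse_records(option2)
# For Option 3 the color is treated as spec metadata and left out of the user properties.
from_3 = parse_records(option3, skip=("label-value", "rgba"))
```

All three calls produce the same dictionary keyed by label-value, so a reader could pick whichever layout the spec settles on without changing what it hands to the viewer.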
I'd love to hear what other people think are appropriate ways to represent the properties in the label metadata, or what they think the best option is.
This issue has been mentioned on Image.sc Forum. There might be relevant details there:
https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/3
Hi @DragaDoncila. Sorry for the slow response. Took some time to get caught up after the call. :wink:
Having this conversation kicked off is great! And I certainly like what you're proposing with 3, but even though the v0.1 proposal hasn't really been officially released, there are a number of repositories that are already implementing it.
A few options I can imagine are:
- We move this discussion to the v0.1 discussion and update it. Option 3 becomes effectively a "breaking" change even though not technically.
- We get v0.1 out, your proposal becomes v0.2 and we then deal with the upgrade process (which is a great thing to work through)
- We add option 2 as a non-breaking change (neither technically nor effectively) and then, when there is a breaking change, we introduce option 3 or something like it.
I should add that I think another similar breaking change may come when tabular data is supported in which case we may move some of this metadata into arrays for dealing with very large numbers of labels.
Option 3 looks the cleanest, but a big disadvantage is that future additions to the spec may use property names that clash with the user-defined ones, unless there is some way to indicate reserved names. In this respect Option 2 seems better despite the duplication of label-value, as all user-defined properties can go under properties without worrying about future conflicts.
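To make the clash concrete, here is a small Python sketch. The reserved names are invented for illustration: only label-value and rgba appear in the examples in this thread, and a future spec reserving class is purely hypothetical.

```python
# Hypothetical: pretend a later spec version reserves "class" in addition to
# the keys used in the examples in this thread.
RESERVED_NOW = {"label-value", "rgba"}
RESERVED_LATER = RESERVED_NOW | {"class"}


def user_keys_option3(entry, reserved):
    """Option 3: user keys are whatever is not reserved, so the result depends on the spec version."""
    return set(entry) - set(reserved)


def user_keys_option2(entry):
    """Option 2: every key except label-value is user-defined, whatever the spec reserves elsewhere."""
    return set(entry) - {"label-value"}


opt3_entry = {"label-value": 1, "rgba": [255, 100, 100, 255], "class": "Urban", "area_m2": "400"}
opt2_entry = {"label-value": 1, "class": "Urban", "area_m2": "400"}

before = user_keys_option3(opt3_entry, RESERVED_NOW)   # "class" counts as a user property
after = user_keys_option3(opt3_entry, RESERVED_LATER)  # "class" is now claimed by the spec
stable = user_keys_option2(opt2_entry)                 # unaffected by new reserved names
```

Under Option 3 the same file is interpreted differently once the reserved set grows, while the Option 2 entry keeps its meaning.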
Option 4 could be a variant of 3 where the user properties are under a dedicated subkey (I can't think of a good name so I've called it extra-properties in the example):
"properties": [
{
"label-value": 1,
"rgba": [
255,
100,
100,
255
],
"extra-properties": {
"class": "Urban",
"area_m2": "400",
"other": [1, 2, 3, 4]
}
},
I think I prefer Option 2!
I don't see a big problem with duplication of the label-value key, and this option also makes it clearer which attributes are spec-defined (e.g. colors) and which are custom properties. No naming conflicts, but without as much nesting as Option 4.
Hi everyone,
Sorry for the late response - I've been finishing my honours thesis over the last few days so it's been packed.
Thanks for all the input! Having read through the suggestions here, I think @manics's concern about clashes with future reserved names is the biggest disadvantage of Option 3. The extra-properties or user-properties subkey would definitely solve this issue but seems less elegant.
Despite initially thinking Option 3 was the way to go, I now actually think I agree with @will-moore that Option 2 seems preferable, as it fully separates spec properties and user defined properties.
@joshmoore how does that mesh with your longer term view of tabular metadata?
I don't think we necessarily need to constrain ourselves to the equivalent of tabular metadata for the properties key; instead we could say it's an array of JSON values. These could be flat key-value dictionaries or arrays if the intention is to convert them to a table, but nested dictionaries could also be allowed.
This issue has been mentioned on Image.sc Forum. There might be relevant details there:
https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/9
I don't think we necessarily need to constrain ourselves to the equivalent of tabular metadata for the properties key,
I was thinking about the reverse. I can see having the JSON keys be "deeper", but what does one do when one wants to add gigabytes of tabular data? It's not required to solve that now but it will come up eventually.
For what it's worth, https://www.w3.org/TR/csv2json/ has some examples. Looks like the method there is a top-level object per row.
All that being said, I can definitely still see option 2 as a first non-breaking change that we iterate on.
cc: @manzt
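For reference, the row-per-object layout mentioned above maps naturally onto the Option 2 records. Here is a minimal sketch using only Python's standard csv module. It assumes flat records (nested values would need their own encoding) and assumes label-value should be read back as an integer; the function names are illustrative only.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Write one CSV row per label record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def csv_to_records(text):
    """Read the CSV back into one JSON-style object per row."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        record = dict(row)
        # CSV has no types, so label-value is converted back by hand (assumption).
        record["label-value"] = int(record["label-value"])
        records.append(record)
    return records


records = [
    {"label-value": 1, "class": "Urban", "area_m2": "400"},
    {"label-value": 2, "class": "Water", "area_m2": "1532"},
    {"label-value": 3, "class": "Agriculture", "area_m2": "590"},
]
text = records_to_csv(records, ["label-value", "class", "area_m2"])
round_trip = csv_to_records(text)
```

The round trip only works this cleanly because every value is a string or an integer label; anything nested would have to be serialized into a cell first.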
Hello, we were also thinking about image regions where objects overlap, see discussion here: https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/7
I am not sure, but maybe this could be tackled by something like:
"properties": [
{
"label-value": 1,
"associated-label-values": [3],
"class": "Urban",
"area_m2": "400"
},
{
"label-value": 2,
"associated-label-values": [3],
"class": "Water",
"area_m2": "1532"
},
{
"label-value": 3,
"child-labels": [2, 1]
}
]
This would mean that label 3 is a region where labels 1 and 2 overlap. It also means that the image region that semantically corresponds to label 1 is actually bigger, namely the union of the regions covered by label 1 and 3.
The "associated-label-values" is redundant with the "child-labels" and maybe should be removed. I added it here because, in practice, it could be good to see at one glance that label 1 alone does not fully cover "object 1" but only when combined with the image region covered by label 3.
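One way to read this layout, sketched in Python: derive, for each object, the set of label values that together cover it, using only child-labels. This assumes an entry with child-labels is purely an overlap region and not an object in its own right; the function name is made up for this example.

```python
def covering_labels(properties):
    """Map each object label to the set of label values whose pixels belong to it."""
    cover = {
        entry["label-value"]: {entry["label-value"]}
        for entry in properties
        if "child-labels" not in entry
    }
    for entry in properties:
        for child in entry.get("child-labels", []):
            # The overlap region also belongs to each of its children.
            cover.setdefault(child, {child}).add(entry["label-value"])
    return cover


properties = [
    {"label-value": 1, "associated-label-values": [3], "class": "Urban"},
    {"label-value": 2, "associated-label-values": [3], "class": "Water"},
    {"label-value": 3, "child-labels": [2, 1]},
]
cover = covering_labels(properties)

# associated-label-values can be recomputed from child-labels alone,
# which is the redundancy described above.
redundant = all(
    set(entry["associated-label-values"]) == cover[entry["label-value"]] - {entry["label-value"]}
    for entry in properties
    if "associated-label-values" in entry
)
```

A viewer could then select the full extent of "object 1" by masking every pixel whose value is in cover[1].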
@constantinpape, do you maybe have comments or suggestions?
@tischi yes, I think this could be a good solution for overlapping labels.
I think this opens up a few more questions that are maybe also relevant for the overall discussion of the label properties:
- If a field is present in one properties, does it need to be in all the others? E.g. do we need class in all elements in the property list?
- If it needs to be present in all of them, then in this case class is not so trivial, because it would be a composite of Water and Urban.
If a field is present in one properties, does it need to be in all the others? E.g. do we need class in all elements in the property list?
I would say that if we go for the above list-based approach, we should not require a field to be present for all labels. If the storage layout were more table-based then, I guess, yes, we would have to.
I think the above list-based approach is nice, as it provides a lot of flexibility in terms of different labels having more or less information attached to them.
The disadvantage that I see compared with a table-based approach is that it will require more storage space and could thus be quite slow to download and parse in order to, e.g., build a table from it.
Thus, for use cases with millions of labels, I am a bit worried about performance.
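To illustrate the trade-off, here is a sketch of turning list-based records with optional fields into table-like columns in Python. Filling gaps with None is an assumption made for this example, not something the proposal specifies, and the function name is hypothetical.

```python
def to_columns(records, fill=None):
    """Convert per-label records into columns, padding missing fields with `fill`."""
    names = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)  # keep first-seen order of field names
    return {name: [record.get(name, fill) for record in records] for name in names}


records = [
    {"label-value": 1, "class": "Urban", "area_m2": "400"},
    {"label-value": 2, "class": "Water"},
    {"label-value": 3},
]
table = to_columns(records)
```

Every record has to be visited before the set of columns is known, which is part of why a very large list would be slower to turn into a table than data stored in columns to begin with.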
I think we'll want both options: JSON-style nested dictionaries for arbitrary properties and support for tabular data. In the short term JSON dictionaries are relatively easy to add to the spec, so it makes sense to start there.
Whew. Ok. So it sounds like we have some points for future discussion, but generally a consensus that we could start building, no? @DragaDoncila, have you already started on a branch anywhere? If not but were looking to start, do you think you have everything you need for a first pass?
@joshmoore I've started a branch, which has Option 1 already implemented. From what I read here, Option 2 is the consensus to start with, before we move on to adding support for tabular data. I think I have everything I need for a first pass, so I'll put up a WIP PR by Monday afternoon if that timeline is okay.
Sounds amazing. Thanks, @DragaDoncila !