ome-zarr-py

Add class annotations and/or other metadata properties to labels

Open DragaDoncila opened this issue 4 years ago • 17 comments

Currently the labels spec supports the declaration of a label-value and its associated color.

Label values commonly have other associated information, the most obvious being the class name. napari also supports displaying label properties, so this would be a nice additional feature for the reader plugin.

I think the critical requirements for these properties should be:

  • Supporting an easy mapping between a given property and the label-value/s it is associated with
  • Enforcing as few rules as possible on what kinds of properties can be accepted
  • Supporting an arbitrary number of properties

There are three ways I can see the spec supporting these additional properties:

  1. An arbitrary number of lists, each of maximum length n for a label image containing n label values, and each corresponding to one property. The index in the list corresponds to the integer label-value, e.g.
    "image-label": {
        "version": "0.1",
        "colors": [
            {
                "label-value": 1,
                "rgba": [
                    255,
                    100,
                    100,
                    255
                ]
            },
            {
                "label-value": 2,
                "rgba": [
                    0,
                    40,
                    200,
                    255
                ]
            },
            {
                "label-value": 3,
                "rgba": [
                    148,
                    50,
                    165,
                    255
                ]
            }
        ],
        "properties": [
            {
                "class": [
                    "Urban",
                    "Water",
                    "Agriculture"
                ],
                "area_m2":
                [
                    "400",
                    "1532",
                    "590"
                ]
            }
        ]
    }

I think this is the least explicit option, and less intuitive than the next two approaches.
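As a rough sketch of what reading Option 1 could look like: the helper `lookup_option1` below is hypothetical (not part of ome-zarr-py), and it assumes that position 0 of each list belongs to label-value 1, which the proposal above does not spell out.

```python
# Hypothetical metadata following Option 1 (values copied from the example above).
image_label = {
    "properties": [
        {
            "class": ["Urban", "Water", "Agriculture"],
            "area_m2": ["400", "1532", "590"],
        }
    ]
}


def lookup_option1(image_label, label_value, first_label=1):
    """Collect every property of one label from the parallel lists.

    Assumption: position 0 in each list belongs to `first_label`.
    """
    index = label_value - first_label
    result = {}
    for block in image_label.get("properties", []):
        for name, values in block.items():
            # Lists may be shorter than the number of labels; skip gaps.
            if 0 <= index < len(values):
                result[name] = values[index]
    return result


print(lookup_option1(image_label, 2))
```

The implicit index-to-label mapping is exactly what makes this layout less explicit: the reader has to know the offset convention to interpret the lists.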

  2. Declare another group similar to colors, where each label-value has its own associated properties:
{
    "multiscales": [
        {
            "datasets": [
                {
                    "path": "0"
                },
                {
                    "path": "1"
                },
                {
                    "path": "2"
                },
                {
                    "path": "3"
                }
            ],
            "version": "0.1"
        }
    ],
    "image-label": {
        "version": "0.1",
        "colors": [
               ...
        ],
        "properties": [
            {
                "label-value": 1,
                "class": "Urban",
                "area_m2": "400"

            },
            {
                "label-value": 2,
                "class": "Water",
                "area_m2": "1532"
            },
            {
                "label-value": 3,
                "class": "Agriculture",
                "area_m2": "590"

            }
        ]
    }
}

This is explicit, but has the disadvantage of duplicating the label-value definitions.
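A minimal sketch of how a reader could handle Option 2, assuming both groups are lists of objects keyed by `label-value`. The helper `index_by_label` is hypothetical, and the colors are borrowed from the Option 1 example since they are elided here.

```python
# Hypothetical "image-label" metadata following Option 2.
image_label = {
    "colors": [
        {"label-value": 1, "rgba": [255, 100, 100, 255]},
        {"label-value": 2, "rgba": [0, 40, 200, 255]},
    ],
    "properties": [
        {"label-value": 1, "class": "Urban", "area_m2": "400"},
        {"label-value": 2, "class": "Water", "area_m2": "1532"},
    ],
}


def index_by_label(entries):
    """Turn a list of per-label objects into {label-value: remaining keys}."""
    return {
        entry["label-value"]: {k: v for k, v in entry.items() if k != "label-value"}
        for entry in entries
    }


colors = index_by_label(image_label["colors"])
properties = index_by_label(image_label["properties"])

# Spec-defined and user-defined data stay in separate lookups.
print(colors[2], properties[2])
```

The duplicated `label-value` key is the price for keeping the two lookups independent of each other.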

  3. Make color another property, e.g.
        "properties": [
            {
                "label-value": 1,
                "rgba": [
                    255,
                    100,
                    100,
                    255
                ],
                "class": "Urban",
                "area_m2": "400"

            },
            {
                "label-value": 2,
                "rgba": [
                    0,
                    40,
                    200,
                    255
                ],
                "class": "Water",
                "area_m2": "1532"
            },
            {
                "label-value": 3,
                "rgba": [
                    148,
                    50,
                    165,
                    255
                ],
                "class": "Agriculture",
                "area_m2": "590"

            }
        ]

This doesn't duplicate label-values, and has the benefit of keeping all properties associated with a particular label-value in one spot.

On the implementation side, I think the differences in parsing the properties are negligible.
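To illustrate, here is a sketch of reading an Option 3 entry. It assumes that `label-value` and `rgba` are the only spec-defined keys; the names `SPEC_KEYS` and `split_entry` are made up for this example.

```python
# Assumption: these are the only keys the spec itself defines.
SPEC_KEYS = {"label-value", "rgba"}


def split_entry(entry):
    """Separate spec-defined keys from user-defined ones in an Option 3 entry."""
    spec = {k: v for k, v in entry.items() if k in SPEC_KEYS}
    user = {k: v for k, v in entry.items() if k not in SPEC_KEYS}
    return spec, user


entry = {
    "label-value": 1,
    "rgba": [255, 100, 100, 255],
    "class": "Urban",
    "area_m2": "400",
}
spec, user = split_entry(entry)
print(spec)
print(user)
```

Compared with Option 2, the only extra step is filtering on the set of known keys.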

I'd love to hear what other people think are appropriate ways to represent the properties in the label metadata, or what they think the best option is.

DragaDoncila avatar Oct 31 '20 08:10 DragaDoncila

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/3

imagesc-bot avatar Nov 03 '20 07:11 imagesc-bot

Hi @DragaDoncila. Sorry for the slow response. Took some time to get caught up after the call. :wink:

Having this conversation kicked off is great! And I certainly like what you're proposing with 3, but even though the v0.1 proposal hasn't really been officially released, there are a number of repositories that are already implementing it.

A few options I can imagine are:

  • We move this discussion to the v0.1 discussion and update it. Option 3 becomes effectively a "breaking" change even though not technically.
  • We get v0.1 out, your proposal becomes v0.2 and we then deal with the upgrade process (which is a great thing to work through)
  • We add option 2 as a non-breaking (neither technically nor effectively) and then when there is a breaking change, we introduce option 3 or something like it.

I should add that I think another similar breaking change may come when tabular data is supported, in which case we may move some of this metadata into arrays to deal with very large numbers of labels.

joshmoore avatar Nov 03 '20 10:11 joshmoore

Option 3 looks the cleanest, but a big disadvantage is that future additions to the spec may use property names that clash with user-defined ones, unless there is some way to indicate reserved names. In this respect Option 2 seems better despite the duplication of label-value, as all user-defined properties can go under properties without worrying about future conflicts.
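A small sketch of the clash scenario under Option 3. The "newly reserved" name `class` is purely hypothetical (no such spec change exists), and `find_clashes` is an illustrative helper.

```python
def find_clashes(entries, known_keys, newly_reserved):
    """Return user-defined keys that a later spec version would claim."""
    clashes = set()
    for entry in entries:
        user_keys = set(entry) - set(known_keys)
        clashes |= user_keys & set(newly_reserved)
    return sorted(clashes)


entries = [
    {"label-value": 1, "rgba": [255, 100, 100, 255], "class": "Urban", "area_m2": "400"},
]

# Hypothetical: a future spec version decides to reserve "class".
print(find_clashes(entries, {"label-value", "rgba"}, {"class"}))
```

With Option 2 the check is unnecessary, because everything under properties is user-defined by construction.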

manics avatar Nov 03 '20 11:11 manics

Option 4 could be a variant of 3 where the user properties are under a dedicated subkey (I can't think of a good name so I've called it extra-properties in the example):

        "properties": [
            {
                "label-value": 1,
                "rgba": [
                    255,
                    100,
                    100,
                    255
                ],
                "extra-properties": {
                  "class": "Urban",
                  "area_m2": "400",
                  "other": [1, 2, 3, 4]
                }
            },

manics avatar Nov 03 '20 11:11 manics

I think I prefer Option 2! I don't see a big problem with duplicating the label-value key, and it also makes it clearer which attributes are spec-defined (e.g. colors) and which are custom properties. No naming conflicts, and without as much nesting as Option 4.

will-moore avatar Nov 03 '20 13:11 will-moore

Hi everyone,

Sorry for the late response - I've been finishing my honours thesis over the last few days so it's been packed.

Thanks for all the input! Having read through the suggestions here, I think @manics's concern about clashes with future reserved names is the biggest disadvantage of Option 3. The extra-properties or user-properties subkey would definitely solve this issue but seems less elegant.

Despite initially thinking Option 3 was the way to go, I now actually agree with @will-moore that Option 2 seems preferable, as it fully separates spec properties and user-defined properties.

@joshmoore how does that mesh with your longer term view of tabular metadata?

DragaDoncila avatar Nov 06 '20 23:11 DragaDoncila

I don't think we necessarily need to constrain ourselves to the equivalent of tabular metadata for the properties key, instead we could say it's an array of JSON values. These could be flat key-value dictionaries or arrays if the intention is to convert them to a table, but nested dictionaries could also be allowed.
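For the flat key-value case, a sketch of converting the array into table-like columns. `to_columns` is a hypothetical helper; it assumes flat dictionaries and fills missing fields with `None`.

```python
def to_columns(entries):
    """Convert a list of flat dicts into {column name: list of values}.

    Fields missing from an entry become None in that row.
    """
    names = []
    for entry in entries:
        for key in entry:
            if key not in names:
                names.append(key)  # keep first-seen column order
    return {name: [entry.get(name) for entry in entries] for name in names}


entries = [
    {"label-value": 1, "class": "Urban", "area_m2": "400"},
    {"label-value": 2, "class": "Water", "area_m2": "1532"},
    {"label-value": 3, "area_m2": "590"},  # no "class" given
]
table = to_columns(entries)
print(table)
```

Nested dictionaries would need an extra flattening rule before they fit this shape, which is one reason to keep the two use cases separate.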

manics avatar Nov 09 '20 14:11 manics

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/9

imagesc-bot avatar Nov 09 '20 16:11 imagesc-bot

> I don't think we necessarily need to constrain ourselves to the equivalent of tabular metadata for the properties key,

I was thinking about the reverse. I can see having the JSON keys be "deeper", but what does one do when one wants to add gigabytes of tabular data? It's not required to solve that now, but it will come up eventually.

For what it's worth, https://www.w3.org/TR/csv2json/ has some examples. Looks like the method there is a top-level object per row.

All that being said, I can definitely still see option 2 as a first non-breaking change that we iterate on.

cc: @manzt

joshmoore avatar Nov 09 '20 19:11 joshmoore

Hello, we were also thinking about image regions where objects overlap, see discussion here: https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/7

I am not sure, but maybe this could be tackled by something like:

"properties": [
            {
                "label-value": 1,
                "associated-label-values": [3],
                "class": "Urban",
                "area_m2": "400"
            },
            {
                "label-value": 2,
                "associated-label-values": [3],
                "class": "Water",
                "area_m2": "1532"
            },
            {
                "label-value": 3,
                "child-labels": [2, 1]
            }
]

This would mean that label 3 is a region where labels 1 and 2 overlap. It also means that the image region that semantically corresponds to label 1 is actually bigger, namely the union of the regions covered by label 1 and 3.

The "associated-label-values" is redundant with the "child-labels" and maybe should be removed. I added it here because, in practice, it could be good to see at one glance that label 1 alone does not fully cover "object 1" but only when combined with the image region covered by label 3.
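A sketch of how a reader might resolve the full region of an object from `child-labels` alone. The helper names and the tiny label image are made up for illustration, and plain nested lists stand in for an array.

```python
properties = [
    {"label-value": 1, "class": "Urban"},
    {"label-value": 2, "class": "Water"},
    {"label-value": 3, "child-labels": [2, 1]},
]


def values_covering(properties, label_value):
    """Pixel values that belong to the object `label_value`.

    That is the label itself plus every overlap label listing it as a child.
    """
    covering = {label_value}
    for entry in properties:
        if label_value in entry.get("child-labels", []):
            covering.add(entry["label-value"])
    return covering


def object_mask(label_image, properties, label_value):
    """Boolean mask of the full object, including overlap regions."""
    wanted = values_covering(properties, label_value)
    return [[pixel in wanted for pixel in row] for row in label_image]


# Made-up 2x4 label image: value 3 marks where objects 1 and 2 overlap.
label_image = [
    [1, 1, 3, 2],
    [1, 3, 3, 2],
]
mask = object_mask(label_image, properties, 1)
print(values_covering(properties, 1), mask)
```

Since this lookup works without "associated-label-values", that key is indeed derivable and mainly serves readability.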

@constantinpape, do you maybe have comments or suggestions?

tischi avatar Nov 10 '20 07:11 tischi

@tischi yes, I think this could be a good solution for overlapping labels.

I think this opens up a few more questions that are maybe also relevant for the overall discussion of the label properties:

  • If a field is present in one properties, does it need to be in all the others? E.g. do we need class in all elements in the property list?
  • If it needs to be present in all of them, then in this case class is not so trivial, because it would be a composite of Water and Urban.

constantinpape avatar Nov 10 '20 09:11 constantinpape

> If a field is present in one properties, does it need to be in all the others? E.g. do we need class in all elements in the property list?

I would say that if we go for the above list-based approach, we should not require a field to be present for all labels. If the storage layout were more table-based, then, I guess, yes, we would have to.

I think the above list-based approach is nice, as it provides a lot of flexibility in terms of different labels having more or less information attached to them.

The disadvantage that I see compared with a table-based approach is that it will require more storage space and could thus be quite slow to download and parse in order to, e.g., build a table from it.

Thus for use cases with millions of labels I am a bit worried about performance.
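To get a feel for the overhead, here is a sketch comparing the uncompressed JSON size of the two layouts for synthetic data. The values are invented, and compression on the server side would likely narrow the gap.

```python
import json

n_labels = 1000  # synthetic example size

# List-based layout: one object per label, keys repeated every time.
rows = [
    {"label-value": i, "class": "Urban", "area_m2": str(100 + i)}
    for i in range(1, n_labels + 1)
]

# Table-like layout: one list per property, keys stored once.
columns = {
    "label-value": [row["label-value"] for row in rows],
    "class": [row["class"] for row in rows],
    "area_m2": [row["area_m2"] for row in rows],
}

row_size = len(json.dumps(rows))
column_size = len(json.dumps(columns))
print(row_size, column_size)
```

The repeated key names account for the difference, so the overhead grows with both the number of labels and the number of properties.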

tischi avatar Nov 10 '20 09:11 tischi

I think we'll want both options: JSON-style nested dictionaries for arbitrary properties and support for tabular data. In the short term, JSON dictionaries are relatively easy to add to the spec, so it makes sense to start there.

manics avatar Nov 10 '20 09:11 manics

Whew. Ok. So it sounds like we have some points for future discussion, but generally a consensus that we could start building, no? @DragaDoncila, have you already started on a branch anywhere? If not but were looking to start, do you think you have everything you need for a first pass?

joshmoore avatar Nov 12 '20 14:11 joshmoore

@joshmoore I've started a branch, which has Option 1 already implemented. From what I read here, Option 2 is the consensus to start with, before we move on to adding support for tabular data. I think I have everything I need for a first pass, so I'll put up a WIP PR by Monday afternoon if that timeline is okay.

DragaDoncila avatar Nov 12 '20 23:11 DragaDoncila

Sounds amazing. Thanks, @DragaDoncila !

joshmoore avatar Nov 13 '20 08:11 joshmoore