
Label taxonomies

Open StephenChan opened this issue 6 years ago • 3 comments

This has come up several times over the years. Not sure if it'll happen anytime soon, but it should have a discussion thread, at least.

The idea comes up most often when talking about taxonomy. Labeling corals by genus is a valid labeling scheme, as is labeling corals by species. Some users may want to view data either by genus or by species without having to re-annotate everything.

If a point is labeled as the species Acanthastrea echinata, it should count as the genus Acanthastrea as well, without having to apply two labels to indicate that. It could also count as the family Lobophylliidae. This can be achieved if we have a label hierarchy, with Acanthastrea echinata being a sub-label of Acanthastrea, and Acanthastrea being a sub-label of Lobophylliidae.
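
As a minimal sketch of that idea (the `PARENT` mapping and both function names are hypothetical, not existing CoralNet code), a label's ancestors can be found by following parent links until the root:

```python
# Hypothetical parent map: each label points to its parent, or None at the root.
PARENT = {
    "Acanthastrea echinata": "Acanthastrea",
    "Acanthastrea": "Lobophylliidae",
    "Lobophylliidae": None,
}

def lineage(label, parent=PARENT):
    """Return the label followed by all of its ancestors, most specific first."""
    chain = []
    while label is not None:
        chain.append(label)
        label = parent[label]
    return chain

def counts_as(label, target, parent=PARENT):
    """True if an annotation with `label` should also count as `target`."""
    return target in lineage(label, parent)
```

With this, `counts_as("Acanthastrea echinata", "Lobophylliidae")` is true from a single species-level annotation, while the reverse direction is false.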

If we have a global hierarchical structure which is even remotely debatable (e.g., anything to do with Sand or Rock), then the inflexibility could make life difficult for any users who don't agree with it. So we would probably either have:

  • A global hierarchy which only adds structure to biological taxonomy labels, and leaves everything else as a flat structure; or
  • An interface for a source admin to define their own hierarchy of LocalLabels.

Things get interesting for the vision backend here. First, there are now multiple correct answers for a point's label, with some being more specific than others. Second, the vision backend could potentially report something like "I'm not very confident about this point's coral species, but I'm quite confident that it's of the genus Acropora, so I'll make that my main suggestion."
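
One naive way to get that second behavior, sketched under assumptions: suppose the classifier outputs a probability per label, sum those probabilities up the hierarchy, and suggest the most specific label whose summed probability clears a threshold. The `PARENT` map, the `suggest` function, and the 0.8 threshold are all made up for illustration; this is a simple heuristic, not a description of how the vision backend works.

```python
# Hypothetical hierarchy: species point to their genus, genera are roots.
PARENT = {
    "Acropora cervicornis": "Acropora",
    "Acropora palmata": "Acropora",
    "Porites lobata": "Porites",
    "Acropora": None,
    "Porites": None,
}

def depth(label):
    """Number of parent links between the label and its root."""
    d = 0
    while PARENT[label] is not None:
        label = PARENT[label]
        d += 1
    return d

def suggest(label_probs, threshold=0.8):
    """Return the deepest label whose rolled-up probability meets the threshold."""
    totals = {}
    for label, p in label_probs.items():
        node = label
        while node is not None:  # add p to the label and every ancestor
            totals[node] = totals.get(node, 0.0) + p
            node = PARENT[node]
    confident = [n for n, p in totals.items() if p >= threshold]
    if not confident:
        return None  # nothing is confident enough at any level
    # Prefer the most specific label; break ties by higher probability.
    return max(confident, key=lambda n: (depth(n), totals[n]))
```

For example, probabilities of 0.5 for Acropora cervicornis, 0.45 for Acropora palmata, and 0.05 for Porites lobata give no confident species, but the genus Acropora sums to 0.95 and becomes the suggestion.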

A related idea is the concept of custom label groups (#73). That could potentially address some of the itch from users who have suggested a label hierarchy. They could have labels specifying coral species, a set of custom groups for coral genera, and another set of custom groups for coral families.

StephenChan, Jun 07 '18 10:06

Various thoughts from core team emails in the past week:

There would be a taxonomy for, say, species/genera/etc., but also taxonomies for, say, disease, bleaching status, morphology, etc. These taxonomies are likely smaller than the "tree of life" taxonomy.

If we have verified/unverified labels (or public/private), then I suppose it makes sense for taxonomy instances to be verified/unverified (public/private) as well.

We should stay away from pushing pre-defined labelsets / taxonomies onto people. The folks at AIMS tried to work on the Catami scheme for years, and from what I understand it's still being changed. ... Anyone can define their own taxonomy for their source which points to the shared labels. All thus defined taxonomies will be made public on the site (perhaps we even have a nice graph tool to display them?), and similar to what we do with labels, we order the taxonomies by the number of annotations they cover, the number of sources that use them, and so on. So organically there will emerge a few taxonomies that perhaps get large adoption, although I wouldn't bet on that either. Once we feel good about transfer learning, we could even try attaching pre-trained (shared) classifiers to the taxonomies.

Each source could adopt multiple label-sets/taxonomies, e.g. one for tree-of-life, one for bleaching, etc.

[We could consider] changing the label-sets to taxonomies. Were it not for the automated annotations, this change would mostly affect cover estimation and similar post-processing, as well as the annotation tool UI. However, since we want to support automated annotations, we'd need to think about how to do that. Basically, for each decision, the classifier has to somehow decide how far down the tree it should predict. I think this requires real thought and effort to work well.

No doubt folks won't agree on a taxonomy. They certainly need to be able to create their own. Maybe even start with a base taxonomy and edit it. I'm not sure how hard it will be to allow users to create and modify taxonomies. Do they create one in a CSV/JSON file and upload? Through a GUI? Is it strictly a tree or can it be a DAG? Do they pick leaves as labels, and pick up the higher parts of the tree from a public taxonomy?

I envision that all nodes, including leaves, are labels from the public taxonomy.

StephenChan, Mar 08 '20 09:03

Detailed impact of adding taxonomies

For this post, I'm just going to focus on source-specific taxonomies, not yet worrying about how to present public ones.

Database relationships

Building on Oscar's idea that labelsets can become taxonomies: Currently, a Labelset is a flat collection of source-owned LocalLabels (which in turn point to global Labels). It may be sufficient to simply give LocalLabel a 'parent' field. This field would be a foreign key to another LocalLabel in the same Labelset.

If parent is null, it means the LocalLabel is the root of its taxonomy tree. If not null, the parent indicates the LocalLabel's position in a taxonomy tree. This scheme, without further restrictions, allows LocalLabels to be organized in multiple taxonomy trees, and also allows leaving some LocalLabels as standalone nodes.

A parent field seems sufficient for a tree, but if we actually have a DAG in general, then we either want support for multiple 'parents', or we store the edges as a separate table (two fields: source LocalLabel, destination LocalLabel).
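
To make the two options concrete, here is a rough schema sketch. It uses Python's standard-library `sqlite3` module only to stay self-contained; the table and column names (`local_label`, `parent_id`, `local_label_edge`, `child_id`) are placeholders I made up, not the actual models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Tree option: a nullable self-referencing parent column.
CREATE TABLE local_label (
    id INTEGER PRIMARY KEY,
    code TEXT NOT NULL,
    parent_id INTEGER REFERENCES local_label(id)  -- NULL marks a root
);
-- DAG option: one row per edge, so a label may have several parents.
CREATE TABLE local_label_edge (
    parent_id INTEGER NOT NULL REFERENCES local_label(id),
    child_id INTEGER NOT NULL REFERENCES local_label(id),
    PRIMARY KEY (parent_id, child_id)
);
""")

# Two trees in one labelset: Porites with one child, and a standalone Sand node.
conn.executemany(
    "INSERT INTO local_label (id, code, parent_id) VALUES (?, ?, ?)",
    [(1, "Porites", None), (2, "Porites lobata", 1), (3, "Sand", None)],
)

# Roots (including standalone labels) are simply the rows with no parent.
roots = [row[0] for row in conn.execute(
    "SELECT code FROM local_label WHERE parent_id IS NULL ORDER BY id")]
```

In the tree option, standalone labels and tree roots look identical (both have a null parent), which matches the scheme described above.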

Annotation tool

For the machine suggestions, we might want some way of controlling how far down the taxonomy tree the suggestions go. Some people might want all corals labeled down to species level, using the genus-level label solely for data aggregation purposes. Other people might want to label certain types of Porites down to species level, while labeling others as simply the Porites genus, as a way of saying 'This is some other type of Porites that we're not focusing on as much'.

We might consider organizing the label buttons (at the bottom of the tool) in a way that indicates the taxonomies. Probably not critical, but would be nice.

Source labelset pages

We need an interface to edit a source's taxonomies (parent fields of LocalLabels).

  • Keeping it simple, this can be a tabular form with a parent field on each row, similar to the 'Edit label codes' page. Either a dropdown or a text field to specify the parent label. As you specify parents, a sideways tree view (assuming we want a tree, not a DAG) gets updated with the taxonomic relationships. Something like this:

    [Image: issue-165_sideways-tree]

    I assume we'll generally have more breadth than depth in the tree, so a sideways tree like this will mostly expand vertically, which suits webpages better than expanding horizontally.

  • If you specify a parent which creates a diamond or cycle in the taxonomy, it has to be disallowed somehow.

  • Editing the taxonomy might have to trigger a backend reset of some sort.
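
The cycle check from the list above could be sketched roughly as follows. This assumes the single-parent (tree) design, where the existing parent links are already acyclic; in that design a diamond cannot arise, because a diamond needs a label with two parents. The function name and the dict representation are hypothetical.

```python
def would_create_cycle(parent, child, new_parent):
    """Return True if setting parent[child] = new_parent would create a cycle.

    `parent` maps each label to its current parent (None for roots) and is
    assumed to be acyclic already.
    """
    node = new_parent
    while node is not None:
        if node == child:
            # `child` is new_parent itself or one of its ancestors.
            return True
        node = parent.get(node)
    return False
```

The form would reject the edit when this returns true, for example when someone tries to make Porites lobata the parent of its own family. (Poritidae is used here just as an illustrative family-level label.)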

Import and export from/to CSV should include the taxonomic information.

Source browse pages

For Browse Patches, if Porites is the parent of Porites lobata, a search for Porites patches should include Porites lobata patches.

Actually, we have to watch for descendants in general, not just children. The database query might get somewhat complicated, unless we maintain extra database links directly between ancestors and descendants.
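
If we go without the extra ancestor-descendant links, one option is a recursive query. Below is a sketch using a recursive common table expression in SQLite via Python's `sqlite3`; the table layout and the `label_and_descendants` helper are assumptions for illustration, and whether this performs acceptably on the real database is untested.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE local_label (id INTEGER PRIMARY KEY, code TEXT, parent_id INTEGER)")
conn.executemany("INSERT INTO local_label VALUES (?, ?, ?)", [
    (1, "Poritidae", None),
    (2, "Porites", 1),
    (3, "Porites lobata", 2),
    (4, "Porites rus", 2),
    (5, "Sand", None),
])

# Start from the searched label, then repeatedly add children of rows found so far.
DESCENDANTS_SQL = """
WITH RECURSIVE subtree(id) AS (
    SELECT id FROM local_label WHERE code = ?
    UNION ALL
    SELECT l.id FROM local_label AS l JOIN subtree AS s ON l.parent_id = s.id
)
SELECT code FROM local_label WHERE id IN (SELECT id FROM subtree) ORDER BY id
"""

def label_and_descendants(code):
    """Return the given label plus every label below it in the tree."""
    return [row[0] for row in conn.execute(DESCENDANTS_SQL, (code,))]
```

A patch search for Porites would then filter annotations on the labels returned by `label_and_descendants("Porites")`, which covers Porites lobata and Porites rus as well.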

Export

Export image covers should account for the taxonomies: when counting annotations of the Porites genus, for example, annotations of Porites lobata should be included. As a result, percentages across all labels will add up to more than 100%.
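
A small sketch of that roll-up, with a made-up `PARENT` map and function name: each point counts toward its own label and toward every ancestor, and percentages are taken over the number of points.

```python
from collections import Counter

# Hypothetical hierarchy for one source.
PARENT = {
    "Porites lobata": "Porites",
    "Porites rus": "Porites",
    "Porites": None,
    "Sand": None,
}

def cover_percentages(point_labels):
    """Percent cover per label, counting each point for its label and all ancestors."""
    total = len(point_labels)
    if total == 0:
        return {}
    counts = Counter()
    for label in point_labels:
        node = label
        while node is not None:
            counts[node] += 1
            node = PARENT[node]
    return {label: 100.0 * n / total for label, n in counts.items()}
```

For five points labeled Porites lobata, Porites lobata, Porites rus, Porites, and Sand, this gives Porites 80%, Porites lobata 40%, Porites rus 20%, and Sand 20%, which sums to 160%.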

Source backend page (example)

We might carefully consider how to change reporting of accuracy scores. If the machine labels all Porites as the Porites genus, but there are Porites species labels available, should we still give a 100% score?

Confusion matrices have a similar concern. Porites lobata getting labeled as Porites is less a case of 'confusion' and more a case of playing it safe.
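
One possible way to report this, sketched with hypothetical names: sort each prediction into 'exact', 'less specific' (the machine picked an ancestor of the true label), or 'wrong', and then report both a strict and a lenient accuracy. This is just one candidate scoring policy, not a decision.

```python
# Hypothetical hierarchy for one source.
PARENT = {"Porites lobata": "Porites", "Porites": None, "Acropora": None}

def is_ancestor(ancestor, label):
    """True if `ancestor` lies strictly above `label` in the tree."""
    node = PARENT[label]
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT[node]
    return False

def outcome(truth, predicted):
    """Classify one prediction against the human-confirmed label."""
    if predicted == truth:
        return "exact"
    if is_ancestor(predicted, truth):
        return "less specific"  # correct, but playing it safe
    return "wrong"

def scores(pairs):
    """Return (strict, lenient) accuracy over (truth, predicted) pairs."""
    outcomes = [outcome(t, p) for t, p in pairs]
    n = len(outcomes)
    strict = outcomes.count("exact") / n
    lenient = (n - outcomes.count("wrong")) / n
    return strict, lenient
```

Note that a prediction more specific than the confirmed label (Porites lobata when the annotator only said Porites) falls under 'wrong' here; whether that is the right call is another open question.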

Label detail page (example)

This page shows how many annotations across the site are using the label, and also shows example patches from various sources. In doing so, we should respect the taxonomies of each source we draw from. Note that the results might not be satisfying if there is significant disagreement between sources on how to categorize a particular label.

API deploy

Deploy should take the taxonomy specification along with the labelset.

Other

In the above discussion, I've been treating 'taxonomy' as a characteristic of a labelset. Do we instead want to re-brand labelsets as taxonomies, or something similar?

Overall impression

The ancestor-descendant database queries and the tree drawing are medium-level concerns: might be challenging, but likely doable. Otherwise, the changes seem reasonably straightforward on my end. (Let me know if any of my other assumptions may be off though, such as not worrying about general DAGs.) The machine suggestions and scoring seem to be the trickiest part.

StephenChan, Mar 08 '20 21:03

We've determined that some substantial ML research is needed to make machine suggestions work with taxonomies. So this isn't likely to happen very soon, but we'll look out for opportunities to fund that research.

StephenChan, Mar 09 '20 23:03