
is there interest in an analog of SSSOM for NER/CR/text annotation?

Open cmungall opened this issue 2 years ago • 15 comments

There are a number of different tools that perform NER on text, from bioportal/zooma through to scispacy and @cthoyt's Gilda (https://www.biorxiv.org/content/10.1101/2021.09.10.459803v1.full).

These all vary in their output, but each produces some variant of a text span location and an ID, plus metadata for the matched concept.
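To make that common shape concrete, here is a minimal sketch in Python. The class name, field names, and the `EX:0000001` identifier are all hypothetical illustrations; they are not taken from any of the tools above or from an existing datamodel.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TextAnnotation:
    """One matched concept in a piece of text (hypothetical field names)."""

    start: int       # 0-based offset of the first character of the span
    end: int         # offset one past the last character of the span
    concept_id: str  # identifier (e.g. a CURIE) of the matched concept
    metadata: Dict[str, str] = field(default_factory=dict)  # label, annotator, ...

    def matched_text(self, text: str) -> str:
        """Return the substring of `text` covered by this annotation."""
        return text[self.start:self.end]


text = "Mutations in BRCA1 increase cancer risk."
ann = TextAnnotation(
    start=13,
    end=18,
    concept_id="EX:0000001",  # made-up identifier, for illustration only
    metadata={"label": "BRCA1", "annotator": "example-annotator"},
)
print(ann.matched_text(text))
```

Anything tool-specific (scores, match type, source ontology) would go into `metadata` in this sketch; a real model would probably promote the common ones to named fields.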

While the entity normalization step of NER could be seen as term matching, I think this is out of scope for SSSOM. However, I think it would make sense to have an SSSOM analog, where the SSSOM metadata element URIs are reused.

In fact I did a very quick and dirty first pass at this:

https://incatools.github.io/ontology-access-kit/datamodels/text-annotator/index.html
https://github.com/INCATools/ontology-access-kit/blob/main/src/oaklib/datamodels/text_annotator.yaml

I think it would be useful to standardize on this for applications like our https://github.com/monarch-initiative/ontorunner, which wraps multiple different annotators and aggregates their results, cc @hrshdhgd

cc @graybeal

cmungall avatar Apr 14 '22 23:04 cmungall

Cool! Sounds interesting! No comment at the moment, but I think this looks useful.

matentzn avatar Apr 15 '22 08:04 matentzn

Gilda is explicitly not an NER tool - it only does named entity normalization, which means you already have a piece of text that represents a named entity, and it figures out a grounding for it. Unfortunately, this is a very common misconception. I am a bit confused about what you mean by this issue, since I think we have a different understanding of some of the vocabulary that's used here.

cthoyt avatar Apr 15 '22 09:04 cthoyt

ah that was just my misconception about gilda. The goal here is to represent the full CR step - NER plus grounding/normalization

cmungall avatar Apr 15 '22 21:04 cmungall

  • NER tools usually rely on other standards, e.g., BRAT (https://brat.nlplab.org/standoff.html)
  • Ultimately, the W3C has a standard for representing "Annotations", the Web Annotation Data Model (https://www.w3.org/TR/annotation-model/)
  • The words 'mapping' and 'annotation' are frequently confused, because some would consider "mapping text to entities" a mapping and others an annotation. I would avoid adding to this confusion by making SSSOM so flexible that it represents things other than "ontological mappings", as its name indicates.

For these 3 reasons, I would not develop such an interest.

jonquet avatar Apr 19 '22 00:04 jonquet

OA seems much more general

Practical use case: the use of a standard format for genome coordinates (GFF) has allowed lots of different datasets and browsers (e.g. JBrowse) to be combined. It would be nice to have something similar for text annotations so that we could use the same markup (e.g. the nice spacy markup) with different annotators. It would also be nice if this were a modern JSON-based serialization or a well-behaved TSV with a defined datamodel, a schema that can be used for validation, optionally with datamodel elements mapped to IRIs
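As a rough sketch of what a "well-behaved TSV with a defined datamodel" could look like, the Python below writes annotations with a fixed header and validates them on the way back in. The column names and the validation rules are assumptions made up for this example, not an agreed format.

```python
import csv
import io

# Hypothetical column set; a real standard would define and document these.
COLUMNS = ["start", "end", "concept_id", "label"]


def to_tsv(rows):
    """Serialize a list of annotation dicts to a TSV string with a header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def from_tsv(tsv_text):
    """Parse and validate a TSV string; raise ValueError on bad input."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        start, end = int(row["start"]), int(row["end"])  # non-integers raise
        if not 0 <= start < end:
            raise ValueError(f"line {line_no}: invalid span {start}-{end}")
        rows.append({"start": start, "end": end,
                     "concept_id": row["concept_id"], "label": row["label"]})
    return rows


rows = [{"start": 13, "end": 18, "concept_id": "EX:0000001", "label": "BRCA1"}]
tsv = to_tsv(rows)
print(tsv)
```

Because every annotator's output would pass through the same header and the same checks, results from different tools could be concatenated and rendered by the same markup code, which is the GFF-style benefit described above.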

cmungall avatar May 09 '22 17:05 cmungall

and to be clear: this is out of scope for SSSOM. I am exploring interest in an analog

cmungall avatar May 09 '22 17:05 cmungall