sssom
Is there interest in an analog of SSSOM for NER/CR/text annotation?
There are a number of different tools that perform NER on text, from bioportal/zooma through to scispacy and @cthoyt's Gilda (https://www.biorxiv.org/content/10.1101/2021.09.10.459803v1.full).
These all vary in their output but are some variant of text span location and ID plus metadata for the matched concept.
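To make the common denominator concrete, here is a minimal sketch of such a record in Python. The class and field names (`TextAnnotation`, `subject_start`, `object_id`, and so on) are illustrative assumptions, not taken from any of the tools above, and the `EX:0001` identifier is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextAnnotation:
    """One match of a concept against a span of text (hypothetical model)."""

    subject_start: int                 # 0-based offset of the first character of the span
    subject_end: int                   # offset one past the last character (exclusive)
    subject_label: str                 # the matched text itself
    object_id: str                     # CURIE of the concept the span was grounded to
    object_label: Optional[str] = None
    confidence: Optional[float] = None
    annotator: Optional[str] = None    # name of the tool that produced the match


def annotate_first(text: str, mention: str, object_id: str, **metadata) -> TextAnnotation:
    """Build an annotation for the first occurrence of `mention` in `text`."""
    start = text.find(mention)
    if start < 0:
        raise ValueError(f"{mention!r} does not occur in the text")
    return TextAnnotation(start, start + len(mention), mention, object_id, **metadata)


text = "Mutations in BRCA1 increase breast cancer risk."
ann = annotate_first(text, "BRCA1", "EX:0001", annotator="example_tool")
```

The span is stored as offsets plus the matched string, so a consumer can check that `text[ann.subject_start:ann.subject_end]` still equals `ann.subject_label`.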
While the entity normalization step of NER could be seen as term matching, I think this is out of scope for SSSOM. However, I think it would make sense to have an SSSOM analog, where the SSSOM metadata element URIs are reused.
In fact I did a very quick and dirty first pass at this:
https://incatools.github.io/ontology-access-kit/datamodels/text-annotator/index.html https://github.com/INCATools/ontology-access-kit/blob/main/src/oaklib/datamodels/text_annotator.yaml
I think it would be useful to standardize on this, for applications like our https://github.com/monarch-initiative/ontorunner that wrap multiple different annotators for aggregating results, cc @hrshdhgd
cc @graybeal
Cool! Sounds interesting! No comment at the moment, but I think this looks useful.
Gilda is explicitly not an NER tool - it only does named entity normalization, which means you already have a piece of text that represents a named entity, and it figures out a grounding for it. Unfortunately, this is a very common misconception. I am a bit confused about what you mean by this issue, since I think we have a different understanding of some of the vocabulary that's used here.
ah that was just my misconception about gilda. The goal here is to represent the full CR step - NER plus grounding/normalization
- NER tools usually rely on other standards, e.g. BRAT (https://brat.nlplab.org/standoff.html)
- Ultimately the W3C has a standard for representing "Annotations", the Web Annotation Data Model (https://www.w3.org/TR/annotation-model/)
- There is frequent confusion between the words 'mapping' and 'annotation', coming from the fact that some would consider "mapping text to entities" a mapping and others an annotation. I would avoid adding to this confusion by making SSSOM flexible enough to represent things other than "ontological mappings", as its name indicates.
For these 3 reasons, I would not develop such an interest.
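For reference, a single text-span annotation in the Web Annotation Data Model might look roughly like the sketch below, built as a Python dict and serialized to JSON-LD. The document and concept IRIs under `example.org` are placeholders, and this is only one of several ways the model allows a span to be selected.

```python
import json

# A sketch of one annotation: the body is the concept, the target is a character range.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "id": "http://example.org/anno/1",            # placeholder IRI
    "type": "Annotation",
    "body": "http://example.org/concept/0001",    # placeholder concept IRI
    "target": {
        "source": "http://example.org/doc/1",     # placeholder document IRI
        "selector": {
            "type": "TextPositionSelector",
            "start": 13,   # offset of the first character of the span
            "end": 18,     # offset just past the last character
        },
    },
}

serialized = json.dumps(annotation, indent=2)
```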
OA seems much more general
Practical use case: the use of a standard format for genome coordinates (GFF) has allowed lots of different datasets and browsers (e.g. JBrowse) to be combined. It would be nice to have something similar for text annotations, so that we could use the same markup (e.g. the nice spacy markup) with different annotators. It would also be nice if this were a modern JSON-based serialization or a well-behaved TSV with a defined datamodel and a schema that can be used for validation, optionally with datamodel elements mapped to IRIs.
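As a rough sketch of what a "well-behaved TSV with validation" could mean in practice, the Python below reads annotation rows and checks each one against the source text. The column names, the `EX:` identifiers, and the `tool_a`/`tool_b` annotator names are all hypothetical, and the checks are a hand-rolled stand-in for a real schema.

```python
import csv
import io

# Hypothetical column set: subject_* locates the span, object_id is the grounded concept.
COLUMNS = ["subject_start", "subject_end", "subject_label", "object_id", "annotator"]


def validate_row(row, text):
    """Return a list of problems with one annotation row (empty list means valid)."""
    missing = [c for c in COLUMNS if not row.get(c)]
    if missing:
        return [f"missing values for: {', '.join(missing)}"]
    try:
        start, end = int(row["subject_start"]), int(row["subject_end"])
    except ValueError:
        return ["offsets must be integers"]
    problems = []
    if not 0 <= start < end <= len(text):
        problems.append("span is out of range")
    elif text[start:end] != row["subject_label"]:
        problems.append("subject_label does not match span")
    if ":" not in row["object_id"]:
        problems.append("object_id is not a CURIE")
    return problems


def read_annotations(tsv, text):
    """Parse a TSV string and pair every row with its validation problems."""
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
    return [(row, validate_row(row, text)) for row in reader]


TEXT = "BRCA1 is linked to breast cancer."
TSV = (
    "subject_start\tsubject_end\tsubject_label\tobject_id\tannotator\n"
    "0\t5\tBRCA1\tEX:0001\ttool_a\n"
    "19\t32\tbreast cancer\tEX:0002\ttool_b\n"
    "19\t30\tbreast cancer\tEX:0002\ttool_b\n"   # deliberately wrong end offset
)

for row, problems in read_annotations(TSV, TEXT):
    print(row["subject_label"], row["annotator"], problems or "ok")
```

Because every row carries its own `annotator` value, the output of several tools can be concatenated into one file and still be told apart.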
and to be clear: this is out of scope for SSSOM. I am exploring interest in an analog