Ignoring duplicate exact synonyms that are acronyms in robot report

Open allenbaron opened this issue 1 year ago • 8 comments

Given https://github.com/information-artifact-ontology/ontology-metadata/issues/135, is the plan now for robot report to exclude from warnings duplicate exact synonyms that are annotated as acronyms? Overlapping acronyms are fairly common.

This is a follow-up to the slightly tangential comment made in https://github.com/ontodev/robot/issues/748#issuecomment-703806593 by dosumis.

Slightly tangential, but we really need a way to mark synonyms as allowable duplicates of labels (maybe using synonym type?). We have many cases in FBbt where the same acronym is used in the literature for multiple distinct anatomical structures (pretty common in anatomy). We add these as synonyms with a reference to back them up. This is frequently useful to anyone looking to find a term based on what they find in the literature, curators and users alike. I guess the rule originally comes from GO, where this is less of an issue with names for processes/MFs?

allenbaron avatar Jan 08 '24 17:01 allenbaron

@allenbaron I will help push this through. Do you know SPARQL? Could you try to redesign this query to achieve this goal: https://github.com/ontodev/robot/blob/master/robot-core/src/main/resources/report_queries/duplicate_exact_synonym.rq

If you have trouble with this you can ping @anitacaron (on slack also) who may have a soft spot for someone with QC related SPARQL problems :)

matentzn avatar Jan 11 '24 18:01 matentzn

One caveat: if we do this, we have to use FILTER NOT EXISTS, which is extremely slow. Keep that in mind when you write this, and try it on something like DO, HPO and UBERON to be sure that it won't be too inefficient.

matentzn avatar Jan 11 '24 18:01 matentzn

Isn't it another exception for the label-synonym-polysemy-violation?

There's already an exception for abbreviation (OMO:0003000)

anitacaron avatar Jan 11 '24 18:01 anitacaron

Yes, acronym (OMO:0003012) is a new synonym type that would also be an exception.

Honestly, the query at UBERON linked by @anitacaron (with minor modification) is probably the best bet for updating the duplicate_exact_synonym.rq query in ROBOT. Using a subquery only slows things down a bit compared to the current query but it's definitely simpler and probably faster for managing exceptions. I think the only changes to it would be:

  1. Remove rdfs:label from VALUES statement.
  2. Add a VALUES statement for the exceptions (abbreviation & now acronym).
  3. Possibly drop the use of UCASE.
    • The current duplicate_exact_synonym.rq query will not report duplicate synonyms that differ in case or language tag (#748). Were those intentional design choices? Just noting that the UBERON query also will not report duplicate synonyms if they differ in language tag.
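
For anyone following along, the effect of changes 2 and 3 can be sketched outside SPARQL. This is a minimal plain-Python illustration, not the ROBOT or UBERON query: the `(term_id, text, synonym_type)` tuple layout, the function name, and the `EX:` term IDs are all made up for the example, and it assumes exempt synonyms are dropped before any comparison.

```python
from collections import defaultdict

# Synonym types named in this thread as exceptions (written as CURIEs).
EXEMPT_TYPES = {"OMO:0003000", "OMO:0003012"}  # abbreviation, acronym


def duplicate_exact_synonyms(synonyms, ignore_case=True):
    """Return {synonym text: sorted term IDs} for texts shared by 2+ terms.

    `synonyms` is an iterable of (term_id, text, synonym_type) tuples,
    where synonym_type is a CURIE string or None. This layout is
    hypothetical and is not ROBOT's data model.
    """
    terms_by_text = defaultdict(set)
    for term_id, text, synonym_type in synonyms:
        if synonym_type in EXEMPT_TYPES:
            continue  # abbreviations and acronyms are allowed to overlap
        # Upper-casing stands in for UCASE in the SPARQL query.
        key = text.upper() if ignore_case else text
        terms_by_text[key].add(term_id)
    return {
        text: sorted(terms)
        for text, terms in terms_by_text.items()
        if len(terms) > 1
    }


# Made-up example data.
example = [
    ("EX:1", "AL", "OMO:0003012"),      # acronym, exempt
    ("EX:2", "AL", "OMO:0003012"),      # acronym, exempt
    ("EX:3", "Antennal lobe", None),
    ("EX:4", "antennal lobe", None),    # differs from EX:3 only in case
]
print(duplicate_exact_synonyms(example))
# {'ANTENNAL LOBE': ['EX:3', 'EX:4']}
print(duplicate_exact_synonyms(example, ignore_case=False))
# {}
```

With case folding on, the two "antennal lobe" synonyms are reported while the shared acronym is not; with it off, nothing is reported, which matches the current case-sensitive behaviour described above.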

I know @jamesaoverton is particularly concerned with ROBOT's backward compatibility, which I appreciate. Would these changes be a concern in that regard?

allenbaron avatar Jan 12 '24 18:01 allenbaron

I decided to look more closely at execution time differences using doid-edit.owl and uberon.owl (because I had it on hand, not the edit file).

Just switching to the subquery approach, without adding the exclusion of synonym types or using UCASE, takes about 1.07-1.43 times longer (DO: current = 6.13s, subquery = 6.57s; UBERON: current = 17.8s, subquery = 25.4s). Adding in the exclusion and UCASE slows things down further by ~2s for both DO and UBERON.
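
The slowdown factors follow directly from the timings quoted above. A quick check, using only those four numbers:

```python
# Timings in seconds, as reported in this comment.
timings = {
    "DO": {"current": 6.13, "subquery": 6.57},
    "UBERON": {"current": 17.8, "subquery": 25.4},
}

# Slowdown factor = subquery time / current query time.
ratios = {name: t["subquery"] / t["current"] for name, t in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# DO: 1.07x
# UBERON: 1.43x
```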

allenbaron avatar Jan 12 '24 21:01 allenbaron

@allenbaron thanks for the analysis!

Possibly drop the use of UCASE

I personally think we should introduce this now. I cannot imagine a single case where the duplicate synonym check should be case sensitive. Of course, this needs to be well documented!

variation in case or language tag

This is much more complicated, as you would want to

  1. reject duplicates within the same language and
  2. permit duplicates across languages.

Not sure how this should be solved!
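
One possible shape for rules 1 and 2, sketched in plain Python rather than SPARQL: group by the pair (case-folded text, language tag), so only synonyms sharing both are compared. The `(term_id, text, lang)` tuple layout and the `EX:` IDs are made up, and treating untagged literals as their own group is an assumption, not a settled decision.

```python
from collections import defaultdict


def duplicates_within_language(synonyms):
    """Return {(text, lang): sorted term IDs} for same-language duplicates.

    `synonyms` is an iterable of (term_id, text, lang) tuples; lang is a
    language tag such as "en", or None for an untagged literal.
    This layout is hypothetical.
    """
    terms_by_key = defaultdict(set)
    for term_id, text, lang in synonyms:
        # Compare text case-insensitively; language tags are also folded.
        # Untagged literals (lang=None) only collide with other untagged ones.
        key = (text.upper(), lang.lower() if lang else None)
        terms_by_key[key].add(term_id)
    return {
        key: sorted(terms)
        for key, terms in terms_by_key.items()
        if len(terms) > 1
    }


# Made-up example data.
example = [
    ("EX:1", "tumor", "en"),
    ("EX:2", "Tumor", "en"),  # same language as EX:1, so reported
    ("EX:3", "Tumor", "de"),  # different language, so permitted
    ("EX:4", "tumor", None),  # untagged, kept in a separate group
]
print(duplicates_within_language(example))
# {('TUMOR', 'en'): ['EX:1', 'EX:2']}
```

Whether an untagged synonym should instead collide with every tagged one is exactly the kind of question that would still need deciding.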

Do you want to make a PR and see how it goes?

matentzn avatar Jan 15 '24 11:01 matentzn

As an alternative to creating an exclusion for abbreviations and acronyms, could we introduce a new synonym predicate, something like skos:closeMatch for synonyms oboInOwl:hasCloseSynonym?

I guess a new synonym predicate probably has more cons than pros. If we were really going to do something like this, we probably should've just made abbreviations and acronyms their own synonym predicates instead of making them synonym types.

I'll work to open a PR for updating the SPARQL query soon.

allenbaron avatar Jan 18 '24 16:01 allenbaron

could we introduce a new synonym predicate, something like skos:closeMatch for synonyms oboInOwl:hasCloseSynonym?

I don't think we should use that system for acronyms, which are "exact" synonyms, but now that you say this - it seems super weird to me that there are no close synonyms! I never noticed that! Wow!

I'll work to open a PR for updating the SPARQL query soon.

Thanks!!!

matentzn avatar Jan 18 '24 18:01 matentzn