uberon polysemy issues

Open vasvir opened this issue 3 years ago • 22 comments

Hi,

I was trying to use UBERON (uberon-simple.obo from https://github.com/obophenotype/uberon/releases/download/v2022-04-05/uberon-simple.obo) for text mining when I noticed the following polysemy issues.

Each of the following terms is assigned to more than one UBERON ID, considering only term names and EXACT synonyms.

Uberon terms

  | Gasserian ganglion |   2 | UBERON:0001675, UBERON:3011045 |
  | nV                 |   2 | UBERON:0002633, UBERON:2005093 |
  | temporal pole      |   2 | UBERON:0002576, UBERON:0006479 |
  | trochlea of femur  |   2 | UBERON:4200096, UBERON:7500077 |
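
A check like the one described above can be sketched in a few lines of Python. This is only an illustration, not the script used in this thread: the function name `find_polysemy`, the `EX:` identifiers, and the simplified line-based OBO parsing are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Matches lines such as: synonym: "some text" EXACT [xref]
SYNONYM_RE = re.compile(r'^synonym:\s+"((?:[^"\\]|\\.)*)"\s+(\w+)')


def find_polysemy(obo_text, case_sensitive=False):
    """Return {text: [ids]} for names / EXACT synonyms used by more than one term."""
    ids_by_key = defaultdict(set)

    def flush(term):
        # Record the texts of a finished, non-obsolete [Term] stanza.
        if term and term["id"] and not term["obsolete"]:
            for text in term["texts"]:
                key = text if case_sensitive else text.lower()
                ids_by_key[key].add(term["id"])

    term = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            flush(term)
            # Only [Term] stanzas are inspected; [Typedef] etc. are skipped.
            term = {"id": None, "obsolete": False, "texts": []} if line == "[Term]" else None
        elif term is None:
            continue
        elif line.startswith("id:"):
            term["id"] = line[3:].strip()
        elif line.startswith("name:"):
            term["texts"].append(line[5:].strip())
        elif line.startswith("is_obsolete:"):
            term["obsolete"] = line[12:].strip() == "true"
        else:
            match = SYNONYM_RE.match(line)
            if match and match.group(2) == "EXACT":
                term["texts"].append(match.group(1))
    flush(term)
    return {key: sorted(ids) for key, ids in ids_by_key.items() if len(ids) > 1}


# Hypothetical miniature ontology (the EX: ids are made up for illustration).
SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: temporal pole

[Term]
id: EX:0000002
name: temporal pole region
synonym: "Temporal pole" EXACT []
synonym: "pole" RELATED []

[Term]
id: EX:0000003
name: unrelated structure
"""

print(find_polysemy(SAMPLE))
print(find_polysemy(SAMPLE, case_sensitive=True))
```

With case folding enabled, the name of the first term clashes with the EXACT synonym of the second; with `case_sensitive=True` the two strings differ and nothing is reported.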

Bug description

How should I proceed with resolving these polysemy issues? Should I create an individual issue/PR for each term?

Apologies if this is not the correct place to raise such an issue. I looked for a mailing list or a forum but didn't find any. Any pointers in the right direction are much appreciated.

Thanks in advance.

Vassilis

vasvir avatar Apr 18 '22 09:04 vasvir

Thank you @vasvir!

How comfortable are you editing text files?

If you make a pull request in this file:

https://github.com/obophenotype/uberon/blob/master/src/ontology/uberon-edit.obo

you can change the synonyms from exact to related there with a search and replace. Tag me in the PR!
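
For anyone who would rather script the edit than do it by hand, here is a minimal Python sketch of the same search and replace, restricted to a single term. The function name `demote_exact_synonym` and the `EX:` identifiers are hypothetical, and deciding which of the clashing terms should keep the EXACT synonym remains a curation decision that the code does not make.

```python
def demote_exact_synonym(obo_text, term_id, synonym_text):
    """Change one synonym of one term from EXACT to RELATED, leaving other terms alone."""
    old = f'synonym: "{synonym_text}" EXACT'
    new = f'synonym: "{synonym_text}" RELATED'
    current_id = None
    out = []
    for line in obo_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("["):
            # A new stanza starts; forget the previous term id.
            current_id = None
        elif stripped.startswith("id:"):
            current_id = stripped[3:].strip()
        elif current_id == term_id and stripped.startswith(old):
            line = line.replace(old, new, 1)
        out.append(line)
    return "".join(out)


# Hypothetical stanzas for illustration only.
SAMPLE_EDIT = '''[Term]
id: EX:0000001
name: first nerve
synonym: "nV" EXACT []

[Term]
id: EX:0000002
name: second nerve
synonym: "nV" EXACT [EX:ref]
'''

result = demote_exact_synonym(SAMPLE_EDIT, "EX:0000002", "nV")
print(result)
```

Only the synonym line inside the stanza of the requested term is rewritten; the identical synonym on the other term stays EXACT.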

matentzn avatar Apr 18 '22 09:04 matentzn

will assign ticket to you @vasvir (please let us know if you need help etc.) as I'm trying to make sure tickets are assigned in uberon :) thanks

shawntanzk avatar May 02 '22 09:05 shawntanzk

Hi @matentzn and everybody else. I would like your advice if possible.

I have created (retrofitted, actually) a Perl script that checks an OBO ontology for polysemy issues. Here is what the output looks like:

$ src/scripts/obo_check_polysemy.pl src/ontology/uberon-edit.obo
Set synonym_level: EXACT case sensitivity: 0 for file: src/ontology/uberon-edit.obo
2 ids for name: 'nv' ids: UBERON:0002633, UBERON:2005093, 
2 ids for name: 'phy' ids: UBERON:2000438, UBERON:8440028, 
2 ids for name: 'prelimbic area' ids: UBERON:0013560, UBERON:8440032, 
Problems found: 3

$ src/scripts//obo_check_polysemy.pl -h
Usage: /home/bill/workspace/processing_scripts/obo_check_polysemy.pl [-l|--synonym-level {EXACT|NARROW|RELATED|BROAD}] [-s|case-sensitive] ontology-file.obo

I wonder whether you would consider making this part of the project's CI (provided I fix the existing issues first). If there is interest, I will of course create the necessary PR (with some much-appreciated guidance).

I am totally flexible on the license.

vasvir avatar Sep 01 '22 08:09 vasvir

@vasvir, thank you for working on this.

There are two things to consider here:

  1. We are trying very hard to bury Perl as a dependency. This is not a problem, however, because we can help you migrate your test to SPARQL instead, once we understand it.
  2. I don't yet understand what polysemy you are checking. We have a ton of QC against it, like https://robot.obolibrary.org/report_queries/duplicate_exact_synonym. Can you explain, for example, the three errors you found?

matentzn avatar Sep 01 '22 09:09 matentzn

I am not familiar with SPARQL (that will not be a problem though) but the link you sent looks like exactly what I am trying to accomplish (maybe in a case insensitive way).

Is there a way to run this test locally given I have a current git checkout of the project?

Thanks for the quick reply.

vasvir avatar Sep 01 '22 09:09 vasvir

sh run.sh make robot_reports

If you have docker installed!

matentzn avatar Sep 01 '22 09:09 matentzn

Thanks a lot @matentzn That covers a lot of ground.

I managed to run it and everything comes out nicely. The existing checks cover 99% of the cases. Here are my 2 remaining issues.

  1. The SPARQL tests do not check for a clash between the name of one term and an EXACT synonym of another term (only synonym vs. synonym). Such a case was handled by PR #2470, where the label of UBERON:0002576 (temporal pole) was an exact synonym of UBERON:0006479.

  2. The tests are not case insensitive, so nV != NV. I understand this is a much harder sell, but consider that a) for applications like text mining a case-insensitive search is often preferred and b) a case-insensitive clash often reveals a hasty abbreviation.

So if there is any interest I am prepared to give them a try.

vasvir avatar Sep 01 '22 11:09 vasvir

  1. Can you review this check and see if you see anything wrong with it? https://robot.obolibrary.org/report_queries/duplicate_label_synonym I hoped that check would cover it!
  2. This is true. Good point. It would, however, require us to duplicate all checks to take casing into account. I would like to know the impact of that, and whether we can at least reduce it to a single query (maybe a modification of https://robot.obolibrary.org/report_queries/duplicate_label_synonym)? Here is a potential query to test:
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?entity ?property ?value WHERE {
  VALUES ?property {
    obo:IAO_0000118
    oboInOwl:hasExactSynonym
    rdfs:label
  }
  FILTER NOT EXISTS { ?entity owl:deprecated true }
  FILTER NOT EXISTS { ?entity2 owl:deprecated true }
  ?entity rdfs:subClassOf <http://purl.obolibrary.org/obo/UBERON_0001062> .
  ?entity2 rdfs:subClassOf <http://purl.obolibrary.org/obo/UBERON_0001062> .
  ?entity ?property ?value .
  ?entity2 ?property ?value2 .
  FILTER (!isBlank(?entity))
  FILTER (!isBlank(?entity2))
  FILTER(lcase(str(?value)) = lcase(str(?value2)))
}
ORDER BY ?entity
LIMIT 1

to test:

sh run.sh robot query -i uberon-edit.obo --query query.sparql output.tsv

matentzn avatar Sep 01 '22 11:09 matentzn

This is fun! Thanks a lot.

  1. duplicate_label_synonym sort of works. It issues a warning, which is correct when the label and the synonym belong to the same term (ID); in that case the duplication is not an error. However, when a label equals a synonym across different terms (IDs), it should raise an error rather than a warning as it currently does, IMHO.

  2. duplicate_label_synonym may be a good place to start, but I had duplicate_exact_synonym in mind, which also needs to be case insensitive. Furthermore, it features FILTER (?entity != ?entity2), which strikes me as a good condition.

I did test your variant of duplicate_label_synonym, though. I don't understand the subClassOf lines; if I remove them, the query takes forever. I removed the LIMIT, however, and got these results, which are not correct IMO.

?entity	?property	?value
<http://purl.obolibrary.org/obo/UBERON_0000465>	<http://www.w3.org/2000/01/rdf-schema#label>	"material anatomical entity"
<http://purl.obolibrary.org/obo/UBERON_0000466>	<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>	"immaterial physical anatomical entity"
<http://purl.obolibrary.org/obo/UBERON_0000466>	<http://www.w3.org/2000/01/rdf-schema#label>	"immaterial anatomical entity"
<http://purl.obolibrary.org/obo/UBERON_0000477>	<http://www.w3.org/2000/01/rdf-schema#label>	"anatomical cluster"
<http://purl.obolibrary.org/obo/UBERON_6004520>	<http://www.w3.org/2000/01/rdf-schema#label>	"insect mouthpart"

I think this query returns everything because it checks e1.label = e2.label or e1.synonym = e2.synonym even when e1 = e2, which obviously holds, and the subClassOf lines are a way to limit the output for testing purposes.
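
The effect described above is easy to reproduce outside SPARQL. The following Python sketch is only an analogy of the self join: the rows and `EX:` identifiers are made up, and `self_join` is a hypothetical helper. It compares every annotation with every other one and shows what changes once a row is no longer allowed to match its own entity, which is the role FILTER (?entity != ?entity2) plays in duplicate_exact_synonym.

```python
from itertools import product

# Hypothetical (entity, property, value) rows standing in for the ontology.
ANNOTATIONS = [
    ("EX:1", "label", "temporal pole"),
    ("EX:2", "exact_synonym", "Temporal pole"),
    ("EX:3", "label", "insect mouthpart"),
]


def self_join(rows, require_distinct):
    """Pair every row with every row and keep case-insensitive value matches."""
    hits = set()
    for (e1, p1, v1), (e2, _p2, v2) in product(rows, repeat=2):
        if require_distinct and e1 == e2:
            # Skip the trivial match of an entity with itself.
            continue
        if v1.lower() == v2.lower():
            hits.add((e1, p1, v1))
    return sorted(hits)


naive = self_join(ANNOTATIONS, require_distinct=False)
strict = self_join(ANNOTATIONS, require_distinct=True)
print(len(naive), "rows without the inequality filter")
print(strict)
```

Without the inequality every row matches itself, so the whole input comes back; with it, only the two rows that genuinely clash across entities remain. The pairwise comparison is also quadratic in the number of rows, which is consistent with the query becoming very slow on the full ontology.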

Let me educate myself a bit in SPARQL and I will be back with more queries (pun semi-intended :smile:).

vasvir avatar Sep 01 '22 14:09 vasvir

Hahah have fun @vasvir!!

ERROR levels are configurable. In my QC-police view of the world, all checks should be set to ERROR, with exceptions hard-coded into the SPARQL query!

Here is how we handle "exceptions to the rule" in Mondo: https://github.com/monarch-initiative/mondo/blob/master/src/sparql/qc/mondo/qc-excluded-subsumption-is-inferred.sparql#L15

That way you can set everything to ERROR, which means QC will fail, but still have the flexibility to grant valid exceptions.

matentzn avatar Sep 01 '22 14:09 matentzn

BTW: https://oboacademy.github.io/obook/tutorial/sparql/

matentzn avatar Sep 01 '22 14:09 matentzn

Thanks for the pointers @matentzn

I tried the naive approach with an (implicit) self join, which worked for my minimal test case but ran forever on the full UBERON.

I changed strategy and used GROUP BY and COUNT DISTINCT, which gave the query planner a chance to optimize the query reasonably.

The query below has two enhancements over the existing duplicate_exact_synonym.

  1. It also checks names (not only synonyms). All combinations are covered (name-name, name-synonym, synonym-synonym).
  2. It works in a case-insensitive manner.

A possible issue is that the output is not in the same form as that of the previous query.

The output matches that of my original Perl script, as it should, flagging 3 errors.

?names  ?cnt    ?ids
"NV; nV"        2       "http://purl.obolibrary.org/obo/UBERON_2005093; http://purl.obolibrary.org/obo/UBERON_0002633"
"Prelimbic area; prelimbic area"        2       "http://purl.obolibrary.org/obo/UBERON_0013560; http://purl.obolibrary.org/obo/UBERON_8440032"
"PHY; phy"      2       "http://purl.obolibrary.org/obo/UBERON_8440028; http://purl.obolibrary.org/obo/UBERON_2000438"

The query itself is here:

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT (GROUP_CONCAT(DISTINCT ?name; SEPARATOR="; ") as ?names) (COUNT(DISTINCT ?entity) AS ?cnt) (GROUP_CONCAT(DISTINCT ?entity; SEPARATOR="; ") as ?ids) WHERE {
  VALUES ?property {
    obo:IAO_0000118
    oboInOwl:hasExactSynonym
    rdfs:label
  }
  ?entity ?property ?name
  BIND(UCASE((?name)) AS ?iname)
  FILTER (!isBlank(?entity))
  FILTER NOT EXISTS { ?entity owl:deprecated true }
} GROUP BY ?iname HAVING (?cnt > 1)

vasvir avatar Sep 02 '22 10:09 vasvir

Awesome! Can you try to return the query to the required form using subquery syntax?

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?entity ?property ?value {
  {
  SELECT DISTINCT (GROUP_CONCAT(DISTINCT ?name; SEPARATOR="; ") as ?names) (COUNT(DISTINCT ?entity) AS ?cnt) (GROUP_CONCAT(DISTINCT ?entity; SEPARATOR="; ") as ?ids) WHERE {
    VALUES ?property {
      obo:IAO_0000118
      oboInOwl:hasExactSynonym
      rdfs:label
    }
    ?entity ?property ?name
    BIND(UCASE((?name)) AS ?iname)
    FILTER (!isBlank(?entity))
    FILTER NOT EXISTS { ?entity owl:deprecated true }
     # BIND(?name as ?value)
  	} GROUP BY ?iname HAVING (?cnt > 1)
  }
}

? (The above is just for illustration)

matentzn avatar Sep 02 '22 11:09 matentzn

I can try but just to be on the same page here.

The errors this query catches here span multiple ids (2 in all 3 examples) so the user needs to have both ids to figure it out.

If I (manage to) use a subquery I will essentially multiply the error. There is no way around that...

vasvir avatar Sep 02 '22 11:09 vasvir

As requested,

Here is the query:

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?entity ?property ?value WHERE {
  VALUES ?property {
    obo:IAO_0000118
    oboInOwl:hasExactSynonym
    rdfs:label
  }
  { SELECT DISTINCT ?iname (COUNT(DISTINCT ?entity) AS ?cnt) WHERE {
    VALUES ?property {
      obo:IAO_0000118
      oboInOwl:hasExactSynonym
      rdfs:label
    }
    ?entity ?property ?name
    BIND(UCASE((?name)) AS ?iname)
    FILTER (!isBlank(?entity))
    FILTER NOT EXISTS { ?entity owl:deprecated true }
    } GROUP BY ?iname HAVING (?cnt > 1)
  } .
  ?entity ?property ?value
  FILTER (!isBlank(?entity))
  FILTER NOT EXISTS { ?entity owl:deprecated true }
  FILTER (UCASE(?value) = ?iname)
} ORDER BY ?iname ?entity

and here is the output:

?entity ?property       ?value
<http://purl.obolibrary.org/obo/UBERON_0002633> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>  "nV"
<http://purl.obolibrary.org/obo/UBERON_2005093> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>  "NV"
<http://purl.obolibrary.org/obo/UBERON_2000438> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>  "phy"
<http://purl.obolibrary.org/obo/UBERON_8440028> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>  "PHY"
<http://purl.obolibrary.org/obo/UBERON_0013560> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>  "Prelimbic area"
<http://purl.obolibrary.org/obo/UBERON_8440032> <http://www.w3.org/2000/01/rdf-schema#label>    "prelimbic area"
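
The two-stage shape of the query above (aggregate first, then join back to get one row per entity) can be mirrored in plain Python. This is a sketch for illustration only: `duplicate_rows` is a hypothetical helper, and the rows use made-up `EX:` identifiers rather than real UBERON terms.

```python
from collections import defaultdict


def duplicate_rows(rows):
    """Return the (entity, property, value) rows whose upper-cased value is shared by 2+ entities."""
    # Stage 1, like the subquery: GROUP BY the upper-cased value and COUNT(DISTINCT entity).
    entities_by_key = defaultdict(set)
    for entity, _prop, value in rows:
        entities_by_key[value.upper()].add(entity)
    clashing = {key for key, entities in entities_by_key.items() if len(entities) > 1}

    # Stage 2, like the outer query: keep every original row whose key clashes,
    # ordered by key and then entity (ORDER BY ?iname ?entity).
    hits = [row for row in rows if row[2].upper() in clashing]
    return sorted(hits, key=lambda row: (row[2].upper(), row[0]))


# Hypothetical rows: EX:1 and EX:2 clash; EX:3 only repeats its own label.
ROWS = [
    ("EX:2", "exact_synonym", "NV"),
    ("EX:1", "exact_synonym", "nV"),
    ("EX:3", "label", "heart"),
    ("EX:3", "exact_synonym", "Heart"),
]

flagged = duplicate_rows(ROWS)
print(flagged)
```

Because the first stage counts distinct entities, a term whose label and synonym differ only in case is not flagged, while each clash across terms produces one output row per participating term. That is why 3 clashes appear as 6 rows in the output above.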

vasvir avatar Sep 02 '22 11:09 vasvir

Ok, @vasvir, now that you are almost there, can you also open the PR to add the check to Uberon? @anitacaron, can you advise @vasvir on how to do it (don't do it yourself!)?

matentzn avatar Sep 02 '22 12:09 matentzn

Yes, I would love to.

Two more questions...

  1. Can you tell me where the duplicate_exact_synonym query is located in the UBERON tree? For the life of me, I can't seem to find it. Is it in a Docker image outside the UBERON scope?

  2. Assuming it is merged, it will break UBERON builds with 3 (6) new errors. Shouldn't I handle them first?

vasvir avatar Sep 02 '22 12:09 vasvir

OK, of course I figured it out right after posting.

  1. SPARQL scripts are in src/sparql.
  2. There is enough machinery there to start the new check as a warning and then upgrade it to an error once no violations remain.

vasvir avatar Sep 02 '22 12:09 vasvir

Hi @vasvir! Replying to your questions above:

  1. duplicate_exact_synonym is part of the ROBOT report, and it lives in the ROBOT tool. Here's the exact SPARQL query that is used to generate this report: https://robot.obolibrary.org/report_queries/duplicate_exact_synonym. The complete list is here.
  2. The branch is only merged if all QC passes. The QC will not pass if the custom SPARQL query check returns any results, and the branch cannot be merged until the problem is solved.

I'll add instructions to the documentation on how to include a custom SPARQL query check in UBERON.

anitacaron avatar Sep 02 '22 13:09 anitacaron

Thanks for the quick reply @anitacaron

I will wait for the documentation, but it looks like I have to send PRs that handle the issues highlighted by this new SPARQL test before any merging can happen.

Hmm! Do you think I should aim to get this merged into ROBOT instead? There is even an issue about exactly that: https://github.com/ontodev/robot/issues/607 and @matentzn has been there...

vasvir avatar Sep 02 '22 13:09 vasvir

For more transparency and documentation, it would be better if you first create a PR with the new custom SPARQL check and then, after seeing that the checks do not pass, create a separate PR for each fix and link it to the PR with the SPARQL check.

anitacaron avatar Sep 02 '22 14:09 anitacaron

Here are quick instructions on how to create a violation check:

Steps to add a constraint violation check:

  1. Add the SPARQL query in src/sparql. The file's name should end with -violation.sparql. Please choose a name that helps readers understand the violation the query checks for.

  2. Add the name of the new file to the ODK configuration file src/ontology/uberon-odk.yaml:

    1. Include the file's name (without the -violation.sparql part) in the list under the custom_sparql_checks key, which is part of the robot_report key.

  3. Update the repository so the ODK will include the new SPARQL check in the QC.

sh run.sh make update_repo

Then you can create a PR, and the QC will run the new SPARQL check.
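
As a quick sanity check before opening the PR, the naming convention from the steps above can be verified with a short Python sketch. The helper `missing_check_files` and the check names used here are hypothetical, and the list of names is passed in directly rather than read from the YAML configuration file.

```python
import tempfile
from pathlib import Path

# Suffix required by the convention described in the steps above.
SUFFIX = "-violation.sparql"


def missing_check_files(check_names, sparql_dir):
    """Return the configured check names that have no matching query file in sparql_dir."""
    sparql_dir = Path(sparql_dir)
    return [name for name in check_names if not (sparql_dir / f"{name}{SUFFIX}").is_file()]


# Demonstration in a throwaway directory standing in for src/sparql.
with tempfile.TemporaryDirectory() as tmp:
    query_file = Path(tmp) / f"case-insensitive-duplicate{SUFFIX}"
    query_file.write_text("# query goes here\n")
    missing = missing_check_files(["case-insensitive-duplicate", "not-yet-written"], tmp)

print(missing)
```

Any name reported as missing is listed in the configuration without a corresponding file following the -violation.sparql naming rule.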

anitacaron avatar Sep 02 '22 15:09 anitacaron