UBERON polysemy issues
Hi,
I was trying to use UBERON (uberon-simple.obo from https://github.com/obophenotype/uberon/releases/download/v2022-04-05/uberon-simple.obo) for text mining when I noticed the following polysemy issues.
The following terms are assigned to more than one UBERON ID when only the term name and the EXACT synonym tags are considered.
Uberon terms
| Term | Number of IDs | UBERON IDs |
| --- | --- | --- |
| Gasserian ganglion | 2 | UBERON:0001675, UBERON:3011045 |
| nV | 2 | UBERON:0002633, UBERON:2005093 |
| temporal pole | 2 | UBERON:0002576, UBERON:0006479 |
| trochlea of femur | 2 | UBERON:4200096, UBERON:7500077 |
Bug description
How should I proceed with resolving these polysemy issues? Should I create an individual issue/PR for each term?
Apologies if this is not the correct place to raise such an issue. I looked for a mailing list or a forum but didn't find any. Any pointers in the right direction are much appreciated.
Thanks in advance.
Vassilis
Thank you @vasvir!
How comfortable are you editing text files?
If you make a pull request in this file:
https://github.com/obophenotype/uberon/blob/master/src/ontology/uberon-edit.obo
you can change the synonyms from exact to related there by search and replace. Tag me in the PR!
I will assign the ticket to you @vasvir (please let us know if you need help etc.) as I'm trying to make sure tickets are assigned in Uberon :) Thanks
Hi @matentzn and everybody else. I would like your advice if possible.
I have created (retrofitted, actually) a Perl script that checks an OBO ontology for polysemy issues. Here is how the output looks:
$ src/scripts/obo_check_polysemy.pl src/ontology/uberon-edit.obo
Set synonym_level: EXACT case sensitivity: 0 for file: src/ontology/uberon-edit.obo
2 ids for name: 'nv' ids: UBERON:0002633, UBERON:2005093,
2 ids for name: 'phy' ids: UBERON:2000438, UBERON:8440028,
2 ids for name: 'prelimbic area' ids: UBERON:0013560, UBERON:8440032,
Problems found: 3
$ src/scripts//obo_check_polysemy.pl -h
Usage: /home/bill/workspace/processing_scripts/obo_check_polysemy.pl [-l|--synonym-level {EXACT|NARROW|RELATED|BROAD}] [-s|case-sensitive] ontology-file.obo
I wonder if you would consider making this part of the project's CI (provided I fix the existing issues). If there is interest in this, I will of course create the necessary PR (with some much appreciated guidance).
I am totally flexible on the license.
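For readers who want to see the idea behind the check without the Perl script, here is a minimal Python sketch. It is not the obo_check_polysemy.pl script itself: the function names `parse_terms` and `find_polysemy`, the simplified line-based OBO parsing, and the `EX:` identifiers in the sample are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as: synonym: "nV" EXACT [xref]
SYNONYM_RE = re.compile(r'^synonym:\s*"((?:[^"\\]|\\.)*)"\s+(\w+)')

def parse_terms(obo_text, synonym_level):
    """Return (id, names) for every non-obsolete [Term] stanza (simplified parser)."""
    terms, current = [], None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):  # a new stanza begins
            current = None
            if line == "[Term]":
                current = {"id": None, "names": [], "obsolete": False}
                terms.append(current)
        elif current is None:
            continue  # header lines or non-Term stanzas
        elif line.startswith("id:"):
            current["id"] = line[3:].strip()
        elif line.startswith("name:"):
            current["names"].append(line[5:].strip())
        elif line.startswith("is_obsolete:"):
            current["obsolete"] = line[12:].strip() == "true"
        else:
            m = SYNONYM_RE.match(line)
            if m and m.group(2) == synonym_level:
                current["names"].append(m.group(1))
    return [(t["id"], t["names"]) for t in terms if t["id"] and not t["obsolete"]]

def find_polysemy(obo_text, synonym_level="EXACT", case_sensitive=False):
    """Map each name or synonym that is used by more than one term id to those ids."""
    ids_by_name = defaultdict(set)
    for term_id, names in parse_terms(obo_text, synonym_level):
        for name in names:
            key = name if case_sensitive else name.lower()
            ids_by_name[key].add(term_id)
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}

# Hypothetical two-term ontology; the ids and names are made up.
SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: example structure A
synonym: "nV" EXACT []

[Term]
id: EX:0000002
name: example structure B
synonym: "NV" EXACT []
"""

if __name__ == "__main__":
    for name, ids in find_polysemy(SAMPLE).items():
        print(f"{len(ids)} ids for name: '{name}' ids: {', '.join(ids)}")
```

With the default case-insensitive setting the two sample terms clash on "nv"; with `case_sensitive=True` they do not, because "nV" and "NV" are treated as different strings.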
@vasvir, thank you for working on this.
There are two things to consider here:
- We are trying very hard to bury Perl as a dependency. This is not a problem, however, because we can help you migrate your test to SPARQL instead, once we understand it.
- I don't yet understand what polysemy you are checking. We have a ton of QC against it, like https://robot.obolibrary.org/report_queries/duplicate_exact_synonym. Can you explain, for example, the three errors you found?
I am not familiar with SPARQL (that will not be a problem though) but the link you sent looks like exactly what I am trying to accomplish (maybe in a case insensitive way).
Is there a way to run this test locally given I have a current git checkout of the project?
Thanks for the quick reply.
sh run.sh make robot_reports
If you have docker installed!
Thanks a lot @matentzn That covers a lot of ground.
I managed to run it and everything comes out nicely. The existing checks cover 99% of the cases. Here are my 2 remaining issues.
- The SPARQL tests do not check for a clash between the name of one ID and an EXACT synonym of another ID (only synonym vs. synonym). Such a case was handled by PR #2470, where the label of UBERON:0002576 (temporal pole) was an exact synonym of UBERON:0006479.
- The tests are not case insensitive, so nV != NV. I understand this is a much harder sell, but consider that a) for applications like text mining a case-insensitive search is often preferred and b) a case-insensitive clash often reveals a hasty abbreviation.
So if there is any interest I am prepared to give them a try.
- Can you review this check and see if you see anything wrong with it? https://robot.obolibrary.org/report_queries/duplicate_label_synonym I hoped that check would cover it!
- This is true. Good point. It would require us, though, to duplicate all checks to take casing into account. I would like to know the impact of that, and whether we can at least reduce it to a single query (maybe a modification of https://robot.obolibrary.org/report_queries/duplicate_label_synonym). Here is a potential query to test:
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?entity ?property ?value WHERE {
VALUES ?property {
obo:IAO_0000118
oboInOwl:hasExactSynonym
rdfs:label
}
FILTER NOT EXISTS { ?entity owl:deprecated true }
FILTER NOT EXISTS { ?entity2 owl:deprecated true }
?entity rdfs:subClassOf <http://purl.obolibrary.org/obo/UBERON_0001062> .
?entity2 rdfs:subClassOf <http://purl.obolibrary.org/obo/UBERON_0001062> .
?entity ?property ?value .
?entity2 ?property ?value2 .
FILTER (!isBlank(?entity))
FILTER (!isBlank(?entity2))
FILTER(lcase(str(?value)) = lcase(str(?value2)))
}
ORDER BY ?entity
LIMIT 1
to test:
sh run.sh robot query -i uberon-edit.obo --query query.sparql output.tsv
This is fun! Thanks a lot.
- duplicate_label_synonym sort of works. It issues a warning, which is correct if the label and synonym are in the same term (ID); in that case the duplication is not an error. However, when a label equals a synonym across different terms (IDs), it should throw an error rather than a warning as it currently does, IMHO.
- duplicate_label_synonym may be a good place to start, but I had my mind on duplicate_exact_synonym, which also needs to be case insensitive. Furthermore, it features FILTER (?entity != ?entity2), which strikes me as a good condition.
But I tested your variant of duplicate_label_synonym. I don't understand the subClassOf lines. If I remove them, it takes forever. However, I removed the LIMIT and got these results, which are not correct IMO.
?entity ?property ?value
<http://purl.obolibrary.org/obo/UBERON_0000465> <http://www.w3.org/2000/01/rdf-schema#label> "material anatomical entity"
<http://purl.obolibrary.org/obo/UBERON_0000466> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> "immaterial physical anatomical entity"
<http://purl.obolibrary.org/obo/UBERON_0000466> <http://www.w3.org/2000/01/rdf-schema#label> "immaterial anatomical entity"
<http://purl.obolibrary.org/obo/UBERON_0000477> <http://www.w3.org/2000/01/rdf-schema#label> "anatomical cluster"
<http://purl.obolibrary.org/obo/UBERON_6004520> <http://www.w3.org/2000/01/rdf-schema#label> "insect mouthpart"
I think this query returns everything because it checks e1.label = e2.label or e1.synonym = e2.synonym even when e1 = e2, which obviously holds, and the subClassOf lines are a way to limit the output for testing purposes.
Let me educate myself a bit in SPARQL and I will be back with more queries (pun semi-intended :smile:).
Hahah have fun @vasvir!!
ERROR levels are configurable. In my QC police view of the world, all checks should be set to ERROR. And Exceptions hard coded into the sparql query!
Here is how we handle "exceptions to the rule" in Mondo: https://github.com/monarch-initiative/mondo/blob/master/src/sparql/qc/mondo/qc-excluded-subsumption-is-inferred.sparql#L15
That way you can set everything to ERROR, which means QC will fail, but still have the flexibility to grant valid exceptions.
BTW: https://oboacademy.github.io/obook/tutorial/sparql/
Thanks for the pointers @matentzn
I tried the naive approach with an (implicit) self-join, which worked for my minimal test case but ran forever on the full UBERON.
I changed strategy and used GROUP BY and COUNT DISTINCT and that gave the query planner a chance to optimize the query reasonably.
The query below has two enhancements over the existing duplicate_exact_synonym.
- Checks also names (not only synonyms). All combinations are covered (name-name, name-synonym, synonym-synonym)
- It works in a case insensitive manner
One possible issue is that the output is not in the same form as that of the previous query.
The output matches that of my original Perl script, flagging 3 errors as it should.
?names ?cnt ?ids
"NV; nV" 2 "http://purl.obolibrary.org/obo/UBERON_2005093; http://purl.obolibrary.org/obo/UBERON_0002633"
"Prelimbic area; prelimbic area" 2 "http://purl.obolibrary.org/obo/UBERON_0013560; http://purl.obolibrary.org/obo/UBERON_8440032"
"PHY; phy" 2 "http://purl.obolibrary.org/obo/UBERON_8440028; http://purl.obolibrary.org/obo/UBERON_2000438"
The query itself is here:
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT (GROUP_CONCAT(DISTINCT ?name; SEPARATOR="; ") as ?names) (COUNT(DISTINCT ?entity) AS ?cnt) (GROUP_CONCAT(DISTINCT ?entity; SEPARATOR="; ") as ?ids) WHERE {
VALUES ?property {
obo:IAO_0000118
oboInOwl:hasExactSynonym
rdfs:label
}
?entity ?property ?name
BIND(UCASE((?name)) AS ?iname)
FILTER (!isBlank(?entity))
FILTER NOT EXISTS { ?entity owl:deprecated true }
} GROUP BY ?iname HAVING (?cnt > 1)
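The difference between the two strategies can be illustrated outside SPARQL. The Python sketch below is only an analogy and does not model how a SPARQL engine plans either query; the `ROWS` data, the `EX:` identifiers, and both function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (entity, name) pairs standing in for the ?entity / ?name bindings.
ROWS = [
    ("EX:1", "nV"),
    ("EX:2", "NV"),
    ("EX:3", "prelimbic area"),
    ("EX:3", "Prelimbic Area"),  # same entity twice, so this is not a clash
    ("EX:4", "temporal pole"),
]

def clashes_self_join(rows):
    """Naive approach: compare every pair of rows (quadratic in the number of rows)."""
    found = defaultdict(set)
    for (e1, n1), (e2, n2) in combinations(rows, 2):
        if e1 != e2 and n1.upper() == n2.upper():
            found[n1.upper()].update((e1, e2))
    return {key: sorted(ents) for key, ents in found.items()}

def clashes_group_by(rows):
    """Grouped approach: one pass over the rows, then keep keys with more than one entity."""
    groups = defaultdict(set)
    for entity, name in rows:
        groups[name.upper()].add(entity)
    return {key: sorted(ents) for key, ents in groups.items() if len(ents) > 1}

if __name__ == "__main__":
    print(clashes_self_join(ROWS))
    print(clashes_group_by(ROWS))
```

Both functions report the same clash, but the grouped version touches each row once, which is the same intuition as replacing the self-join with GROUP BY and COUNT(DISTINCT ?entity).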
Awesome! Can you try to return the query to the required form using subquery syntax?
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?property ?value {
{
SELECT DISTINCT (GROUP_CONCAT(DISTINCT ?name; SEPARATOR="; ") as ?names) (COUNT(DISTINCT ?entity) AS ?cnt) (GROUP_CONCAT(DISTINCT ?entity; SEPARATOR="; ") as ?ids) WHERE {
VALUES ?property {
obo:IAO_0000118
oboInOwl:hasExactSynonym
rdfs:label
}
?entity ?property ?name
BIND(UCASE((?name)) AS ?iname)
FILTER (!isBlank(?entity))
FILTER NOT EXISTS { ?entity owl:deprecated true }
# BIND(?name as ?value)
} GROUP BY ?iname HAVING (?cnt > 1)
}
}
? (The above is just for illustration)
I can try but just to be on the same page here.
The errors this query catches here span multiple ids (2 in all 3 examples) so the user needs to have both ids to figure it out.
If I (manage to) use a subquery I will essentially multiply the error. There is no way around that...
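To make the "multiplied" output concrete, here is a small Python sketch of turning grouped clashes back into one row per entity, property, and value. The `TRIPLES` data, the `EX:` identifiers, and the function name `violation_rows` are hypothetical; this mirrors the idea only, not any particular SPARQL engine.

```python
from collections import defaultdict

# Hypothetical (entity, property, value) triples.
TRIPLES = [
    ("EX:1", "hasExactSynonym", "nV"),
    ("EX:2", "hasExactSynonym", "NV"),
    ("EX:3", "label", "temporal pole"),
]

def violation_rows(triples):
    """Return one row per triple whose upper-cased value is shared by several entities."""
    entities_by_key = defaultdict(set)
    for entity, _prop, value in triples:
        entities_by_key[value.upper()].add(entity)
    clashing = {key for key, ents in entities_by_key.items() if len(ents) > 1}
    rows = [t for t in triples if t[2].upper() in clashing]
    # Sort by the shared key first so rows of the same clash stay adjacent.
    return sorted(rows, key=lambda t: (t[2].upper(), t[0]))

if __name__ == "__main__":
    for row in violation_rows(TRIPLES):
        print("\t".join(row))
```

A single clash between two entities yields two rows here, so a reader has to look at adjacent rows to see which entities collide.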
As requested,
Here is the query:
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?property ?value WHERE {
VALUES ?property {
obo:IAO_0000118
oboInOwl:hasExactSynonym
rdfs:label
}
{ SELECT DISTINCT ?iname (COUNT(DISTINCT ?entity) AS ?cnt) WHERE {
VALUES ?property {
obo:IAO_0000118
oboInOwl:hasExactSynonym
rdfs:label
}
?entity ?property ?name
BIND(UCASE((?name)) AS ?iname)
FILTER (!isBlank(?entity))
FILTER NOT EXISTS { ?entity owl:deprecated true }
} GROUP BY ?iname HAVING (?cnt > 1)
} .
?entity ?property ?value
FILTER (!isBlank(?entity))
FILTER NOT EXISTS { ?entity owl:deprecated true }
FILTER (UCASE(?value) = ?iname)
} ORDER BY ?iname ?entity
and here is the output:
?entity ?property ?value
<http://purl.obolibrary.org/obo/UBERON_0002633> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> "nV"
<http://purl.obolibrary.org/obo/UBERON_2005093> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> "NV"
<http://purl.obolibrary.org/obo/UBERON_2000438> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> "phy"
<http://purl.obolibrary.org/obo/UBERON_8440028> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> "PHY"
<http://purl.obolibrary.org/obo/UBERON_0013560> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> "Prelimbic area"
<http://purl.obolibrary.org/obo/UBERON_8440032> <http://www.w3.org/2000/01/rdf-schema#label> "prelimbic area"
OK @vasvir, now that you are almost there, can you also do the PR to add the check to Uberon? @anitacaron, can you advise @vasvir on how to do it (don't do it yourself!)
yea I would love to.
Two more questions...
- Can you tell me where the duplicate_exact_synonym query is located in the UBERON tree? For the life of me I can't seem to find it. Is it in a Docker image outside the UBERON scope?
- Assuming it is merged, it will break UBERON builds with 3 (6) new errors. Shouldn't I handle them first?
ok of course I figured it out after posting.
- SPARQL scripts are in src/sparql
- and there is enough machinery there to start the new check as a warning and then upgrade it to an error once there are no errors.
Hi @vasvir! Replying to your questions above:
- duplicate_exact_synonym is part of the ROBOT report, and it lives in the ROBOT tool. Here's the exact SPARQL query that is used to generate this report: https://robot.obolibrary.org/report_queries/duplicate_exact_synonym. The complete list is here.
- The branch is only merged if all QC passes. The QC will not pass if the custom SPARQL query check returns some value, and the branch cannot be merged until the problem is solved.
I'll write down in the documentation the instructions to include a custom SPARQL query check to UBERON.
Thanks for the quick reply @anitacaron
I will wait for the documentation, but it looks like I have to send PRs that handle the issues highlighted by this new SPARQL test before any merging can happen.
Hmm! Do you think I should strive for merging to robot instead? There is even an issue that is exactly that: https://github.com/ontodev/robot/issues/607 and @matentzn has been there...
For more transparency and documentation, it would be better if you first create a PR with the new custom SPARQL check and then, after seeing the checks are not passing, create each PR to fix them, and link to the PR with the SPARQL check.
Here are quick instructions on how to create a violation check:
Steps to add a constraint violation check:
- Add the SPARQL query in src/sparql. The file's name should end with -violation.sparql. Please give it a name that helps understand the violation the query wants to check.
- Add the name of the new file to the ODK configuration file src/ontology/uberon-odk.yaml:
  - Include the file's name (without the -violation.sparql part) in the list inside the custom_sparql_checks key, part of the robot_report key.
- Update the repository so the ODK will include the new SPARQL check in the QC.
sh run.sh make update_repo
Then you can create a PR, and the QC will run the new SPARQL check.