Decide on graph walking vs EL inference for using taxon constraints to make subsets
Current strategy to make a taxon subset:
- Add axiom Thing subClassOf part-of some NCBITaxon:nnnn
- Remove any inter-species existentials:
  - homology represented as reciprocal existentials
  - inter-species edges (less relevant for Uberon)
- Ensure all TCs are EL-ified:
  - never_in becomes disjointWith in-taxon some X
  - Ensure NCBITaxon GCIs are added (in-taxon some X disjoint with in-taxon some Y for all sibs)
- Reason with Elk
- Eliminate all unsatisfiables
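The EL-ification steps above can be sketched in a few lines. This is a toy illustration, not the actual pipeline code: the tuple-based axiom representation and function name are invented for this sketch, and in practice these rewrites happen at the OWL level.

```python
def elify_taxon_constraints(never_in, taxon_siblings):
    """Translate taxon constraints into EL-friendly axioms (toy sketch).

    never_in:       list of (cls, taxon) pairs, meaning cls is never in taxon.
    taxon_siblings: list of (taxon_a, taxon_b) sibling pairs from NCBITaxon.
    """
    axioms = []
    # never_in becomes: cls DisjointWith (in-taxon some taxon)
    for cls, taxon in never_in:
        axioms.append(("DisjointClasses", cls, ("in-taxon", "some", taxon)))
    # sibling taxa get GCI-style pairwise disjointness over in-taxon fillers:
    # (in-taxon some A) DisjointWith (in-taxon some B)
    for a, b in taxon_siblings:
        axioms.append(
            ("DisjointClasses", ("in-taxon", "some", a), ("in-taxon", "some", b))
        )
    return axioms
```

After these axioms are added, an EL reasoner such as ELK can detect the unsatisfiable classes, which are then eliminated to form the subset.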
Note this has issues if we have:
- A subClassOf part-of some B
- B never-in-taxon T
See https://github.com/geneontology/go-annotation/issues/3942
@balhoff says he has a solution
Some additional issues with this approach:
- hard to mentally reason over - logic is separated across ad-hoc filters on OPs, various pre-processing steps, property chains stored centrally in GO
- Can require 10s of gigs of memory
- this gets worse the further away from human we go
- Hard to 'customize'. E.g. GO may want to include occurs_in when making their subsets
Outline alternative strategies in this ticket
Whelk Strategy
@balhoff to fill in
Relation Graph strategy
proponent: @cmungall
- Run relation-graph over combined ontology (e.g. uberon + ncbitaxon)
- no special GCIs required, no pre-processing
- no taxon property chains required (but of course other prop chains included)
- Run SPARQL queries to obtain exclusion criteria for a taxon t and property p
- EXCLUDE ?class IF: ?class ?p ?ancestor [inferred] . ?ancestor only-in-taxon ?t1 [direct] . NOT(?t subclass* ?t1) [inferred]
- EXCLUDE ?class IF: ?class ?p ?ancestor [inferred] . ?ancestor never-in-taxon ?t1 [direct] . ?t subclass* ?t1 [inferred]
- otherwise INCLUDE
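The two exclusion rules above can be sketched over pre-materialized relation-graph output. Everything here is illustrative (the data structures stand in for SPARQL queries over the relation-graph triples; class and taxon names in the usage are made up):

```python
def excluded(cls, query_taxon, ancestors_of, only_in, never_in, taxon_ancestors):
    """Return True if cls should be excluded from the subset for query_taxon.

    ancestors_of:    cls -> set of inferred ?p-ancestors (for the chosen p's)
    only_in:         class -> taxon it is directly asserted only-in-taxon for
    never_in:        class -> set of taxa it is directly asserted never-in-taxon for
    taxon_ancestors: taxon -> reflexive set of its ancestor taxa (subclass*)
    """
    # include cls itself, so directly asserted constraints are also checked
    for anc in ancestors_of.get(cls, set()) | {cls}:
        # Rule 1: an ancestor is only-in taxon t1, and the query taxon
        # is NOT a subclass* of t1 -> exclude
        t1 = only_in.get(anc)
        if t1 is not None and t1 not in taxon_ancestors[query_taxon]:
            return True
        # Rule 2: an ancestor is never-in taxon t1, and the query taxon
        # IS a subclass* of t1 -> exclude
        for t1 in never_in.get(anc, set()):
            if t1 in taxon_ancestors[query_taxon]:
                return True
    return False
```

For example, with `taxon_ancestors["human"] = {"human", "mammal", "tetrapod"}`, a class whose inferred ancestor is only-in "teleost" triggers rule 1, and a class asserted never-in "tetrapod" triggers rule 2.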
Advantages
- very simple and easy to mentally reason over, and IMO better corresponds to biologists' mental models
- customizable. For p, plug in the top-level relations that make sense, e.g. (overlaps|occurs_in|...)
- high guarantees of scalability
- we should already be running RG on our ontologies (at least for subsets of OPs)
- this is essentially what other groups are doing, e.g. interpro, ensembl (for filtering predicted annotations)
Disadvantage:
- does not generate unsats hence cannot use robot/protege explanation features. But I think this is OK if we can show the explanation for the core RG triple
> hard to mentally reason over - logic is separated across ad-hoc filters on OPs, various pre-processing steps, property chains stored centrally in GO
The property chains are in RO, right? (although I think Uberon adds some itself). Filters and pre-processing seems to apply to all approaches.
> Hard to 'customize'. E.g. GO may want to include occurs_in when making their subsets
One note on this: this is effectively included in the subset computation, because any existential to an unsatisfiable class is unsatisfiable. But it's not included in "normal" reasoning tasks checking the regular ontology classification. The subset computation is more aggressive.
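The reason the subset computation is more aggressive is that unsatisfiability propagates up through any existential restriction: if B is unsatisfiable, then any A with A subClassOf part-of some B is also unsatisfiable. A toy fixpoint illustration (all names invented for this sketch):

```python
def propagate_unsat(existential_edges, unsat):
    """Propagate unsatisfiability up existential restrictions (toy sketch).

    existential_edges: (cls, prop, filler) triples standing for axioms of the
                       form cls subClassOf prop some filler.
    unsat:             initially unsatisfiable classes.
    """
    unsat = set(unsat)
    changed = True
    while changed:
        changed = False
        for cls, _prop, filler in existential_edges:
            # an existential to an unsatisfiable class is unsatisfiable
            if filler in unsat and cls not in unsat:
                unsat.add(cls)
                changed = True
    return unsat
```

So a single taxon constraint on B knocks out the whole chain of classes with existentials leading to B, which is exactly the aggressive trimming described above.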
@cmungall - your graph strategy looks reasonable to me.
@balhoff good point on how aggressive trimming is with the current strategy.
It would be good to see a side-by-side comparison. How many additional unwanted classes make it through if we switch to the graph-based approach? Maybe we could compare on the previous Uberon release?
It looks like the process with unsatisfiables will never scale, even with ELK, but I wonder if we can get a better sense of scaling by running some tests with stripped-down OWL files as input.
I appreciate that we're resource constrained right now for running these tests (unless someone in Chris's group can take this on). I think it's something that devs in my group could work on in the new year. Can we get by for now? Does anyone have a juiced up machine or access to a cluster we could use to run the current release?
@dosumis if you want @shawntanzk, @anitacaron and me to act on this, we need some specific instructions, as it will fall to @anitacaron to take the bulk of this work, and it will occupy her for a few weeks (given she only has a few hours a week to dedicate to this project).
From the meeting I gather we need to:
- [ ] Decide who should document, and where the documentation lives
- [ ] Decide on the curation strategy (how do we get the taxon constraints into the ontologies)
- [ ] Decide on the technical strategy to materialise the logical constraints (DOSDP vs SPARQL)
- [ ] Decide on the technical strategy on extracting taxon views on the basis of these constraints
@shawntanzk I think we can handle this after all, it's just going to be a slow process. If you want, you can put it up on the board again for next week.
For the graph strategy, it is easy to explore this using the existing ubergraph instance, which includes relation-graph inferences
See this query: https://api.triplydb.com/s/8hs8rvxj3
Which is hardcoded to return classes EXCLUDED from a human view
Scroll up for an explanation of the query
Note that for demonstrative purposes, this is highly aggressive. For example, annotation shortcuts like spatially-disjoint-with are treated like any other triples in relation-graph (we should exclude the non-OWL-entailed named graph in the query). If there were homology assertions in any ontology, these would also be propagated over. However, it is trivial to exclude these, either at SPARQL time or as a post-processing step
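The post-processing variant is a one-line filter over the result rows. A minimal sketch, assuming rows are dicts keyed by the TSV header; the property labels in `SHORTCUT_PROPS` are examples only, not an authoritative list:

```python
# Relations to treat as annotation shortcuts rather than real parthood-style
# links; the members of this set are illustrative examples.
SHORTCUT_PROPS = {"spatially_disjoint_from", "homologous_to"}

def filter_exclusions(rows, prop_column="?pLabel"):
    """Drop exclusion rows derived via annotation shortcut relations."""
    return [r for r in rows if r[prop_column] not in SHORTCUT_PROPS]
```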
I introduced some steps to run RG-based taxon checks into the Makefile here #2160
This is NOT yet part of any build dependency. It is also not fully tested. It relies on certain assumptions, e.g. that TCs in external ontologies use RO:0002161 triples for never-in-taxon.
Results of running on uberon-edit from Nov 1 here:
https://s3.amazonaws.com/bbop-ontologies/uberon/tmp/class-taxon-exclusions.tsv.gz
URL not guaranteed stable, this is for testing
The query hardcodes Human and Mammal. The TSV lists classes that are excluded for a given taxon, together with the reason.
Apologies for the duplication due to different labels of OPs:
| ?c | ?cLabel | ?p | ?pLabel | ?clsWithConstraint | ?clsWithConstraintLabel | ?taxonWithConstraint | ?taxonWithConstraintLabel | ?queryTaxon |
|---|---|---|---|---|---|---|---|---|
| http://purl.obolibrary.org/obo/UBERON_0003221 | "phalanx" | http://purl.obolibrary.org/obo/RO_0002202 | "develops_from" | http://purl.obolibrary.org/obo/UBERON_2001544 | "sublingual cartilage" | http://purl.obolibrary.org/obo/NCBITaxon_40674 | "Mammalia" | http://purl.obolibrary.org/obo/NCBITaxon_9606 |
| http://purl.obolibrary.org/obo/UBERON_0003221 | "phalanx" | http://purl.obolibrary.org/obo/RO_0002202 | "develops from"@en | http://purl.obolibrary.org/obo/UBERON_2001544 | "sublingual cartilage" | http://purl.obolibrary.org/obo/NCBITaxon_40674 | "Mammalia" | http://purl.obolibrary.org/obo/NCBITaxon_9606 |
| http://purl.obolibrary.org/obo/UBERON_0003221 | "phalanx" | http://purl.obolibrary.org/obo/RO_0002202 | "develops from" | http://purl.obolibrary.org/obo/UBERON_2001544 | "sublingual cartilage" | http://purl.obolibrary.org/obo/NCBITaxon_40674 | "Mammalia" | http://purl.obolibrary.org/obo/NCBITaxon_9606 |
This is obviously a false positive, caused by #2159 (if this is fixed, then 16642 lines will disappear from the file)
Note: would be great to have robot output saner TSVs, can anyone work on: https://github.com/ontodev/robot/issues/176
Others are as expected.
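Until robot produces saner TSVs, the label-only duplicates above can be collapsed with a small dedup pass. A sketch, assuming rows are dicts keyed by the TSV header above; the choice of key columns is mine:

```python
# IRI-valued columns from the TSV header; rows agreeing on all of these
# differ only in label columns and are considered duplicates.
KEY_COLS = ["?c", "?p", "?clsWithConstraint", "?taxonWithConstraint", "?queryTaxon"]

def dedupe(rows):
    """Keep the first row for each combination of IRI-valued key columns."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in KEY_COLS)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```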
@dosumis - could you provide some guidance for how the tech team can proceed with this? thanks
This issue has not seen any activity in the past 6 months; it will be closed automatically in one year from now if no action is taken.
Should be reconsidered eventually
Since last year, this has been a low priority. If it should have a higher priority, please give some action items.
I think this is covered by the new subset command; we should double-check.