Decide on graph walking vs EL inference for using taxon constraints to make subsets
Current strategy to make a taxon subset:
- Add axiom Thing subClassOf part-of some NCBITaxon:nnnn
- Remove any inter-species existentials:
  - homology represented as reciprocal existentials
  - inter-species edges (less relevant for Uberon)
- Ensure all TCs are EL-ified:
  - never_in becomes disjointWith in-taxon some X
  - Ensure NCBITaxon GCIs are added (in-taxon some X disjoint with in-taxon some Y for all sibs)
- Reason with Elk
- Eliminate all unsatisfiables
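The EL-ification steps above can be sketched in a few lines. This is a toy illustration, not the actual pipeline code: the tuple-based axiom representation and function name are invented for this sketch, and in practice these rewrites happen at the OWL level.

```python
def elify_taxon_constraints(never_in, taxon_siblings):
    """Translate taxon constraints into EL-friendly axioms (toy sketch).

    never_in:       list of (cls, taxon) pairs, meaning cls is never in taxon.
    taxon_siblings: list of (taxon_a, taxon_b) sibling pairs from NCBITaxon.
    """
    axioms = []
    # never_in becomes: cls DisjointWith (in-taxon some taxon)
    for cls, taxon in never_in:
        axioms.append(("DisjointClasses", cls, ("in-taxon", "some", taxon)))
    # sibling taxa get GCI-style pairwise disjointness over in-taxon fillers:
    # (in-taxon some A) DisjointWith (in-taxon some B)
    for a, b in taxon_siblings:
        axioms.append(
            ("DisjointClasses", ("in-taxon", "some", a), ("in-taxon", "some", b))
        )
    return axioms
```

After these axioms are added, an EL reasoner such as ELK can detect the unsatisfiable classes, which are then eliminated to form the subset.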
Note this has issues if we have:
- A subClassOf part-of some B
- B never-in-taxon T
See https://github.com/geneontology/go-annotation/issues/3942
@balhoff says he has a solution
Some additional issues with this approach:
- hard to mentally reason over - logic is separated across ad-hoc filters on OPs, various pre-processing steps, property chains stored centrally in GO
- Can require 10s of gigs of memory
- this gets worse the further away from human we go
- Hard to 'customize'. E.g. GO may want to include occurs_in when making their subsets
Outline alternative strategies in this ticket
Whelk Strategy
@balhoff to fill in
Relation Graph strategy
proponent: @cmungall
- Run relation-graph over combined ontology (e.g. uberon + ncbitaxon)
- no special GCIs required, no pre-processing
- no taxon property chains required (but of course other prop chains included)
- Run SPARQL queries to obtain exclusion criteria for a taxon t and property p
- EXCLUDE ?class IF: ?class ?p ?ancestor [inferred] . ?ancestor only-in-taxon ?t1 [direct] . NOT(?t subclass* ?t1) [inferred]
- EXCLUDE ?class IF: ?class ?p ?ancestor [inferred] . ?ancestor never-in-taxon ?t1 [direct] . ?t subclass* ?t1 [inferred]
- otherwise INCLUDE
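The two exclusion rules above can be sketched over pre-materialized relation-graph output. Everything here is illustrative (the data structures stand in for SPARQL queries over the relation-graph triples; class and taxon names in the usage are made up):

```python
def excluded(cls, query_taxon, ancestors_of, only_in, never_in, taxon_ancestors):
    """Return True if cls should be excluded from the subset for query_taxon.

    ancestors_of:    cls -> set of inferred ?p-ancestors (for the chosen p's)
    only_in:         class -> taxon it is directly asserted only-in-taxon for
    never_in:        class -> set of taxa it is directly asserted never-in-taxon for
    taxon_ancestors: taxon -> reflexive set of its ancestor taxa (subclass*)
    """
    # include cls itself, so directly asserted constraints are also checked
    for anc in ancestors_of.get(cls, set()) | {cls}:
        # Rule 1: an ancestor is only-in taxon t1, and the query taxon
        # is NOT a subclass* of t1 -> exclude
        t1 = only_in.get(anc)
        if t1 is not None and t1 not in taxon_ancestors[query_taxon]:
            return True
        # Rule 2: an ancestor is never-in taxon t1, and the query taxon
        # IS a subclass* of t1 -> exclude
        for t1 in never_in.get(anc, set()):
            if t1 in taxon_ancestors[query_taxon]:
                return True
    return False
```

For example, with `taxon_ancestors["human"] = {"human", "mammal", "tetrapod"}`, a class whose inferred ancestor is only-in "teleost" triggers rule 1, and a class asserted never-in "tetrapod" triggers rule 2.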
Advantages
- very simple and easy to mentally reason over, and IMO better corresponds to biologists' mental models
- customizable. For p, plug in the top-level relations that make sense, e.g. (overlaps|occurs_in|...)
- high guarantees of scalability
- we should already be running RG on our ontologies (at least for subsets of OPs)
- this is essentially what other groups are doing, e.g. interpro, ensembl (for filtering predicted annotations)
Disadvantage:
- does not generate unsats hence cannot use robot/protege explanation features. But I think this is OK if we can show the explanation for the core RG triple
> hard to mentally reason over - logic is separated across ad-hoc filters on OPs, various pre-processing steps, property chains stored centrally in GO
The property chains are in RO, right? (although I think Uberon adds some itself). Filters and pre-processing seems to apply to all approaches.
> Hard to 'customize'. E.g. GO may want to include occurs_in when making their subsets
One note on this: this is effectively included in the subset computation, because any existential to an unsatisfiable class is unsatisfiable. But it's not included in "normal" reasoning tasks checking the regular ontology classification. The subset computation is more aggressive.
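The reason the subset computation is more aggressive is that unsatisfiability propagates up through any existential restriction: if B is unsatisfiable, then any A with A subClassOf part-of some B is also unsatisfiable. A toy fixpoint illustration (all names invented for this sketch):

```python
def propagate_unsat(existential_edges, unsat):
    """Propagate unsatisfiability up existential restrictions (toy sketch).

    existential_edges: (cls, prop, filler) triples standing for axioms of the
                       form cls subClassOf prop some filler.
    unsat:             initially unsatisfiable classes.
    """
    unsat = set(unsat)
    changed = True
    while changed:
        changed = False
        for cls, _prop, filler in existential_edges:
            # an existential to an unsatisfiable class is unsatisfiable
            if filler in unsat and cls not in unsat:
                unsat.add(cls)
                changed = True
    return unsat
```

So a single taxon constraint on B knocks out the whole chain of classes with existentials leading to B, which is exactly the aggressive trimming described above.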
@cmungall - your graph strategy looks reasonable to me.
@balhoff good point on how aggressive trimming is with the current strategy.
It would be good to see a side-by-side comparison. How many additional unwanted classes make it through if we switch to the graph-based approach? Maybe we could compare on the previous Uberon release?
It looks like the process with unsatisfiables will never scale, even with ELK, but I wonder if we can get a better sense of scaling by running some tests with stripped-down OWL files as input.
I appreciate that we're resource constrained right now for running these tests (unless someone in Chris's group can take this on). I think it's something that devs in my group could work on in the new year. Can we get by for now? Does anyone have a juiced up machine or access to a cluster we could use to run the current release?
@dosumis if you want @shawntanzk, @anitacaron and me to act on this, we need some specific instructions, as it will fall to @anitacaron to take the bulk of this work, and it will occupy her for a few weeks (given she only has a few hours a week to dedicate to this project).
From the meeting I gather we need to:
- [ ] Decide who should document, and where the documentation lives
- [ ] Decide on the curation strategy (how do we get the taxon constraints into the ontologies)
- [ ] Decide on the technical strategy to materialise the logical constraints (DOSDP vs SPARQL)
- [ ] Decide on the technical strategy on extracting taxon views on the basis of these constraints
@shawntanzk I think we can handle this after all, it's just going to be a slow process. If you want, you can put it up on the board again for next week.
For the graph strategy, it is easy to explore this using the existing ubergraph instance, which includes relation-graph inferences
See this query: https://api.triplydb.com/s/8hs8rvxj3
Which is hardcoded to return classes EXCLUDED from a human view
Scroll up for an explanation of the query
Note that for demonstrative purposes, this is highly aggressive. For example, annotation shortcuts like spatially-disjoint-with are treated like any other triples in relation-graph (we should exclude the non-OWL-entailed named graph in the query). If there were homology assertions in any ontology, these would also be propagated over. However, it is trivial to exclude these, either at SPARQL time or as a post-processing step
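The post-processing variant is a one-line filter over the result rows. A minimal sketch, assuming rows are dicts keyed by the TSV header; the property labels in `SHORTCUT_PROPS` are examples only, not an authoritative list:

```python
# Relations to treat as annotation shortcuts rather than real parthood-style
# links; the members of this set are illustrative examples.
SHORTCUT_PROPS = {"spatially_disjoint_from", "homologous_to"}

def filter_exclusions(rows, prop_column="?pLabel"):
    """Drop exclusion rows derived via annotation shortcut relations."""
    return [r for r in rows if r[prop_column] not in SHORTCUT_PROPS]
```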
I introduced some steps to run RG-based taxon checks into the Makefile here #2160
This is NOT yet part of any build dependency. It is also not fully tested. It relies on certain assumptions, e.g. that TCs in external ontologies use RO:0002161 triples for never-in-taxon.
Results of running on uberon-edit from Nov 1 here:
https://s3.amazonaws.com/bbop-ontologies/uberon/tmp/class-taxon-exclusions.tsv.gz
URL not guaranteed stable, this is for testing
The query hardcodes Human and Mammal. The TSV lists classes that are excluded for a given taxon, together with the reason.
Apologies for the duplication due to different labels of OPs:
| ?c | ?cLabel | ?p | ?pLabel | ?clsWithConstraint | ?clsWithConstraintLabel | ?taxonWithConstraint | ?taxonWithConstraintLabel | ?queryTaxon |
|---|---|---|---|---|---|---|---|---|
| http://purl.obolibrary.org/obo/UBERON_0003221 | "phalanx" | http://purl.obolibrary.org/obo/RO_0002202 | "develops_from" | http://purl.obolibrary.org/obo/UBERON_2001544 | "sublingual cartilage" | http://purl.obolibrary.org/obo/NCBITaxon_40674 | "Mammalia" | http://purl.obolibrary.org/obo/NCBITaxon_9606 |
| http://purl.obolibrary.org/obo/UBERON_0003221 | "phalanx" | http://purl.obolibrary.org/obo/RO_0002202 | "develops from"@en | http://purl.obolibrary.org/obo/UBERON_2001544 | "sublingual cartilage" | http://purl.obolibrary.org/obo/NCBITaxon_40674 | "Mammalia" | http://purl.obolibrary.org/obo/NCBITaxon_9606 |
| http://purl.obolibrary.org/obo/UBERON_0003221 | "phalanx" | http://purl.obolibrary.org/obo/RO_0002202 | "develops from" | http://purl.obolibrary.org/obo/UBERON_2001544 | "sublingual cartilage" | http://purl.obolibrary.org/obo/NCBITaxon_40674 | "Mammalia" | http://purl.obolibrary.org/obo/NCBITaxon_9606 |
This is obviously a false positive, caused by #2159 (if this is fixed, then 16642 lines will disappear from the file)
Note: would be great to have robot output saner TSVs, can anyone work on: https://github.com/ontodev/robot/issues/176
Others are as expected.
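Until robot produces saner TSVs, the label-only duplicates above can be collapsed with a small dedup pass. A sketch, assuming rows are dicts keyed by the TSV header above; the choice of key columns is mine:

```python
# IRI-valued columns from the TSV header; rows agreeing on all of these
# differ only in label columns and are considered duplicates.
KEY_COLS = ["?c", "?p", "?clsWithConstraint", "?taxonWithConstraint", "?queryTaxon"]

def dedupe(rows):
    """Keep the first row for each combination of IRI-valued key columns."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in KEY_COLS)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```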
@dosumis - could you provide some guidance for how the tech team can proceed with this? thanks
This issue has not seen any activity in the past 6 months; it will be closed automatically in one year from now if no action is taken.
Should be reconsidered eventually
Since last year, this has been a low priority. If it should have a higher priority, please give some action items.
I think this is covered by the new subset command; we should double-check.