ontobio
Perform inference during association parsing
While parsing GAFs etc., ontobio will construct expressions for each line and send them to an inference engine to determine whether the annotations (a) are taxonomically invalid or otherwise logically incoherent (rule 13), or (b) can be inferred to a more specific term (rule 25).
This may be via owlery, by sending a file to a command-line wrapper, or by calling owlapi via jpype (not favored).
The expressions will be tuples of the form:
(TaxonID GP2TermRelation TermID ExtensionExpression)
The OWL inference engine will turn this into
GP2TermRelation SOME (TermID AND ExtensionExpression)
and test for direct inferred class expressions (rule 25). It will also test for incoherency (e.g. the extension expression is wrong).
It will return a list of rel-term tuples.
It will then test
(TermID AND in_taxon SOME TaxonID)
If this is unsatisfiable then rule 13 fails.
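A minimal sketch of how the two expressions above could be assembled from one tuple, written as Manchester-style strings. The names `AnnotationTuple`, `inference_expression` and `taxon_expression` are hypothetical and not part of ontobio; the sketch also assumes the extension has already been rendered as a class expression string.

```python
from typing import NamedTuple, Optional


class AnnotationTuple(NamedTuple):
    """Hypothetical container for (TaxonID, GP2TermRelation, TermID, ExtensionExpression)."""
    taxon_id: str
    relation: str
    term_id: str
    extension: Optional[str] = None  # assumed to be a ready-made class expression


def inference_expression(a: AnnotationTuple) -> str:
    """Expression handed to the reasoner for the rule 25 check."""
    if a.extension:
        filler = f"({a.term_id} and ({a.extension}))"
    else:
        filler = a.term_id
    return f"{a.relation} some {filler}"


def taxon_expression(a: AnnotationTuple) -> str:
    """Expression tested for satisfiability for the rule 13 check."""
    return f"({a.term_id} and in_taxon some {a.taxon_id})"


# Illustrative identifiers only
ann = AnnotationTuple("NCBITaxon:9606", "involved_in", "FOO:1", "occurs_in some FOO:2")
rule25_query = inference_expression(ann)
rule13_query = taxon_expression(ann)
```

How these strings reach the reasoner (owlery, a command-line wrapper, or something else) is left open here, as in the description above.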
ontobio will be in charge of collecting the results and turning them into a JSON structure.
Alternate architecture/workflow: a standalone Scala tool that does this over a GAF.
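The JSON structure is not specified in this ticket. As one possible shape, purely as an assumption, each line could yield a record with the rule 13 outcome and the list of rel-term tuples returned for rule 25. The function name `result_record` and all field names below are hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple


def result_record(line_number: int,
                  rule13_ok: bool,
                  inferred: List[Tuple[str, str]]) -> Dict[str, Any]:
    """Build one per-line record; the field names are illustrative, not an agreed schema."""
    return {
        "line": line_number,
        "rule13": {"passed": rule13_ok},
        "rule25": {
            "inferred": [{"relation": rel, "term": term} for rel, term in inferred]
        },
    }


# Placeholder results for two GAF lines
records = [
    result_record(1, True, [("involved_in", "FOO:3")]),
    result_record(2, False, []),
]
report = json.dumps(records, indent=2)
```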
The OWLTools code for rule25 is here: https://github.com/owlcollab/owltools/blob/41e1f585d7a8135081c3f2e56c4aa6590901227b/OWLTools-Annotation/src/main/java/owltools/gaf/inference/FoldBasedPredictor.java#L177
While we could continue to use this, it is awkward and there is poor separation of concerns.
For a GAF line with two taxa in column 13, which taxon should be used in the taxon restriction?
Always the first
There may be interesting things to do with the second, but let's leave that for now.
I think we sort every list field, perhaps destroying the order of the taxa...
Clarification on Eric's point: we don't do this for taxon IDs:
https://github.com/biolink/ontobio/blob/fb622ba8ac83a8b1313bf814d492c44ee5708cec/ontobio/io/gafparser.py#L237-L246
The first is always the taxon ID of the gene
I just noticed we are not doing anything with the secondary taxon. Fixed a separate ticket #222
For the purposes of this ticket: we only care about the taxon of the gene. If | is present, remove it and everything after it.
Oh, another observation not relevant to this ticket: the ==None in the code above is never satisfied; it should be ==''.
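As a sketch of that rule, the helper below keeps only the part of the taxon column before the first pipe. `primary_taxon` is a hypothetical name, not an existing ontobio function.

```python
def primary_taxon(taxon_column: str) -> str:
    """Return the gene's taxon: everything before the first '|', if one is present."""
    return taxon_column.split("|", 1)[0].strip()


single = primary_taxon("taxon:9606")
dual = primary_taxon("taxon:9606|taxon:10090")
```

Because the split happens on the raw column value, this does not depend on whether list fields are sorted later in parsing.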
Would it make sense to factor the extension into the taxon constraint test? E.g. TermID AND (in_taxon SOME TaxonID) AND ExtensionExpression
It seems like that might catch some additional problems.
Good point, yes, we should do this.
Formally, we should really check the relational expression (e.g. involved in X vs. regulates some X), but the taxon constraints (TCs) are built with an involved in assumption for now. One step at a time.
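A sketch of the extended taxon test proposed above, with the extension folded into the conjunction when one is present. `taxon_test_expression` is a hypothetical helper, and the extension is assumed to arrive as a ready-made class expression string. The relation is deliberately ignored, matching the involved in assumption.

```python
from typing import Optional


def taxon_test_expression(term_id: str,
                          taxon_id: str,
                          extension: Optional[str] = None) -> str:
    """Build TermID AND (in_taxon SOME TaxonID) [AND ExtensionExpression] as a string."""
    parts = [term_id, f"(in_taxon some {taxon_id})"]
    if extension:
        parts.append(f"({extension})")
    return " and ".join(parts)


# Illustrative identifiers only
plain = taxon_test_expression("FOO:1", "NCBITaxon:9606")
extended = taxon_test_expression("FOO:1", "NCBITaxon:9606", "occurs_in some FOO:2")
```

If the extended expression is unsatisfiable while the plain one is satisfiable, the problem lies in the combination with the extension rather than in the term and taxon alone.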
Developing here: https://github.com/balhoff/gaferencer
@cmungall can you explain this prediction from the owltools code:
https://github.com/owlcollab/owltools/blob/41e1f585d7a8135081c3f2e56c4aa6590901227b/OWLTools-Annotation/src/test/java/owltools/gaf/GAFInferenceTest.java#L40
I am getting the inference for FOO:3, but not FOO:1.
Here's the input: https://github.com/owlcollab/owltools/blob/41e1f585d7a8135081c3f2e56c4aa6590901227b/OWLTools-Annotation/src/test/resources/xp_inference_test.gaf