Perform inference during association parsing

Open cmungall opened this issue 5 years ago • 10 comments

While parsing GAFs etc., ontobio will construct expressions for each line and send them to an inference engine to determine whether the annotations (a) are taxonomically invalid or otherwise logically incoherent (rule 13), or (b) can be inferred to a more specific term (rule 25).

This may be via owlery, by sending a file to a command-line wrapper, or by calling owlapi via jpype (not favored).

The expressions will be tuples of the form:

(TaxonID GP2TermRelation TermID ExtensionExpression)

The OWL inference engine will turn this into

GP2TermRelation SOME (TermID AND ExtensionExpression)

and test for direct inferred class expressions (rule 25). It will also test for incoherency (e.g. the extension expression is wrong).

It will return a list of rel-term tuples.

It will then test

(TermID AND in_taxon SOME TaxonID)

If this is unsatisfiable, then rule 13 fails.
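
As a rough sketch of the two expressions described above, the following Python builds them as plain strings in the notation used in this issue. The `AnnotationTuple` type, the function names, and the example IDs are hypothetical and chosen only for illustration; the call to the reasoner itself is not shown.

```python
from typing import NamedTuple, Optional


class AnnotationTuple(NamedTuple):
    """One (TaxonID, GP2TermRelation, TermID, ExtensionExpression) tuple."""
    taxon_id: str
    relation: str
    term_id: str
    extension: Optional[str] = None  # assumed to be a ready-made class expression


def inference_expression(a: AnnotationTuple) -> str:
    """Build: GP2TermRelation SOME (TermID AND ExtensionExpression)."""
    filler = f"{a.term_id} AND {a.extension}" if a.extension else a.term_id
    return f"{a.relation} SOME ({filler})"


def taxon_check_expression(a: AnnotationTuple) -> str:
    """Build: (TermID AND in_taxon SOME TaxonID)."""
    return f"({a.term_id} AND in_taxon SOME {a.taxon_id})"


# Illustrative values only.
example = AnnotationTuple(
    taxon_id="NCBITaxon:9606",
    relation="involved_in",
    term_id="GO:0006915",
    extension="(occurs_in SOME CL:0000540)",
)
print(inference_expression(example))
print(taxon_check_expression(example))
```

The first string would be classified to look for more specific terms (rule 25); the second would be checked for satisfiability (rule 13).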

ontobio will be in charge of collecting the results and turning them into a JSON structure.

Alternate architecture/workflow: a standalone Scala tool that does this over a GAF

cmungall, Aug 16 '18 17:08

The OWLTools code for rule25 is here: https://github.com/owlcollab/owltools/blob/41e1f585d7a8135081c3f2e56c4aa6590901227b/OWLTools-Annotation/src/main/java/owltools/gaf/inference/FoldBasedPredictor.java#L177

While we could continue to use this, it is awkward and there is poor separation of concerns.

cmungall, Aug 16 '18 18:08

For a GAF line with two taxa in column 13, which taxon should be used in the taxon restriction?

balhoff, Aug 16 '18 19:08

Always the first.

There may be interesting things to do with the second, but let's leave that for now.

cmungall, Aug 16 '18 20:08

I think we sort every list field, perhaps destroying the order of the taxa...

dougli1sqrd, Aug 16 '18 22:08

Clarification on Eric's point: we don't do this for taxon IDs:

https://github.com/biolink/ontobio/blob/fb622ba8ac83a8b1313bf814d492c44ee5708cec/ontobio/io/gafparser.py#L237-L246

The first is always the taxon ID of the gene

I just noticed we are not doing anything with the secondary taxon. Filed a separate ticket: #222

For the purposes of this ticket: we only care about the taxon of the gene. If a | is present, remove it and everything after it.
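
A minimal sketch of that rule in Python, assuming the taxon column arrives as a raw string such as "taxon:9606|taxon:11676". The function name `gene_taxon` is hypothetical and is not part of ontobio.

```python
def gene_taxon(taxon_column: str) -> str:
    """Return the first taxon ID, dropping any '|' and everything after it."""
    return taxon_column.split("|", 1)[0].strip()


print(gene_taxon("taxon:9606|taxon:11676"))  # taxon:9606
print(gene_taxon("taxon:9606"))              # taxon:9606
```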

cmungall, Aug 16 '18 22:08

Oh, another observation not relevant to this ticket: the == None check in the code above is never satisfied; it should be == ''.
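
To illustrate why, here is a small sketch (not the ontobio parser itself; `split_columns` and the sample line are made up): splitting a tab-delimited line yields an empty string for an empty column, never None.

```python
def split_columns(line: str) -> list:
    """Split one tab-delimited line into its columns."""
    return line.rstrip("\n").split("\t")


cols = split_columns("DB\tOBJ1\t\tGO:0005634\n")
empty_field = cols[2]

print(empty_field is None)  # False: split() never produces None
print(empty_field == "")    # True: an empty column is an empty string
```

A truthiness test such as `if not empty_field:` would cover both cases, if the field might also be None elsewhere.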

cmungall, Aug 17 '18 00:08

Would it make sense to factor the extension into the taxon constraint test? E.g. TermID AND (in_taxon SOME TaxonID) AND ExtensionExpression

It seems like that might catch some additional problems.
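
As a sketch of this suggestion, the extended test expression could be assembled like this. The function name and example IDs are hypothetical, and the string is only the notation used in this thread, not a reasoner API call.

```python
from typing import Optional


def taxon_constraint_expression(
    term_id: str, taxon_id: str, extension: Optional[str] = None
) -> str:
    """Build: TermID AND (in_taxon SOME TaxonID) [AND ExtensionExpression]."""
    parts = [term_id, f"(in_taxon SOME {taxon_id})"]
    if extension:
        # Only add the extension conjunct when the annotation has one.
        parts.append(extension)
    return " AND ".join(parts)


print(taxon_constraint_expression(
    "GO:0006915", "NCBITaxon:9606", "(occurs_in SOME CL:0000540)"
))
```

If the combined expression is unsatisfiable but the version without the extension is satisfiable, the problem lies in the extension rather than in the term-taxon pairing.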

balhoff, Aug 17 '18 01:08

Good point. Yes, we should do this.

Formally, we should really check the relational expression (e.g. involved in X vs. regulates some X), but the taxon constraints (TCs) are built with an "involved in" assumption for now. One step at a time.

cmungall, Aug 17 '18 06:08

Developing here: https://github.com/balhoff/gaferencer

balhoff, Aug 18 '18 16:08

@cmungall can you explain this prediction from the owltools code:

https://github.com/owlcollab/owltools/blob/41e1f585d7a8135081c3f2e56c4aa6590901227b/OWLTools-Annotation/src/test/java/owltools/gaf/GAFInferenceTest.java#L40

I am getting the inference for FOO:3, but not FOO:1.

Here's the input: https://github.com/owlcollab/owltools/blob/41e1f585d7a8135081c3f2e56c4aa6590901227b/OWLTools-Annotation/src/test/resources/xp_inference_test.gaf

balhoff, Aug 21 '18 14:08