ontobio

Test for redundancy when parsing association files

Open cmungall opened this issue 5 years ago • 3 comments

To keep the runtime down, we can assume that the annotations for a given gene product are always contiguous in the file. Redundancy then only needs to be tested on a per-gene-product (gp) basis.

The behavior could be configurable, but gross redundancies should be filtered out directly, while non-gross redundancies can be tagged. When loading AmiGO we can add a GOlr field so that display of the tagged annotations is toggleable.
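A minimal sketch of the per-gp pass described above, not ontobio's actual parser code. It assumes a hypothetical `Assoc` record and a hypothetical precomputed `ancestors` map (term to its ancestor terms), treats exact duplicate rows as "gross" redundancy, and treats an annotation to an ancestor of another term on the same gp as "non-gross"; the real definitions are still open.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, NamedTuple, Set


class Assoc(NamedTuple):
    """Hypothetical minimal association record (not an ontobio type)."""
    gp: str        # gene product ID
    term: str      # ontology term ID
    evidence: str  # evidence code or method label


def filter_redundant(
    assocs: Iterable[Assoc],
    ancestors: Dict[str, Set[str]],
) -> Iterator[dict]:
    """Drop exact duplicates and tag ancestor-redundant annotations, one gp at a time."""
    # groupby only merges adjacent rows, so this relies on the assumption
    # that all rows for a gene product are contiguous in the input.
    for _gp, group in groupby(assocs, key=lambda a: a.gp):
        seen = set()
        unique = []
        for a in group:
            if a in seen:
                continue  # assumed "gross" case: exact duplicate, filtered out
            seen.add(a)
            unique.append(a)
        terms = {a.term for a in unique}
        for a in unique:
            # assumed "non-gross" case: another term on this gp is more specific
            redundant = any(
                a.term in ancestors.get(t, set()) for t in terms if t != a.term
            )
            yield {"assoc": a, "redundant": redundant}
```

Because only one gene product's rows are held in memory at a time, the check stays streaming; the `redundant` flag is the kind of value that could be loaded into a toggleable field downstream.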

See https://github.com/geneontology/amigo/issues/295 and https://github.com/geneontology/amigo/issues/43

cmungall, Aug 16 '18 17:08

A nice side effect is that it becomes easy to do analyses of the form: what is the information loss when we take out method/evidence X (e.g. SPKW)?
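One way such an analysis could look, as a rough sketch only: it assumes rows reduced to hypothetical `(gp, term, method)` tuples and measures loss as the set of gp-term pairs that disappear once method X is removed. It ignores the ontology closure, so a pair still implied by a more specific surviving annotation would be counted as lost here.

```python
from typing import Iterable, Set, Tuple

Row = Tuple[str, str, str]  # (gene product, term, method/evidence label)


def information_loss(rows: Iterable[Row], dropped: str) -> Set[Tuple[str, str]]:
    """Return the (gp, term) pairs supported only by the dropped method."""
    rows = list(rows)
    before = {(gp, term) for gp, term, _ in rows}
    after = {(gp, term) for gp, term, method in rows if method != dropped}
    return before - after
```

For example, a pair annotated by both SPKW and an experimental code survives the removal of SPKW, while a pair annotated by SPKW alone is reported as lost.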

cmungall, Aug 16 '18 21:08

I also wonder whether the redundant information is really never useful. For example, it could be used to compare coverage by PAINT vs InterPro, or to check whether IEAs are confirmed experimentally, etc.?

pgaudet, Aug 17 '18 09:08

These are definitely important. Some of these analyses could still be done by the GOC team by working with the upstream files. But it's also good to make the release files as informative as possible, as these are ingested and displayed in multiple different systems. Maybe we need two sets of release files? TBD.

For now we must be conservative in what we filter. I think it's still OK to filter something like an F-P inference that is redundant with a P annotation, or the like.

cmungall, Aug 17 '18 15:08