NCBITaxon: showing up in data files

Open kltm opened this issue 5 years ago • 4 comments

From @murphyte on July 6, 2018 20:23

Hi GO -- Some data files have recently shown a change where column 13 has the string "NCBITaxon:" with no value for some rows, and others are populated with "taxon:###". Previously all rows were populated with "taxon:###". Is this expected? We're seeing this in multiple files, for example: ftp://ftp.geneontology.org/pub/go/gene-associations/gene_association.ecocyc.gz

Thanks for investigating! -Terence

Copied from original issue: geneontology/helpdesk#141

kltm avatar Jul 18 '18 20:07 kltm

From @dougli1sqrd on July 6, 2018 23:40

It looks like this is pervasive and comes from PAINT.

edouglass@Erics-MBP:~/lbl/geneontology/go-site[sparta_rule_fix ?]$ curl -L http://release.geneontology.org/2018-07-02/products/annotations/paint_ecocyc.gaf.gz | gzip -dcf | cut -f13 | uniq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0!gaf-version: 2.1
!
!Generated by GO Central
!
!Original header below, sans version:
!Created on Thu Jun 28 14:00:26 2018.
!PANTHER version: v.13.1.
!GO version: 2018-06-01.
NCBITaxon:
100  216k  100  216k    0     0   357k      0 --:--:-- --:--:-- --:--:--  357k
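
For anyone who wants the same check without the curl progress meter mixed into the output, here is a minimal Python sketch. It assumes the GAF has already been decompressed into an iterable of text lines; `taxon_column_counts` is a hypothetical helper name, not part of ontobio. Unlike `cut -f13 | uniq`, it skips the `!` header lines and counts each distinct value.

```python
from collections import Counter


def taxon_column_counts(lines):
    """Tally the values found in column 13 (taxon) of GAF rows.

    Hypothetical helper for illustration only; not an ontobio function.
    """
    counts = Counter()
    for line in lines:
        # Header/comment lines in a GAF start with '!'; skip them and blanks.
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 13 is index 12; rows that are too short are tallied as missing.
        value = fields[12] if len(fields) > 12 else "<missing>"
        counts[value] += 1
    return counts
```

Feeding it the lines of a decompressed file (for example via `gzip.open(path, "rt")`) gives a per-value count, so an empty "NCBITaxon:" stands out next to the well-formed "taxon:###" values.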

kltm avatar Jul 18 '18 20:07 kltm

From @dougli1sqrd on July 6, 2018 23:58

This is also our parser's fault. We should have caught this. We'll start on a fix.

kltm avatar Jul 18 '18 20:07 kltm

From @pgaudet on July 11, 2018 8:42

@dougli1sqrd Should this be a GO Rule?

kltm avatar Jul 18 '18 20:07 kltm

From @dougli1sqrd on July 18, 2018 1:11

This is copied from a gitter chat message I made earlier:

It looks like the bad taxon ids are coming from PAINT. I downloaded the paint ecocyc file we have in the release and it has a bunch of NCBITaxon: entries in it. https://github.com/geneontology/helpdesk/issues/141#issuecomment-403171027

Investigating how the ids didn’t get switched is a bit of a journey. Internally we use NCBITaxon:nnnn as our taxon id, so when the gaf is being read we do a simple string replace of taxon -> NCBITaxon. So far so good. Then in the writer, we’re prepared to accept valid ids of the form taxon:nnnn or NCBITaxon:nnnn. If the id is the internal representation, we switch it to the external taxon form and write it out. Otherwise, if the id is already taxon:nnnn (and this is checked), we keep it. The catch is that we merely return the given ID if it matches neither case. Since NCBITaxon: does not match either regex, it just passes through the writer, and the gaf reader is also not fully validating the ID.

I propose we push upstream to PAINT to let them know something might be wonky with their ids. We also need some ontobio changes to fix taxon id parsing. This should be caught by the parser in the first place; the writer expects that the data has already been cleaned. So this will be gaf parser work.

Just to be clear, this is fixed in the upstream data, so there is no longer bad data. We do still need to fix this in ontobio.
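
To make the pass-through concrete, here is a rough Python sketch of the behavior described above, followed by one possible stricter reader check. All function names and regexes are hypothetical stand-ins written for this comment, not the actual ontobio code, and the sketch only handles a single id per field.

```python
import re

# Assumed patterns for illustration; the real ontobio regexes may differ.
INTERNAL = re.compile(r"^NCBITaxon:\d+$")
EXTERNAL = re.compile(r"^taxon:\d+$")


def read_taxon_lenient(raw):
    """Reader side as described: a plain string replace, no validation."""
    return raw.replace("taxon", "NCBITaxon", 1)


def write_taxon_lenient(taxon_id):
    """Writer side as described: convert or keep valid ids, else pass through."""
    if INTERNAL.match(taxon_id):
        return taxon_id.replace("NCBITaxon:", "taxon:", 1)
    if EXTERNAL.match(taxon_id):
        return taxon_id
    # Neither regex matched: the id is returned unchanged, which is the bug.
    return taxon_id


def read_taxon_strict(raw):
    """One possible fix: reject malformed ids in the parser before converting."""
    if EXTERNAL.match(raw):
        return raw.replace("taxon:", "NCBITaxon:", 1)
    if INTERNAL.match(raw):
        return raw
    raise ValueError(f"malformed taxon id: {raw!r}")


# An empty "taxon:" survives the round trip as "NCBITaxon:".
print(write_taxon_lenient(read_taxon_lenient("taxon:")))  # NCBITaxon:
```

With the strict variant, an empty "taxon:" or "NCBITaxon:" raises in the reader, so the writer never sees it. Whether the parser should raise or report the row and skip it is a separate design choice.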

kltm avatar Jul 18 '18 20:07 kltm