gorule-0000027 misses some invalid ID in the with/field

Open pgaudet opened this issue 1 year ago • 9 comments

Hello,

@alexsign reported that some 'with' data in the exported Noctua GPADs contain "MGI" rather than "MGI:MGI". https://github.com/geneontology/go-site/blob/master/metadata/rules/gorule-0000027.md mentions that all db prefixes should be found in the dbxref file.

Note that the rule states

In all cases, the prefix MUST be in db-xrefs.yaml. The prefix SHOULD be identical (case-sensitive match) to the database field. If it does not match then it MUST be identical (case-sensitive) to one of the synonyms.

However for MGI the database field is MGI, not MGI:MGI.
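A minimal sketch of the prefix check quoted above, assuming the db-xrefs.yaml entries have been loaded into a list of dicts with `database` and `synonyms` keys. The entries and the `check_prefix` helper are illustrative only; they are not copied from the real file or from ontobio.

```python
# Hypothetical stand-in for a few parsed db-xrefs.yaml entries.
# The synonym shown here is made up for illustration.
DBXREFS = [
    {"database": "MGI", "synonyms": []},
    {"database": "UniProtKB", "synonyms": ["UniProt"]},
]


def check_prefix(prefix, dbxrefs=DBXREFS):
    """Classify a prefix using case-sensitive comparison.

    Returns "ok" if it equals a database field, "synonym" if it only
    equals one of the synonyms, and "unknown" otherwise.
    """
    for entry in dbxrefs:
        if prefix == entry["database"]:
            return "ok"
    for entry in dbxrefs:
        if prefix in entry.get("synonyms", []):
            return "synonym"
    return "unknown"


for prefix in ("MGI", "UniProt", "UniPotKB", "mgi"):
    print(prefix, check_prefix(prefix))
```

Under this reading, "MGI" on its own is a valid prefix; the second "MGI:" belongs to the local ID rather than to the prefix.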

@kltm do we need to change the dbxref to align with this?

pgaudet avatar Sep 06 '23 10:09 pgaudet

Assigning @kltm because we need your input to proceed with this.

pgaudet avatar Sep 06 '23 10:09 pgaudet

Isn't the prefix MGI? And the local ID values themselves also contain MGI:. Like this: (MGI:)(MGI:96182). So if the second MGI is missing, it's not a dbxrefs file problem, but a problem with the software or data? Sorry if I'm jumping into something without context!
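To make the split concrete, here is a sketch that separates the prefix from the local ID at the first colon only. The `MGI:[0-9]{5,}` pattern is an assumption about what the MGI id_syntax looks like, and `split_curie` is a hypothetical helper, not an ontobio function.

```python
import re

# Assumed shape of the MGI local ID; the real id_syntax may differ.
MGI_LOCAL_ID = re.compile(r"MGI:[0-9]{5,}")


def split_curie(curie):
    """Split at the first colon, so the local ID keeps any later colons."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"no prefix separator in {curie!r}")
    return prefix, local_id


for value in ("MGI:MGI:96182", "MGI:96182"):
    prefix, local_id = split_curie(value)
    is_valid = MGI_LOCAL_ID.fullmatch(local_id) is not None
    print(value, "->", prefix, local_id, "valid" if is_valid else "invalid")
```

With this split, `MGI:MGI:96182` yields the prefix `MGI` and the local ID `MGI:96182`, while `MGI:96182` yields the local ID `96182`, which the assumed pattern rejects.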

balhoff avatar Sep 06 '23 14:09 balhoff

Re: "software or data?" The answer is both: the data is wrong according to our standards and we are not fixing it. In a perfect world, the first is not true and the failure of the second is not necessary. But, alas... IIRC, there is an issue about examining the IDs in the "with" column in the GORULES tracker somewhere. There should be some basic checking there, although we have never used the regexps that were added after our pipeline was established (IIRC, added by Tony later on to align metadata a little). MGI has always been a special case and, until we purge that historical choice from the data stream, it's something that we just have to deal with.

We'd have to look at the flow, but I believe all files (sans uniprot) pass through ontobio at some point and are parsed, so that would probably be the most expeditious place to catch things: python parse. Ideally, our internally produced files are not making the mistake when emitting data (i.e. minerva and PANTHER/PAINT), but as long as it doesn't make it out to end users, it doesn't matter too much. Unfortunately, that means that GO-CAM files /do/ get out as there is no QC occurring there--a running frustration.

I think that the best thing to do for the moment would be to:

  • [ ] make sure that minerva emits the correct identifier into TTL and the produced GPAD as a special case
  • [ ] that the rule is added and enforced in the python parsing
  • [ ] we do a one-time update of current TTL, if this error exists on our side

Again, any TTL/GO-CAM issues are "invisible" to us for the time being, so it's better to err on the side of caution.

kltm avatar Sep 06 '23 19:09 kltm

Noting too that the GPAD currently emitted by minerva is a bit between specs, IIRC. That makes it a little harder to define what should happen, but that's fine for the moment as long as it is internally consistent.

kltm avatar Sep 06 '23 19:09 kltm

Noting that GOA filters out this data (i.e. 'with' values that have a single "MGI:" as the prefix).

pgaudet avatar Sep 07 '23 07:09 pgaudet

Related or same as https://github.com/geneontology/go-site/issues/1218

pgaudet avatar Sep 26 '23 10:09 pgaudet

From the test GAF, tests #4-9 are not failing.

  • [ ] test 4: Database prefix not in /db-xrefs.yaml
  • [ ] test 5: Assigned by not in groups
  • [ ] test 6-9 checks on references

pgaudet avatar Sep 26 '23 13:09 pgaudet

@mugitty It looks like at least the namespace of the 'with' (GAF column 8) is checked in gorule-0000001 (GORULE_TEST:0000001-19)

pgaudet avatar Nov 29 '23 14:11 pgaudet

So we should define exactly what is checked in gorule-0000001 and narrow the scope of gorule-0000027

GORULE_TEST:0000027-1 GORULE_TEST:0000027-2 GORULE_TEST:0000027-3 GORULE_TEST:0000027-8 are failing gorule-0000001

pgaudet avatar Nov 29 '23 14:11 pgaudet

Now - gorule-0000027 picks up tests 1, 3 and 4

! FAILS GORULE:0000027 - TEST 1 - Prefix not in /db-xrefs.yaml UniPotKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:23072806 IDA P GORULE_TEST:0000027-1 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

! FAILS GORULE:0000027 - TEST 3 - Bad reference syntax UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:PMID:14561399 IDA P GORULE_TEST:0000027-3 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

! FAILS GORULE:0000027 - TEST 4 - Bad reference syntax UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:unpublished IDA P GORULE_TEST:0000027-4 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

but not 2,5, and 6

! FAILS GORULE:0000027 - TEST 2 - Assigned_by not in /groups.yaml UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:23072806 IDA P GORULE_TEST:0000027-2 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 SGDDB

! FAILS GORULE:0000027 - TEST 5 - Bad referencesyntax UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID: IDA P GORULE_TEST:0000027-5 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

OK, this is in the scope of GORULE-0000001 since there is no value at all after the namespace.

! FAILS GORULE:0000027 - TEST 6 - Bad reference syntax UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:0. IDA P GORULE_TEST:0000027-6 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

Should this have been picked up? The ID syntax is:

database: PMID id_syntax: '[0-9]+'
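One way to see why `PMID:0.` should fail: the pattern has to cover the whole local ID. The sketch below uses Python's `re` module with the `[0-9]+` pattern quoted above. Whether the pipeline anchors the pattern this way is not established in this thread, so the comparison is an assumption to check, and `valid_local_id` is a hypothetical helper.

```python
import re

PMID_SYNTAX = re.compile(r"[0-9]+")  # id_syntax quoted above


def valid_local_id(local_id):
    """True only if the entire local ID matches the id_syntax."""
    return PMID_SYNTAX.fullmatch(local_id) is not None


# Local IDs taken from the reference column of the test lines in this thread.
for local_id in ("23072806", "0.", "unpublished", "PMID:14561399", ""):
    anchored = valid_local_id(local_id)
    prefix_only = PMID_SYNTAX.match(local_id) is not None
    print(repr(local_id), "fullmatch:", anchored, "match:", prefix_only)
```

An unanchored `re.match` accepts `0.` because it only requires the pattern to match at the start of the string, which would be one possible explanation for test 6 slipping through.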

pgaudet avatar Jul 18 '24 16:07 pgaudet

GORULE-0000027 is also picking up tests that I was not expecting

GORULE_TEST:0000001-6 WARNING - Invalid identifier:GORULE:0000027: X not found in list of database names in dbxrefs--PomBase SPAC25B8.17 ypf1 is_active_in GO:0005634 GO_REF:0000024 ISO SGD:S000001583 C GORULE_TEST:0000001-6 intramembrane aspartyl protease of the perinuclear ER membrane Ypf1 (predicted) ppp81 protein taxon:4896 3/5/15 PomBase part_of(X:1)

  • GORULE_TEST:0000051-PASS1 WARNING - Invalid identifier:GORULE:0000027: 123456 does not match any id_syntax patterns for CL in dbxrefs--UniProtKB O76187 darA enables GO:0005515 PMID:9802899 IPI UniProtKB:P34149 F GORULE_TEST:0000051-PASS1 Darlin darA protein taxon:44689 20100205 GO_Central has_input(GO:0003674)|occurs_in(CL:123456)
  • WARNING - Invalid identifier:GORULE:0000027: FBrf0193169 does not match any id_syntax patterns for FB in dbxrefs--FB FBgn0011273 Acam part_of GO:0008023 FB:FBrf0193169|PMID:16790438 IDA C GORULE_TEST:0000061-1 Androcam ACaM|And|CG17769|CalB|Calmodulin-related 97A|Camr97A|androcalmodulin|androcam protein taxon:7227 20180501 GO_Central
  • WARNING - Invalid identifier:GORULE:0000027: 3836072 does not match any id_syntax patterns for MGI in dbxrefs--MGI:1100518 Smad7 bla involved_in GO:0017015 MGI:MGI:3836072|PMID:18952608 IC GO:0060389 P GORULE_TEST:0000020-3 SMAD protein_coding_gene taxon:10090 20090211 GO_Central
  • WARNING - Invalid identifier:GORULE:0000027: UniProtKB-SubCell not found in list of database names in dbxrefs--UniProtKB P77335 hlyE located_in GO:0020002 GO_REF:0000044 IEA UniProtKB-SubCell:SL-0375 C GORULE_TEST:0000029-1 protein taxon:83333 20220807 GO_Central
  • WARNING - Invalid identifier:GORULE:0000027: UniProtKB-SubCell not found in list of database names in dbxrefs--UniProtKB P77335 hlyE located_in GO:0020002 GO_REF:0000044 IEA UniProtKB-SubCell:SL-0375 C GORULE_TEST:0000029-2 protein taxon:83333 20200507 GO_Central
  • WARNING - Invalid identifier:GORULE:0000027: UniProtKB-SubCell not found in list of database names in dbxrefs--UniProtKB P77335 hlyE located_in GO:0020002 GO_REF:0000045 IEA UniProtKB-SubCell:SL-0375 C GORULE_TEST:0000030-1 protein taxon:83333 20230607 GO_Central

pgaudet avatar Jul 18 '24 16:07 pgaudet

! FAILS GORULE:0000027 - TEST 5 - Bad referencesyntax UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID: IDA P GORULE_TEST:0000027-5 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

Moved the line above from the test for gorule-0000027-5 to the test for gorule-0000001-29 (and renamed gorule-0000027-6 to gorule-0000027-5 to avoid gaps).

  • Remove GORULE_TEST:0000027-5

pgaudet avatar Jul 25 '24 13:07 pgaudet

All tests are failing as expected.

pgaudet avatar Jul 25 '24 14:07 pgaudet