Gramene appears to have duplicated entries

Open pgaudet opened this issue 6 years ago • 13 comments

image

I'm not sure at which step the merge of these two entries fails to happen.

Thanks, Pascale

pgaudet avatar Sep 11 '18 07:09 pgaudet

These appear to be two "separate" entities, from two different sources, that happen to have the same local ID and symbol (but do not collide as they are in different namespaces), and are treated as such:

http://amigo.geneontology.org/amigo/gene_product/GR_protein:Q84MN8
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q84MN8

The fix here would be in either PAINT or the upstream GAF.

kltm avatar Sep 11 '18 15:09 kltm

@kltm Who provides these namespaces? Evidently GR uses one namespace and the annotations provided by PAINT use another.

Are there instructions as to which namespace should be provided for each species? Could the pipeline reconcile these entries by comparing the IDs?

Thanks, Pascale

pgaudet avatar Sep 13 '18 17:09 pgaudet

@pgaudet These are whatever is in the upstream GAFs. Namespaces are by resource, not species, enforced by convention.

kltm avatar Sep 13 '18 17:09 kltm

I don't understand why an incorrect namespace from a submitter can't be caught by the GO syntax checks and reported back to the submitter. We have a file describing what they should be called.

ValWood avatar Sep 13 '18 17:09 ValWood

Yes, a new rule enforcing certain namespaces for certain files could be created. I think this is something that probably needs discussion.
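A rule like this could be sketched as a small shell function. This is a minimal illustration, not the actual GO pipeline check: `check_gaf_namespace` is a hypothetical name, and it assumes each file is expected to use exactly one namespace in GAF column 1 (DB).

```shell
# Hypothetical helper (not part of any GO tooling): report rows whose
# namespace (GAF column 1) differs from the one expected for the file.
# Usage: check_gaf_namespace EXPECTED_NAMESPACE FILE.gaf
check_gaf_namespace() {
  expected=$1
  file=$2
  # Skip '!' header lines, then print each unexpected namespace
  # followed by a tab and the number of rows that use it.
  awk -F '\t' -v want="$expected" '
    /^!/ { next }
    $1 != want { bad[$1]++ }
    END { for (ns in bad) print ns "\t" bad[ns] }
  ' "$file" | LC_ALL=C sort
}
```

Empty output means every row uses the expected namespace. Which namespace each file should carry would still have to come from somewhere agreed on, which is the part that needs discussion.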

kltm avatar Sep 13 '18 17:09 kltm

Is it possible to know the extent of the problem? I.e., how many identical IDs live in different namespaces?

pgaudet avatar Sep 13 '18 17:09 pgaudet

@pgaudet Give me a minute...

kltm avatar Sep 13 '18 17:09 kltm

> Is it possible to know the extent of the problem? I.e., how many identical IDs live in different namespaces?

I don't think ID clashes are really the issue.

WormBase and FlyBase presumably only want one namespace, unless these namespaces have different meanings and that isn't clear. The namespace should presumably be the one in https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml

The main issue is that we often use "assigned by" to filter on a particular group's annotations, and it shouldn't be split between two resources, even if only for stats and sanity.

Having a restriction on the namespace would prevent all sorts of frequently occurring problems (like UniprotKB/UniProt).

I would vote that this just happens; it would close a lot of tickets and save a lot of time. What needs to be discussed?

ValWood avatar Sep 13 '18 18:09 ValWood

Not that it necessarily matters, but: 371618 of 89054068 or 0.4%.

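```shell
# For every local ID (column 2), count the distinct first-field values it appears with.
time zgrep -v '^!' *.gaf.gz | cut -f 1,2 | sort | uniq | awk '{print $2,$1}' | sort | cut -d ' ' -f 1 | uniq -c > /tmp/cols.txt
# Drop IDs with a count of exactly 1, then count the IDs that remain.
grep -v '[[:space:]]1[[:space:]]' /tmp/cols.txt > /tmp/cols-2.txt
cat /tmp/cols-2.txt | wc -l
371618
```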
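For comparison, the same question can be sketched as a single awk pass over uncompressed GAF rows. This is an illustration only: `ids_in_multiple_namespaces` is a hypothetical name, and the function keys purely on namespace (column 1) and local ID (column 2). Because zgrep prefixes each line with the file name when given several files, the pipeline above also distinguishes rows by file, so this sketch is not guaranteed to reproduce the number reported above.

```shell
# Hypothetical helper: read GAF rows on stdin and list every local ID
# (column 2) that occurs under more than one namespace (column 1),
# followed by a tab and the number of distinct namespaces.
ids_in_multiple_namespaces() {
  awk -F '\t' '
    /^!/ { next }                        # skip header lines
    !seen[$1 FS $2]++ { count[$2]++ }    # count each namespace/ID pair once
    END { for (id in count) if (count[id] > 1) print id "\t" count[id] }
  ' | LC_ALL=C sort
}
```

It could be run as `gzip -dc *.gaf.gz | ids_in_multiple_namespaces | wc -l`. Note that awk keeps every distinct pair in memory, which may matter at this scale.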

kltm avatar Sep 13 '18 19:09 kltm

I'd note the following bookmark:

http://amigo.geneontology.org/amigo/search/annotation?q=:&fq=taxon_subset_closure_label:%22Caenorhabditis%20elegans%22&sfq=document_category:%22annotation%22

Open the "Contributors" filter on the side and you can see all the people who have done C. elegans annotations.

kltm avatar Sep 13 '18 19:09 kltm

I'm really, really confused now. I reported 286 WormBase annotations in https://github.com/geneontology/go-annotation/issues/2071, so why don't I see them here? Are they not C. elegans annotations? That might provide a clue to where they originate. @vanaukenk

ValWood avatar Sep 13 '18 19:09 ValWood

@ValWood I believe the annotations with Contributor 'WormBase' are annotations that we've made to other species via Protein2GO that, when exported by the source group, have 'WormBase' in the contributor field rather than WB.

vanaukenk avatar Sep 13 '18 20:09 vanaukenk

Got it. So, during the round trip, the assignee changes format... Therefore a more draconian approach to validating the assignee field would fix this.
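As a first step toward such validation, one could tally the distinct values in the assigned-by column (column 15 in GAF 2.x) so that variants such as WB and WormBase stand out. A minimal sketch, with `assigned_by_counts` being a hypothetical name rather than an existing GO check:

```shell
# Hypothetical helper: read GAF rows on stdin and print each distinct
# assigned-by value (GAF column 15) followed by a tab and its row count.
assigned_by_counts() {
  awk -F '\t' '
    /^!/ { next }            # skip header lines
    NF >= 15 { n[$15]++ }    # ignore rows too short to have column 15
    END { for (a in n) print a "\t" n[a] }
  ' | LC_ALL=C sort
}
```

A stricter rule would then compare each value against an agreed list and reject the row, or report it back to the submitter, when it does not match.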

ValWood avatar Sep 13 '18 21:09 ValWood