amigo
Gramene appears to have duplicated entries
I'm not sure at which step the merge of these two entries is failing to happen.
Thanks, Pascale
These appear to be two "separate" entities, from two different sources, that happen to have the same local ID and symbol (but do not collide, as they are in different namespaces), and are treated as such:
http://amigo.geneontology.org/amigo/gene_product/GR_protein:Q84MN8
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q84MN8
The fix here would be in either PAINT or the upstream GAF.
@kltm Who provides these namespaces? Evidently GR uses one namespace, and the annotations provided by PAINT use another.
Are there instructions as to which namespace should be provided for each species? Could the pipeline reconcile these entries by comparing their local IDs?
Thanks, Pascale
@pgaudet These are whatever is in the upstream GAFs. Namespaces are by resource, not species, enforced by convention.
I don't understand why an incorrect namespace from a submitter can't be caught by the GO syntax checks and reported back to the submitter. We have a file describing what the namespaces should be called.
Yes, a new rule restricting certain files to certain namespaces could be created and enforced. I think this is something that probably needs discussion.
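For discussion, here is a minimal sketch of what such a rule might look like. It is not an existing GO rule: the `ALLOWED_NAMESPACES` table, the file names in it, and the `check_namespaces` helper are all hypothetical. The only thing it relies on is that the namespace is the first tab-separated column of a GAF line.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical rule table: which namespaces (GAF column 1) each upstream
# file is allowed to use. File names and contents are illustrative only.
ALLOWED_NAMESPACES: Dict[str, Set[str]] = {
    "group_a.gaf": {"UniProtKB"},
    "group_b.gaf": {"WB"},
}


def check_namespaces(filename: str, lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, namespace) for every line that breaks the rule."""
    allowed = ALLOWED_NAMESPACES.get(filename)
    violations: List[Tuple[int, str]] = []
    if allowed is None:
        # No rule registered for this file, so nothing to report.
        return violations
    for lineno, line in enumerate(lines, start=1):
        # Skip GAF header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        namespace = line.split("\t", 1)[0]
        if namespace not in allowed:
            violations.append((lineno, namespace))
    return violations
```

The returned line numbers could be sent back to the submitter in the same way as other syntax-check reports.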
Is it possible to know the extent of the problem, i.e. how many identical IDs live in different namespaces?
@pgaudet Give me a minute...
Is it possible to know the extent of the problem, i.e. how many identical IDs live in different namespaces?
I don't think ID clashes are really the issue.
WormBase and FlyBase presumably only want one namespace, unless these namespaces have different meanings and that isn't clear. The namespace should presumably be the one in https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml
The main issue is that we often use "assigned by" to filter on a particular group's annotations, and those annotations shouldn't be split between two resources, even if only for stats and sanity.
Having a restriction on the namespace would prevent all sorts of frequently occurring problems (like UniprotKB/UniProt).
I would vote that this just happens; it would close a lot of tickets and save a lot of time. What needs to be discussed?
Not that it necessarily matters, but: 371618 of 89054068 or 0.4%.
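# Collect the distinct (column 1, column 2) pairs from all GAFs, skipping '!' header lines,
# then count how many distinct column-1 values each local ID (column 2) appears with.
# Note: with several input files, zgrep prefixes each line with the file name,
# so column 1 here is "file:namespace" rather than the bare namespace.
time zgrep -v '^!' *.gaf.gz | cut -f 1,2 | sort | uniq | awk '{print $2,$1}' | sort | cut -d ' ' -f 1 | uniq -c > /tmp/cols.txt
# Drop the IDs whose count is exactly 1.
grep -v '[[:space:]]1[[:space:]]' /tmp/cols.txt > /tmp/cols-2.txt
# Count the IDs that remain.
cat /tmp/cols-2.txt | wc -l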
371618
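The same kind of count can be sketched in Python, which also makes it easy to see which namespaces share each ID. This is an illustrative sketch, not the script that produced the figure above: the function name `ids_in_multiple_namespaces` is made up here, and it groups strictly by the namespace in column 1, so its totals would not necessarily match the shell pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def ids_in_multiple_namespaces(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each local ID (GAF column 2) to its namespaces (column 1),
    keeping only the IDs that occur in more than one namespace."""
    namespaces_by_id: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        # Skip GAF header/comment lines.
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue
        namespace, local_id = cols[0], cols[1]
        namespaces_by_id[local_id].add(namespace)
    return {
        local_id: spaces
        for local_id, spaces in namespaces_by_id.items()
        if len(spaces) > 1
    }
```

Fed the two Q84MN8 lines from the entries linked earlier in this thread, it would report that ID under both `GR_protein` and `UniProtKB`.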
I'd note the following bookmark: http://amigo.geneontology.org/amigo/search/annotation?q=:&fq=taxon_subset_closure_label:%22Caenorhabditis%20elegans%22&sfq=document_category:%22annotation%22 Open the "Contributors" filter on the side and you can see all the groups that have made C. elegans annotations.
I'm really, really confused now. Why don't I see the 286 WormBase annotations that I reported here: https://github.com/geneontology/go-annotation/issues/2071 Are they not C. elegans annotations? That might provide a clue as to where they originate. @vanaukenk
@ValWood I believe the annotations with Contributor 'WormBase' are annotations that we've made to other species via Protein2GO that, when exported by the source group, have 'WormBase' in the contributor field rather than WB.
Got it. So during the round-tripping the assignee changes format, and therefore a more draconian approach to validating the assignee field would fix this.
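A stricter check on that field might look roughly like the sketch below. Everything in it is an assumption for illustration: the `KNOWN_GROUPS` table is hand-written here (a real check would presumably be driven by the metadata file mentioned above), and `check_assigned_by` is a hypothetical helper. The only pairing taken from this thread is 'WormBase' appearing where 'WB' was expected.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical table: canonical "assigned by" value -> known variant spellings.
KNOWN_GROUPS: Dict[str, Set[str]] = {
    "WB": {"WormBase"},
}


def check_assigned_by(value: str) -> Tuple[str, Optional[str]]:
    """Classify an 'assigned by' value as ok, a known variant, or unknown."""
    if value in KNOWN_GROUPS:
        return ("ok", value)
    for canonical, variants in KNOWN_GROUPS.items():
        if value in variants:
            # A recognised variant: report the canonical form it should use.
            return ("variant", canonical)
    return ("unknown", None)
```

A validator could reject "variant" and "unknown" values outright, or rewrite variants to the canonical form and report them; which of those is appropriate is part of what would need to be agreed.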