mygene.info

NCBI genes map to ensembl genes with invalid identifiers

Open dhimmel opened this issue 3 years ago • 1 comments

I've noticed three genes where the value for ensembl.gene does not begin with ENSG:

https://mygene.info/v3/gene/263?fields=ensembl
ensembl.gene appears to actually be ENSG00000237801
{"_id": "263", "_version": 1, "ensembl": {"gene": "263", "transcript": "263-1", "translation": [], "type_of_gene": "rRNA"}}

https://mygene.info/v3/gene/55872?fields=ensembl
ensembl.gene appears to actually be ENSG00000168078
{"_id": "55872", "_version": 3, "ensembl": {"gene": "55872", "transcript": "55872-1", "translation": [], "type_of_gene": "tRNA"}}

https://mygene.info/v3/gene/126231?fields=ensembl
ensembl.gene appears to actually be ENSG00000189144
{"_id": "126231", "_version": 2, "ensembl": {"gene": "126231", "transcript": "126231-1", "translation": [], "type_of_gene": "tRNA"}}

In these cases, it seems the value for ensembl.gene has been set to entrezgene (the ncbigene id). Any ideas what the problem is?
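A minimal sketch of how one might flag such records automatically. It assumes human genes only (other species use different Ensembl prefixes) and that the `ensembl` field may come back as either a single object or a list of objects. The helper names `invalid_ensembl_genes` and `fetch_gene` are made up for illustration and are not part of any mygene client.

```python
import json
from urllib.request import urlopen


def invalid_ensembl_genes(doc):
    """Return ensembl.gene values in a gene document that do not start with ENSG."""
    ensembl = doc.get("ensembl", [])
    # Assumption: the field can be a single object or a list of objects.
    if isinstance(ensembl, dict):
        ensembl = [ensembl]
    return [
        entry["gene"]
        for entry in ensembl
        if "gene" in entry and not entry["gene"].startswith("ENSG")
    ]


def fetch_gene(entrez_id):
    """Fetch the ensembl field for one gene, using the URL pattern shown above."""
    url = f"https://mygene.info/v3/gene/{entrez_id}?fields=ensembl"
    with urlopen(url) as response:
        return json.load(response)


# Offline check against the response quoted above for gene 263.
doc_263 = {
    "_id": "263",
    "_version": 1,
    "ensembl": {"gene": "263", "transcript": "263-1", "translation": [], "type_of_gene": "rRNA"},
}
print(invalid_ensembl_genes(doc_263))
```

To check the live service, something like `invalid_ensembl_genes(fetch_gene("55872"))` should work, though the result depends on the current state of the data.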

dhimmel, Dec 11 '20 16:12

This issue was introduced when we integrated Metazoa species data from Ensembl through BioMart.

File path: ensembl_metazoa/49/gene_ensembl__gene__main.txt
Text-based search: awk '$2 == "263" { print $0 }' gene_ensembl__gene__main.txt
Returns: 27923 263 rns 3153 3520 Mt 1 rRNA

Since no entrezgene id can be mapped to this record, we use its Ensembl gene id (263) as the _id. That value accidentally matches the genedoc with _id:263 from Entrez for human, so the two records get merged.
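The collision described above can be sketched roughly as follows. This is not the actual mygene.info pipeline code: `choose_id` and `merge_by_id` are hypothetical helpers, and the document fields are simplified for illustration.

```python
def choose_id(ensembl_gene_id, entrez_id=None):
    """Hypothetical _id rule: prefer the Entrez id, else fall back to the Ensembl gene id."""
    return str(entrez_id) if entrez_id is not None else ensembl_gene_id


def merge_by_id(docs):
    """Merge documents that share an _id, as a stand-in for combining data sources."""
    merged = {}
    for doc in docs:
        merged.setdefault(doc["_id"], {}).update(doc)
    return merged


# Human gene document from Entrez (taxid 9606 is human).
human_doc = {"_id": "263", "taxid": 9606, "entrezgene": "263"}

# Metazoa record whose Ensembl gene id happens to be the bare string "263";
# no Entrez id maps to it, so the fallback picks "263" as _id.
metazoa_doc = {
    "_id": choose_id("263"),
    "ensembl": {"gene": "263", "type_of_gene": "rRNA"},
}

merged = merge_by_id([human_doc, metazoa_doc])
print(merged["263"])
```

In this sketch the human document ends up carrying the Metazoa `ensembl` block, which matches the symptom reported in the issue.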

kevinxin90, Dec 17 '20 01:12