mygene.info
NCBI genes map to ensembl genes with invalid identifiers
I've noticed three genes where the value for ensembl.gene does not begin with ENSG:
https://mygene.info/v3/gene/263?fields=ensembl
ensembl.gene appears to actually be ENSG00000237801
{"_id": "263", "_version": 1, "ensembl": {"gene": "263", "transcript": "263-1", "translation": [], "type_of_gene": "rRNA"}}
https://mygene.info/v3/gene/55872?fields=ensembl
ensembl.gene appears to actually be ENSG00000168078
{"_id": "55872", "_version": 3, "ensembl": {"gene": "55872", "transcript": "55872-1", "translation": [], "type_of_gene": "tRNA"}}
https://mygene.info/v3/gene/126231?fields=ensembl
ensembl.gene appears to actually be ENSG00000189144
{"_id": "126231", "_version": 2, "ensembl": {"gene": "126231", "transcript": "126231-1", "translation": [], "type_of_gene": "tRNA"}}
In these cases, it seems the value for ensembl.gene has been set to entrezgene (the ncbigene id). Any ideas what the problem is?
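A check like the one described above can be sketched in Python. This is only an illustration: the helper names `invalid_ensembl_genes` and `fetch_gene` are made up for this example, and the `ENSG` prefix test assumes human genes. The sample document is the response for gene 263 quoted above, so the check itself runs without a network call.

```python
import json
from urllib.request import urlopen


def invalid_ensembl_genes(doc, prefix="ENSG"):
    """Return the ensembl.gene values in a gene document that lack the prefix."""
    ensembl = doc.get("ensembl", [])
    # Treat a single ensembl object and a list of them the same way.
    entries = ensembl if isinstance(ensembl, list) else [ensembl]
    return [
        str(entry["gene"])
        for entry in entries
        if "gene" in entry and not str(entry["gene"]).startswith(prefix)
    ]


def fetch_gene(gene_id):
    """Fetch the ensembl field for one gene, using the URL pattern from this report."""
    url = f"https://mygene.info/v3/gene/{gene_id}?fields=ensembl"
    with urlopen(url) as response:
        return json.load(response)


# The response for gene 263 quoted above, checked offline.
doc_263 = json.loads(
    '{"_id": "263", "_version": 1, "ensembl": {"gene": "263", '
    '"transcript": "263-1", "translation": [], "type_of_gene": "rRNA"}}'
)
print(invalid_ensembl_genes(doc_263))  # ['263']
```

To check the live records, something like `invalid_ensembl_genes(fetch_gene("55872"))` could be run for each of the three IDs.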
This issue arises when we integrate Ensembl Metazoa species data through BioMart.
File path: ensembl_metazoa/49/gene_ensembl__gene__main.txt
Text-based search: awk '$2 == "263" { print $0 }' gene_ensembl__gene__main.txt
Returns: 27923 263 rns 3153 3520 Mt 1 rRNA
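For anyone who would rather reproduce that search in Python, here is a rough equivalent of the awk filter. The function name `rows_with_field` is made up for this sketch. It splits on runs of whitespace, as awk does by default, and the sample line is the row returned above.

```python
def rows_with_field(lines, column, value):
    """Yield whitespace-split rows whose 1-based `column` equals `value`."""
    for line in lines:
        fields = line.split()  # like awk's default field splitting
        if len(fields) >= column and fields[column - 1] == value:
            yield fields


# The row returned by the awk command above, used here as sample input.
sample = ["27923 263 rns 3153 3520 Mt 1 rRNA"]
matches = list(rows_with_field(sample, 2, "263"))
print(matches)

# Against the real dump, the lines would come from the file instead:
# with open("gene_ensembl__gene__main.txt") as handle:
#     matches = list(rows_with_field(handle, 2, "263"))
```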
Since no entrezgene id can be mapped to this record, we use its Ensembl gene id (263) as the _id. That value happens to coincide with the _id of the Entrez gene document for human gene 263, so the two records are merged.
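The collision can be pictured with a toy merge keyed on `_id`. This is not the actual MyGene.info pipeline code. `choose_doc_id` and `merge_record` are hypothetical names, and the rule "fall back to the Ensembl id when no Entrez id is mapped" is taken from the description above.

```python
def choose_doc_id(entrez_id, ensembl_id):
    """Hypothetical rule: use the Entrez id if one is mapped, else the Ensembl id."""
    return entrez_id if entrez_id is not None else ensembl_id


def merge_record(docs, doc_id, fields):
    """Merge fields into the document stored under doc_id, creating it if needed."""
    docs.setdefault(doc_id, {"_id": doc_id}).update(fields)


docs = {}

# Human gene from Entrez: the _id is the Entrez gene id.
merge_record(docs, choose_doc_id("263", None), {"entrezgene": 263})

# Metazoa gene from Ensembl with no Entrez mapping: the _id falls back to "263".
merge_record(
    docs,
    choose_doc_id(None, "263"),
    {"ensembl": {"gene": "263", "type_of_gene": "rRNA"}},
)

# Both records landed in one document, so the human gene now carries
# the Metazoa ensembl.gene value.
print(docs)
```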