mygene.info

Proteins with no genes

Open stuppie opened this issue 6 years ago • 2 comments

I think it would be useful for mygene to also store information about proteins with no associated Entrez record. For example: http://www.uniprot.org/uniprot/A2NXD2 http://www.uniprot.org/uniprot/Q5NV61

stuppie Mar 05 '18 23:03

@newgene this issue would require adjusting the ID conversion in the uniprot parser. Currently it tries to convert uniprot_acc to an Entrez ID or, if that is not possible, to an Ensembl ID. If neither is available, the document is skipped. The fix is probably somewhere around this: https://github.com/biothings/mygene.info/blob/master/src/hub/dataload/sources/uniprot/parser.py#L53. What do you think?
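For readers following along, the behaviour described above can be sketched roughly as follows. This is only an illustration, not the actual parser code: the function name `resolve_gene_id`, the two mapping dicts, and the sample accessions are all made up for the example.

```python
from typing import Dict, Optional


def resolve_gene_id(
    uniprot_acc: str,
    entrez_map: Dict[str, str],
    ensembl_map: Dict[str, str],
) -> Optional[str]:
    """Hypothetical helper: prefer an Entrez ID, fall back to an Ensembl ID.

    Returns None when neither mapping knows the accession, which stands in
    for "the document is skipped".
    """
    if uniprot_acc in entrez_map:
        return entrez_map[uniprot_acc]
    return ensembl_map.get(uniprot_acc)


# Made-up mappings, for illustration only.
entrez_map = {"ACC_A": "1001"}
ensembl_map = {"ACC_A": "ENSG_FAKE_1", "ACC_B": "ENSG_FAKE_2"}

for acc in ("ACC_A", "ACC_B", "ACC_C"):
    gene_id = resolve_gene_id(acc, entrez_map, ensembl_map)
    if gene_id is None:
        continue  # no Entrez or Ensembl ID: this accession is dropped
    print(acc, "->", gene_id)
```

In this sketch `ACC_C` never reaches the output, which is the situation the issue is about.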

sirloon May 10 '18 17:05

We need to give this one more thought. MyGene.info is supposed to be all about genes: if it is not a gene, there is no record in MyGene.info. But I agree that including those UniProt IDs is useful, as genes and proteins are so closely tied together. A protein with no associated gene ID just means that the corresponding gene has not been identified yet; there should still be a gene somewhere in the genome encoding this protein.

With this in mind, I am not against the idea of giving a document a "fake" gene ID placeholder and putting the corresponding UniProt ID within this document (so that this UniProt ID will be searchable).

One way of making this "fake" gene id is like this:

"_id": "NO_GENE_ID_FOR_A2NXD2"

This expands the gene _id priority list to three tiers: NCBI Gene ID --> Ensembl Gene ID --> NO_GENE_ID for a UniProt-only gene.
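A minimal sketch of what that three-tier fallback could look like is below. It is an assumption-laden illustration, not the real implementation: `make_gene_doc`, the mapping dicts, and the `"uniprot": {"accession": ...}` field layout are hypothetical, and only the `NO_GENE_ID_FOR_` prefix comes from the proposal above.

```python
from typing import Dict

PLACEHOLDER_PREFIX = "NO_GENE_ID_FOR_"  # prefix taken from the proposal above


def make_gene_doc(
    uniprot_acc: str,
    entrez_map: Dict[str, str],
    ensembl_map: Dict[str, str],
) -> dict:
    """Hypothetical helper: pick an _id by priority and attach the accession.

    Priority: NCBI Gene ID, then Ensembl Gene ID, then a placeholder
    built from the UniProt accession itself.
    """
    _id = (
        entrez_map.get(uniprot_acc)
        or ensembl_map.get(uniprot_acc)
        or PLACEHOLDER_PREFIX + uniprot_acc
    )
    # The nested field name is an assumption made for this example.
    return {"_id": _id, "uniprot": {"accession": uniprot_acc}}


# Empty mappings simulate a protein with no known gene.
doc = make_gene_doc("A2NXD2", entrez_map={}, ensembl_map={})
print(doc["_id"])
```

With this shape no accession is skipped, and the placeholder prefix makes UniProt-only documents easy to tell apart from real gene IDs later on.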

Your opinions? @stuppie @sirloon @cyrus0824 @andrewsu

newgene May 10 '18 22:05