mygene.info

query API returns empty summary on some genes

Open jingjingbic opened this issue 1 year ago • 1 comments

We are using this query ( https://mygene.info/v3/query?q=symbol:POLA2&size=1&species=human&fields=name,summary ) to get the summary of gene POLA2, but the call returns no summary. However, if we look up this gene at NCBI, there is a summary on its page: https://www.ncbi.nlm.nih.gov/gene/23649. Does MyGene.info pull the summary value from NCBI or from another data source? A query on DRG1 has a similar issue.
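For anyone who wants to reproduce this, here is a minimal sketch of the same query in Python using only the standard library. The helper names (`build_query_url`, `extract_summary`, `fetch_summary`) are made up for illustration, and the code assumes the response is a JSON object with a `hits` list, where a hit without a summary simply has no `summary` key.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://mygene.info/v3/query"


def build_query_url(symbol, species="human", fields=("name", "summary"), size=1):
    """Build the same query URL as in the report above for a given gene symbol."""
    params = {
        "q": f"symbol:{symbol}",
        "size": size,
        "species": species,
        "fields": ",".join(fields),
    }
    # Keep ':' and ',' unescaped so the URL reads like the one in the report.
    return BASE_URL + "?" + urllib.parse.urlencode(params, safe=":,")


def extract_summary(response):
    """Return the summary of the first hit, or None if it is absent.

    Assumes the response is a dict with a "hits" list (an assumption about
    the response shape, not taken from this thread).
    """
    hits = response.get("hits", [])
    if not hits:
        return None
    return hits[0].get("summary")


def fetch_summary(symbol):
    """Run the query over the network and return the summary, if any."""
    with urllib.request.urlopen(build_query_url(symbol), timeout=30) as resp:
        return extract_summary(json.load(resp))
```

Calling `fetch_summary("POLA2")` and getting `None` back would correspond to the missing summary described above.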

jingjingbic avatar Aug 24 '22 00:08 jingjingbic

@jingjingbic thanks for reporting this to us. We did some investigation and found out why this happens.

The summary field in MyGene.info is obtained from NCBI's RefSeq records.

For example, the summary of gene CDK2 comes from NM_001798 (in the "COMMENT" section, starting with "Summary:").
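To illustrate the idea, here is a rough sketch of pulling the "Summary:" paragraph out of the COMMENT section of a GenBank-format RefSeq record. This is not the actual MyGene.info pipeline code; the function name `extract_refseq_summary` is hypothetical, and it assumes the usual flat-file layout where top-level keywords start in the first column, continuation lines are indented, and paragraphs inside COMMENT are separated by blank lines.

```python
def extract_refseq_summary(genbank_text):
    """Return the text after "Summary:" in the COMMENT section, or None.

    Assumes a GenBank flat-file record: top-level keywords start in column 0,
    continuation lines are indented, and COMMENT paragraphs are separated by
    blank (or whitespace-only) lines.
    """
    comment_lines = []
    in_comment = False
    for line in genbank_text.splitlines():
        if line.startswith("COMMENT"):
            in_comment = True
            comment_lines.append(line[len("COMMENT"):].strip())
        elif in_comment:
            if line and not line[0].isspace():
                break  # reached the next top-level keyword, e.g. FEATURES
            comment_lines.append(line.strip())

    # Re-join wrapped lines into paragraphs, splitting on blank lines.
    paragraphs = []
    current = []
    for text in comment_lines:
        if text:
            current.append(text)
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))

    for paragraph in paragraphs:
        if paragraph.startswith("Summary:"):
            return paragraph[len("Summary:"):].strip()
    return None  # record has a COMMENT section without a summary, or none at all
```

A record whose COMMENT section has no "Summary:" paragraph returns `None` here, which is the kind of gap reported for POLA2.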

This worked for pretty much all genes in the past. However, as you pointed out, we are now starting to see some gene summaries that do not come from the corresponding RefSeq record.

I think there could be two reasons:

  1. NCBI may be delayed in adding the summary to some RefSeq records (or it could potentially be a mistake). In this case we will just wait for RefSeq to update. MyGene.info stays closely synced with NCBI; once RefSeq is updated (the current release is 213), MyGene.info should pick up the changes in a week or so.

  2. NCBI likely has another place to store some gene summary data, in addition to RefSeq records. We cannot find the POLA2 summary in any of the data files we sync from NCBI. We will have to reach out to NCBI on this.

Either way, this looks like something we should double-check with NCBI. Depending on their response, we can decide whether any changes are needed on the MyGene.info side.

newgene avatar Aug 26 '22 15:08 newgene

We contacted NCBI helpdesk and confirmed this:

We currently do not add the summaries imported from the Alliance of Genome Resources onto the RefSeq transcript records. Summaries are also not added to model RefSeqs.

Instead of RefSeq records, the complete set of gene summary text is available from NCBI's ASN.1 binary dump files. We can modify our pipeline to extract gene summaries from these files instead. A separate issue, #130, was created for this task.

newgene avatar Sep 04 '22 22:09 newgene

A temporary fix for human genes is done.

jal347 avatar Sep 27 '22 22:09 jal347