queries fail for some uniprot accessions

Open ftwkoopmans opened this issue 1 year ago • 1 comments

Some UniProt accessions can neither be queried nor appear as output in the "uniprot" field/scope. To illustrate, I've included 2 examples: one accession that works (P63044) and one that fails (P23819).

This works via https://mygene.info/v3/api#/query/get_query with "q" input: P63044 and "fields" input: symbol,name,taxid,entrezgene,uniprot

returns:

{
  "took": 16,
  "total": 1,
  "max_score": 17.406927,
  "hits": [
    {
      "_id": "22318",
      "_score": 17.406927,
      "entrezgene": "22318",
      "name": "vesicle-associated membrane protein 2",
      "symbol": "Vamp2",
      "taxid": 10090,
      "uniprot": {
        "Swiss-Prot": "P63044",
        "TrEMBL": "Q8CHR4"
      }
    }
  ]
}

This also works via https://mygene.info/v3/api#/query/get_query with "q" input: P23819 and "fields" input: symbol,name,taxid,entrezgene,uniprot

and returns:

{
  "took": 13,
  "total": 1,
  "max_score": 7.8478303,
  "hits": [
    {
      "_id": "14800",
      "_score": 7.8478303,
      "entrezgene": "14800",
      "name": "glutamate receptor, ionotropic, AMPA2 (alpha 2)",
      "symbol": "Gria2",
      "taxid": 10090,
      "uniprot": {
        "TrEMBL": "Q4LG64"
      }
    }
  ]
}

However, note that for the latter query, the UniProt accession that I queried (a Swiss-Prot record) is not included in the "uniprot" output field! So it seems there is a problem with the mygene.info database: possibly a subset of UniProt accessions/IDs is not stored or linked under "uniprot". Affected accessions include P23819, Q61941, and Q8VHW2.
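The mismatch described above can be checked mechanically. The following is a minimal sketch, not part of the mygene.info client: the helper names `uniprot_accessions` and `query_is_echoed` are hypothetical, the two dictionaries are trimmed copies of the responses quoted above, and the handling of list-valued sub-fields is an assumption about how multiple accessions might be returned.

```python
def uniprot_accessions(hit):
    """Collect every accession listed in a hit's "uniprot" field."""
    found = set()
    for value in hit.get("uniprot", {}).values():
        # Assumption: a sub-field holds either one string or a list of strings.
        if isinstance(value, str):
            found.add(value)
        else:
            found.update(value)
    return found


def query_is_echoed(accession, response):
    """Return True if any hit lists the queried accession under "uniprot"."""
    return any(
        accession in uniprot_accessions(hit)
        for hit in response.get("hits", [])
    )


# Trimmed copies of the two GET responses quoted in this issue.
vamp2 = {"hits": [{"_id": "22318",
                   "uniprot": {"Swiss-Prot": "P63044", "TrEMBL": "Q8CHR4"}}]}
gria2 = {"hits": [{"_id": "14800",
                   "uniprot": {"TrEMBL": "Q4LG64"}}]}

print(query_is_echoed("P63044", vamp2))  # True
print(query_is_echoed("P23819", gria2))  # False
```

Running the same check over a larger list of accessions would give a rough count of how many Swiss-Prot records are affected.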

Furthermore, POST queries against these accessions fail even though they should not (probably same root cause).

This works via https://mygene.info/v3/api#/query/post_query; { "q": "P63044", "scopes": "uniprot" } returns:

[
  {
    "query": "P63044",
    "_id": "22318",
    "_score": 16.7524,
    "entrezgene": "22318",
    "name": "vesicle-associated membrane protein 2",
    "symbol": "Vamp2",
    "taxid": 10090
  }
]

This query fails, but it should not, as this is a valid UniProt accession that is in the mygene.info dataset (see GET query above); { "q": "P23819", "scopes": "uniprot" } returns:

[
  {
    "query": "P23819",
    "notfound": true
  }
]
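When several accessions are checked at once, it may help to separate the entries flagged with "notfound" from the resolved ones. Below is a minimal sketch under that assumption; `split_by_found` is a hypothetical helper, and the sample list simply combines the two POST results quoted above into one response for illustration.

```python
def split_by_found(results):
    """Split POST results into resolved queries and unresolved ones.

    Returns (found, missing): found maps each query to the list of
    matching "_id" values, missing lists queries flagged "notfound".
    """
    found, missing = {}, []
    for entry in results:
        if entry.get("notfound"):
            missing.append(entry["query"])
        else:
            found.setdefault(entry["query"], []).append(entry["_id"])
    return found, missing


# The two POST results from this issue, combined for illustration only.
results = [
    {"query": "P63044", "_id": "22318", "symbol": "Vamp2", "taxid": 10090},
    {"query": "P23819", "notfound": True},
]

found, missing = split_by_found(results)
print(found)    # {'P63044': ['22318']}
print(missing)  # ['P23819']
```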

ftwkoopmans avatar Jul 07 '22 11:07 ftwkoopmans

Just to add a tiny bit more info. I suspect the difference in behavior between P63044 and P23819 is due to the lack of an Entrez Gene mapping in the UniProt file for P23819.

The source file for the uniprot data plugin appears to be https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz.

From the README, the column headings for this file are as follows:

1. UniProtKB-AC
2. UniProtKB-ID
3. GeneID (EntrezGene)
4. RefSeq
5. GI
6. PDB
7. GO
8. UniRef100
9. UniRef90
10. UniRef50
11. UniParc
12. PIR
13. NCBI-taxon
14. MIM
15. UniGene
16. PubMed
17. EMBL
18. EMBL-CDS
19. Ensembl
20. Ensembl_TRS
21. Ensembl_PRO
22. Additional PubMed

Note the difference between the two records below in column 3, which should hold the mapping to Entrez Gene.

$ gzip -cd idmapping_selected.tab.gz | awk '$1=="P63044"' | tr "\t" "\n" | cat -n | head
     1  P63044
     2  VAMP2_MOUSE
     3  22318
     4  NP_033523.1
     5  51704193; 6678551
     6
     7  GO:0030136; GO:0060203; GO:0005737; GO:0031410; GO:0030659; GO:0030285; GO:0043231; GO:0043229; GO:0016020; GO:0043005; GO:0044306; GO:0048471; GO:0005886; GO:0030141; GO:0030667; GO:0031201; GO:0000322; GO:0045202; GO:0008021; GO:0030672; GO:0070044; GO:0070032; GO:0070033; GO:0005802; GO:0031982; GO:0042589; GO:0048306; GO:0005516; GO:0042802; GO:0017022; GO:0005543; GO:0008022; GO:0044877; GO:0005484; GO:0000149; GO:0019905; GO:0017075; GO:0044325; GO:0017156; GO:0032869; GO:0043308; GO:0098967; GO:0043001; GO:0046879; GO:0060291; GO:0061025; GO:0090316; GO:0015031; GO:0065003; GO:0045055; GO:0017158; GO:1902259; GO:0017157; GO:1903421; GO:0060627; GO:0009749; GO:0035493; GO:0016081; GO:0048488; GO:0016079; GO:0006906; GO:0016192
     8  UniRef100_P63044
     9  UniRef90_P63044
    10  UniRef50_P63044
$ gzip -cd idmapping_selected.tab.gz | awk '$1=="P23819"' | tr "\t" "\n" | cat -n | head
     1  P23819
     2  GRIA2_MOUSE
     3
     4
     5  496139; 22096313; 26335713; 496140; 12852206
     6  7LDD:B; 7LDD:D; 7LDE:B; 7LDE:D; 7LEP:B; 7LEP:D
     7  GO:0032281; GO:0032279; GO:0009986; GO:0030425; GO:0032839; GO:0043198; GO:0043197; GO:0005783; GO:0005789; GO:0098978; GO:0030426; GO:0005887; GO:0099061; GO:0099055; GO:0099056; GO:0016020; GO:0043005; GO:0043025; GO:0043204; GO:0099544; GO:0005886; GO:0014069; GO:0098839; GO:0045211; GO:0042734; GO:0032991; GO:0098685; GO:0036477; GO:0045202; GO:0097060; GO:0008021; GO:0030672; GO:0043195; GO:0004971; GO:0001540; GO:0051117; GO:0008092; GO:0005234; GO:0035254; GO:0042802; GO:0019865; GO:0004970; GO:0015277; GO:0015276; GO:0030165; GO:0019901; GO:0038023; GO:0000149; GO:1904315; GO:0007268; GO:0045184; GO:0035235; GO:0050806; GO:0051262; GO:0031623; GO:0001919; GO:0051966
     8  UniRef100_P23819
     9  UniRef90_P19491-3
    10  UniRef50_P19491
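The same column-3 check can be sketched in Python. This is an illustration only: `entrez_ids` is a hypothetical helper, the two sample rows are truncated to the first five tab-separated columns shown above, and splitting multiple GeneIDs on ";" is an assumption based on how the other multi-valued columns are delimited.

```python
GENEID_COLUMN = 2  # zero-based index of column 3, GeneID (EntrezGene)


def entrez_ids(line):
    """Return the Entrez Gene IDs from one idmapping_selected.tab row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= GENEID_COLUMN:
        return []
    # Assumption: multiple IDs are separated by ";" like the other columns.
    return [part.strip()
            for part in fields[GENEID_COLUMN].split(";")
            if part.strip()]


# First five columns of the two records shown above.
rows = [
    "P63044\tVAMP2_MOUSE\t22318\tNP_033523.1\t51704193; 6678551",
    "P23819\tGRIA2_MOUSE\t\t\t496139; 22096313; 26335713; 496140; 12852206",
]

for row in rows:
    accession = row.split("\t", 1)[0]
    print(accession, entrez_ids(row))
# P63044 ['22318']
# P23819 []
```

Applied to the full file, a filter on rows where `entrez_ids` returns an empty list would enumerate the accessions that lack a GeneID mapping in this source.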

This difference can also be seen on the corresponding UniProt web pages

  • https://www.uniprot.org/uniprotkb/P63044/entry -> 22318
  • https://www.uniprot.org/uniprotkb/P23819/entry -> none

Having said that, the reciprocal links do exist in NCBI Gene (likely through a mapping to RefSeq Protein):

  • https://ncbi.nlm.nih.gov/gene/?term=P63044 -> 22318
  • https://ncbi.nlm.nih.gov/gene/?term=P23819 -> 14800

andrewsu avatar Jul 15 '22 05:07 andrewsu