
[UniProt] new API silently breaks gapseq_find

Open · MRMHmdeleeuw opened this issue 2 years ago · 1 comment

FYI, I created issue #275 with UniProt explaining how the new UniProt website and REST API break uniprot.sh's querying of UniRef. gapseq users may not notice this, because the queries now return empty results immediately and gapseq then uses only the stored sequences installed through update_sequences.sh.

The queries still work on the legacy UniProt server, but using it requires patching uniprot.sh, and that access will be discontinued with the next UniProt release, 2022_03.

MRMHmdeleeuw, Jul 03 '22 11:07

Thank you very much for pointing out this issue.

With commit 82ff4ad1c5c5f9c2bba17693b7e43ee8839330c4 we should have a working sequence download again. It works in three steps:

  1. Find UniProtKB IDs of query-matching proteins
  2. Match IDs/accessions from (1) to identity-based protein clusters (UniRef)
  3. Retrieve representative sequences of protein clusters

Unfortunately, the second step takes far too long when step 1 returns 10,000 or more hits, in part because we currently use batched GET requests to find the matching UniRef cluster IDs, as suggested in the issue #275 you created.
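To illustrate why the batched GET approach scales poorly, here is a minimal Python sketch (not the code in uniprot.sh). The helper names `batches` and `request_count` and the batch size of 100 are assumptions chosen for illustration only; the real batch size depends on how many IDs fit into one query URL.

```python
from math import ceil
from typing import Iterator, List, Sequence


def batches(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def request_count(n_ids: int, size: int) -> int:
    """Number of sequential requests needed to cover `n_ids` IDs."""
    return ceil(n_ids / size)


if __name__ == "__main__":
    # Assumed batch size of 100: 10,000 hits mean 100 round trips to the server.
    print(request_count(10_000, 100))
```

Each batch costs one full round trip, so the run time of step 2 grows linearly with the number of hits from step 1.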

I've just found out that the UniProt REST API has a POST option for ID mapping: https://www.uniprot.org/help/id_mapping. We will try that option, too, and hope that it is faster.
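As a rough sketch of what submitting such a POST job could look like (in Python rather than the shell used by uniprot.sh, and not the gapseq implementation): the endpoint URL and the database names `UniProtKB_AC-ID` and `UniRef90` below are assumptions taken from my reading of the linked help page and should be checked against it, and `build_idmapping_request` is a hypothetical helper name. The function only builds the request; nothing is sent.

```python
import urllib.parse
import urllib.request
from typing import Iterable

# Assumed endpoint; verify against https://www.uniprot.org/help/id_mapping
IDMAPPING_RUN_URL = "https://rest.uniprot.org/idmapping/run"


def build_idmapping_request(
    accessions: Iterable[str],
    source_db: str = "UniProtKB_AC-ID",  # assumed "from" database name
    target_db: str = "UniRef90",         # assumed "to" database name
) -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST with all IDs in the body."""
    form = urllib.parse.urlencode(
        {"from": source_db, "to": target_db, "ids": ",".join(accessions)}
    ).encode("ascii")
    return urllib.request.Request(IDMAPPING_RUN_URL, data=form, method="POST")


if __name__ == "__main__":
    req = build_idmapping_request(["P12345", "Q9Y6K9"])
    print(req.get_method(), req.full_url)
    # Sending it would be: urllib.request.urlopen(req)
```

Because the IDs travel in the request body rather than in the URL, many more of them fit into a single submission than into one GET query. As I understand the help page, the submission returns a job ID that is then polled for results.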

Thanks again,
Silvio

Waschina, Oct 03 '22 18:10