foldseek
How can I extract uniprot ids and corresponding foldseek sequences from a pre-generated database?
Hi!
Thank you for your great work! Is it possible to download a pre-generated database and manually generate a FASTA file containing all protein names and their corresponding foldseek sequences?
For example, for the alphafold_swissprot database, I want to extract all UniProt IDs and foldseek sequences and write them to a FASTA file like:
uniprot_id_1 xxxxxxxxxxxxxxxxxxx
uniprot_id_2 xxxxxxxxxxxxxxxxxxxxxxxxxx
Thank you in advance. I'm looking forward to your reply!
You can use createsubdb with a list of accessions and then call convert2fasta to make a FASTA file:
foldseek createsubdb accession_list alphafold_swissprot afsp_subset --id-mode 1
foldseek convert2fasta afsp_subset afsp_subset.fasta
Please check that the accessions you pass are in the same format as the ones stored in the second column of the alphafold_swissprot.lookup file.
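To get accessions in exactly the stored format, one option is to take them straight from the lookup file. The sketch below assumes the lookup file is tab-separated with the accession in the second column, as described above. It runs on a toy file (toy.lookup, with made-up accession strings) rather than the real alphafold_swissprot.lookup:

```shell
# Toy stand-in for alphafold_swissprot.lookup.
# Assumed columns: internal key, accession, file number (tab-separated).
# The accession strings are made up for illustration.
printf '0\tAF-P12345-F1-model_v4\t0\n1\tAF-Q67890-F1-model_v4\t1\n' > toy.lookup

# Keep only the second column: one accession per line.
cut -f2 toy.lookup > accession_list

cat accession_list
```

Running the same cut on the real lookup file would give a list covering every entry in the database, which can then be trimmed to the accessions of interest.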
Thanks for your quick reply! I tried the commands above and they did generate a FASTA file!
It is slightly different from what I had in mind, though: I want the sequences encoded by foldseek, not the residue sequences. Could you tell me how to generate that kind of FASTA file?
Thank you again!
You mean the 3Di sequences?
foldseek createsubdb accession_list alphafold_swissprot_ss afsp_subset_ss --id-mode 1
foldseek lndb alphafold_swissprot_h afsp_subset_ss_h
foldseek convert2fasta afsp_subset_ss afsp_subset_ss.fasta
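If the one-line-per-protein layout from the original question ("uniprot_id sequence") is preferred over FASTA, the FASTA output can be flattened with awk. This is a minimal sketch on a toy file (toy_ss.fasta, with made-up headers and 3Di strings), assuming the identifier is the first word of each header line:

```shell
# Toy stand-in for afsp_subset_ss.fasta; headers and 3Di strings are made up.
printf '>AF-P12345-F1-model_v4 Example protein\nDVVLVVCQ\n>AF-Q67890-F1-model_v4 Another protein\nDPPVLVSLLVV\n' > toy_ss.fasta

# Print "<id> <sequence>" per record.
# id  = first word of the header without the leading ">"
# seq = all sequence lines of the record joined together
awk '
  /^>/ { if (id != "") print id, seq; id = substr($1, 2); seq = ""; next }
       { seq = seq $0 }
  END  { if (id != "") print id, seq }
' toy_ss.fasta > toy_ss.txt

cat toy_ss.txt
```

Joining the sequence lines keeps the script correct even if a record is wrapped over several lines.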
That's exactly what I want!
Thank you very much! Have a nice day!
Hello,
When I tried the command foldseek createsubdb accession_list alphafold_swissprot_ss afsp_subset_ss --id-mode 1 on afdb50, I got these errors:
Could you tell me how I can fix this problem? I want to generate the 3Di sequences for all UniProt entries from this database.
I think you first have to run:
ln -s alphafold_swissprot.lookup alphafold_swissprot_ss.lookup
I just tried this command, but the errors persist.
Also, here is part of my accession_list.txt:
Could you please post all commands you executed (preferably as text and not as screenshots)? I am not sure what's going on currently.
Hi,
I think I found the problem. afdb50 only contains about 50M sequences after clustering, but I need sequences for the whole UniProt database (~200M sequences). So I downloaded the full afdb database instead, which I think should solve the problem.
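A quick way to spot this kind of mismatch before running createsubdb is to list the requested accessions that are absent from the database's lookup file. The sketch below uses toy files (toy.lookup and toy_accession_list, with made-up accessions) and assumes the accession sits in the second tab-separated column of the lookup file:

```shell
# Toy inputs; the accession strings are made up for illustration.
printf '0\tAF-P12345-F1-model_v4\t0\n1\tAF-Q67890-F1-model_v4\t1\n' > toy.lookup
printf 'AF-P12345-F1-model_v4\nAF-A00000-F1-model_v4\n' > toy_accession_list

# First file: remember every accession found in column 2 of the lookup.
# Second file: print each requested accession that was never seen.
awk -F'\t' 'NR == FNR { seen[$2] = 1; next } !($1 in seen)' \
  toy.lookup toy_accession_list > missing.txt

cat missing.txt
```

A non-empty missing.txt means some requested entries are either not in the downloaded database (for example, because it is a clustered subset) or written in a different format from the lookup file.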