How can I extract uniprot ids and corresponding foldseek sequences from a pre-generated database?

Open LTEnjoy opened this issue 2 years ago • 9 comments

Hi!

Thank you for your great work! I would like to know whether I can download a pre-generated database and manually generate a FASTA file containing all protein names and their corresponding Foldseek sequences.

For example, for the alphafold_swissprot database, I want to extract all UniProt IDs and Foldseek sequences and write them to a FASTA file like:

uniprot_id_1 xxxxxxxxxxxxxxxxxxx

uniprot_id_2 xxxxxxxxxxxxxxxxxxxxxxxxxx

Thank you in advance and I'm looking forward to your reply!

LTEnjoy avatar Oct 26 '23 06:10 LTEnjoy

You can use createsubdb with a list of accessions and then call convert2fasta to make a FASTA file:

foldseek createsubdb accession_list alphafold_swissport afsp_subset --id-mode 1
foldseek convert2fasta afsp_subset afsp_subset.fasta

Please check that the accessions you pass are in the same format as the ones that are stored in the second column of the alphafold_swissport.lookup file.
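
One way to run that check before calling createsubdb is sketched below. The helper name `missing_accessions` is made up for illustration, and the sketch assumes the .lookup file is tab-separated with the accession in its second column and that the accession list holds one accession per line.

```shell
# Hypothetical helper: print every accession in LIST that does not occur
# in the second column of LOOKUP (assumed to be tab-separated).
missing_accessions() {
    list=$1
    lookup=$2
    awk -F'\t' '
        NR == FNR { known[$2] = 1; next }   # first file: remember lookup names
        !($1 in known) { print $1 }         # second file: report unknown accessions
    ' "$lookup" "$list"
}
```

Calling `missing_accessions accession_list alphafold_swissport.lookup` should print nothing if every accession has a matching entry. Any line it does print is an accession whose format probably differs from what the database stores.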

milot-mirdita avatar Oct 26 '23 07:10 milot-mirdita

Thanks for your quick reply! I tried the above commands and they did generate a FASTA file!

It's just slightly different from what I expected: I want the sequences as encoded by Foldseek, not the residue sequences. Could you tell me how to generate that kind of FASTA file?

Thank you again!

LTEnjoy avatar Oct 26 '23 07:10 LTEnjoy

You mean the 3Di sequences?

foldseek createsubdb accession_list alphafold_swissport_ss afsp_subset_ss --id-mode 1
foldseek lndb alphafold_swissport_h afsp_subset_ss_h
foldseek convert2fasta afsp_subset_ss afsp_subset_ss.fasta
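
If the one-line-per-protein layout from the original question is preferred over FASTA, the output can be flattened afterwards. This is only a sketch: `fasta_to_pairs` is a hypothetical helper name, and it assumes the first whitespace-delimited token of each header line (after the `>`) is the identifier you want to keep.

```shell
# Hypothetical helper: turn a FASTA file into "id sequence" lines.
fasta_to_pairs() {
    awk '
        /^>/ {
            if (id != "") print id, seq     # flush the previous record
            id = substr($1, 2)              # first header token without ">"
            seq = ""
            next
        }
        { seq = seq $0 }                    # join wrapped sequence lines
        END { if (id != "") print id, seq } # flush the last record
    ' "$1"
}
```

For example, `fasta_to_pairs afsp_subset_ss.fasta > afsp_subset_ss.txt` would write one identifier and its 3Di string per line, separated by a space.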

milot-mirdita avatar Oct 26 '23 08:10 milot-mirdita

That's exactly what I want!

Thank you very much! Have a nice day!

LTEnjoy avatar Oct 26 '23 08:10 LTEnjoy

Hello,

When I tried the command foldseek createsubdb accession_list alphafold_swissport_ss afsp_subset_ss --id-mode 1 on afdb50, I got these errors:

[screenshot of the error output]

Could you tell me how I can fix this problem? I want to generate all UniProt 3Di sequences from this database.

LTEnjoy avatar Nov 01 '23 07:11 LTEnjoy

I think you first have to run:

ln -s alphafold_swissport.lookup alphafold_swissport_ss.lookup

milot-mirdita avatar Nov 01 '23 08:11 milot-mirdita

I just tried this command, but the errors persist. [screenshot]

Also, here are some of the contents of my accession_list.txt: [screenshot]

LTEnjoy avatar Nov 01 '23 08:11 LTEnjoy

Could you please post all commands you executed (preferably as text and not as screenshots)? I am not sure what's going on currently.

milot-mirdita avatar Nov 03 '23 02:11 milot-mirdita

Hi,

I think I found the problem. afdb50 contains only about 50M sequences after clustering, but I need to generate sequences for the whole UniProt database (~200M sequences). So I downloaded the afdb database instead, which I think should solve the problem.

LTEnjoy avatar Nov 04 '23 03:11 LTEnjoy