foldseek
Mismatch between results from server and standalone version?
I wonder whether the standalone version needs special settings to match the results of the server. To reproduce the situation, please search this PDB against PDB100.
What I did:
1) Standalone
foldseek easy-search AF-P64850-F1-model_v4.pdb /data/cltam/script/foldseek/database/pdb_202305/pdb AF-P64850-F1-model_v4_test-foldseek tmpfolder
top 5 hits sorted in descending order of probability:
2) Server
Just upload the PDB file, submit, and go to PDB100.
There are currently multiple differences between the default Foldseek command-line settings and the Foldseek webserver:
- Different versions of the database: The webserver is currently still running an older version of the PDB. The latest available through the databases download module is from 2023-05-17 iirc.
- Different parameters: The Foldseek webserver uses the following command-line parameters: --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1. Of these, --prefilter-mode 1 is the most relevant. We introduced it very recently as a different kind of prefiltering algorithm that more closely resembles our earlier software such as HHblits: each query structure is aligned against all target structures with an ungapped alignment. For single-structure searches this is faster than our normal prefiltering algorithm, which is in turn faster when searching many structures against many.
Hi @milot-mirdita , thanks for the explanation!
So there are two factors now:
- Different PDB version -> I checked the top hits from the web server (1ISV, 1MC9 and 1KNM), which are old structures deposited in 2002. It seems this is not an important factor in this case.
- Different parameters -> Given that the top hits are also present in the database of the standalone version, I wonder how it actually missed the top hits found by the web server. Does --prefilter-mode 1 just increase the speed? For the other flags you mentioned, I cannot see how they might have caused the standalone version to miss the top hits.
I have just tested these options with the command below, using a freshly downloaded Foldseek PDB database.
foldseek easy-search --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1 AF-P64850-F1-model_v4.pdb ./pdb AF-P64850-F1-model_v4_test-foldseek_server-like tmpfolder
However, the top hits still do not include the top hits from the server.
top 5 hits sorted in descending order of probability:
Please let us know what the difference is if you know; otherwise it will be hard to use the standalone version.
Please provide the pdb database and foldseek version used.
The server PDB database is not updated. This probably explains the differences.
However, to check whether your results are consistent with ours, please check the following.
Here is what I produced with foldseek commit 522d883fedec3ab395edd87ebe002b2b37c6ba72 and the following PDB (current version, downloaded through the databases command):
f355c78925d62c8e0388ca6f594d5ec2 pdb100.tar.gz
2023-06-20 PDB_DATE
d6743f3252147ce039abc1375afed2d03766115d FOLDSEEK_COMMIT
Command:
foldseek databases PDB pdb tmp
wget https://alphafold.ebi.ac.uk/files/AF-P64850-F1-model_v4.pdb
foldseek easy-search AF-P64850-F1-model_v4.pdb pdb aln tmp --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1
Top 10 results:
AF-P64850-F1-model_v4.pdb 1knl_A 0.246 123 89 0 38 156 4 126 3.109E-09 356
AF-P64850-F1-model_v4.pdb 5gqd_A 0.233 121 89 0 38 154 305 425 1.859E-08 330
AF-P64850-F1-model_v4.pdb 1v6v_B 0.230 136 99 0 27 156 301 436 1.172E-08 329
AF-P64850-F1-model_v4.pdb 5gqe_B 0.229 128 95 0 33 156 302 429 1.393E-08 327
AF-P64850-F1-model_v4.pdb 1xyf_A 0.233 121 89 0 38 154 305 425 8.330E-08 304
AF-P64850-F1-model_v4.pdb 3a23_B 0.176 127 99 0 36 156 486 612 1.571E-07 303
AF-P64850-F1-model_v4.pdb 2d1z_B 0.240 121 88 0 38 154 305 425 1.049E-07 295
AF-P64850-F1-model_v4.pdb 4owk_E 0.186 128 98 0 35 156 3 130 2.480E-08 282
AF-P64850-F1-model_v4.pdb 4g1r_C 0.188 121 91 0 35 155 1 113 4.189E-07 278
AF-P64850-F1-model_v4.pdb 2vlc_A 0.204 125 96 0 36 156 265 389 1.763E-07 272
Can you reproduce this on your side?
Hi Martin. May I know how to check the foldseek and PDB version?
You said the web server database is not updated but how does it cause the standalone to miss the top hits captured by the web server?
My results using your options:
To describe the situation in more detail: on the web server, the top hits include 1ISV, 1MC9 and 1KNM. Visualizing their structural alignments with the query shows that they are high-quality true positive hits, yet the standalone version missed them. So this is not merely a mismatch of results due to different versions or options. In this particular case, the web server outperformed the standalone version at recalling true positive hits from the database, and that is the problem.
The differences in the database explain this. The PDB100 is not guaranteed to contain all the chains since it is clustered. Chains 1isv, 1mc9, and 1knm are not cluster representatives in the current PDB100.
To check that the stand-alone version is not broken, I ran the newest Foldseek binary against the PDB100 on the webserver and obtained the following result list. This result is consistent with the webserver. See the results here.
AF-P64850-F1-model_v4.pdb 1mc9_A 0.250 125 90 0 36 156 2 126 1.234E-08 345
AF-P64850-F1-model_v4.pdb 1isv_A 0.230 136 99 0 27 156 301 436 4.903E-09 338
AF-P64850-F1-model_v4.pdb 1knm_A 0.257 128 92 0 32 156 2 129 1.307E-08 336
...
❗Please use the --cluster-search 1 parameter to resolve all members from the detected PDB100 clusters.
foldseek easy-search AF-P64850-F1-model_v4.pdb pdb aln tmp --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1 --cluster-search 1
By using the --cluster-search parameter, I was able to find 1isv, 1mc9, 1knm, and many more.
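A quick way to confirm which expected chains actually appear in a result file is sketched below. It is only an illustration: the function name check_hits is made up here, and it assumes the default output layout in which the target ID is the second column and starts with the PDB ID (as in 1isv_A).

```shell
# For each PDB ID given, report whether any target (column 2) in a
# Foldseek result file starts with that ID (case-insensitive).
# check_hits is a hypothetical helper name, not part of Foldseek.
check_hits() {
    results="$1"
    shift
    for id in "$@"; do
        if awk -v id="$id" '
            index(tolower($2), tolower(id)) == 1 { found = 1; exit }
            END { exit !found }
        ' "$results"; then
            echo "$id found"
        else
            echo "$id missing"
        fi
    done
}

# Usage, with the output file name from the command above:
#   check_hits aln 1isv 1mc9 1knm
```

Running it once on the output with --cluster-search 1 and once without makes the effect of the parameter easy to compare.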
I see. But then the question shifts to the clustering: I wonder why PDB100 did not include a representative chain that aligns as well as 1isv, 1mc9 or 1knm. Does Foldseek provide the clustering info? I would like to check it.
Yes, please check pdb100_clu. You can convert it to a human-readable file using the following command.
foldseek createtsv pdb100 pdb100 pdb100_clu pdb100_clu.tsv
The first column is the representative id, and the second is the member id.
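Given that two-column layout, looking up the representative of a particular chain could be done roughly as follows. This is a sketch under assumptions: find_cluster_rep is a made-up helper name, the TSV is assumed to be tab-separated, and member IDs are assumed to start with the PDB ID (for example 1isv_A).

```shell
# Print "representative<TAB>member" for every row of a createtsv cluster
# file whose member ID (column 2) starts with the given PDB ID.
# The match is case-insensitive. find_cluster_rep is a hypothetical name.
find_cluster_rep() {
    pdb_id="$1"
    tsv="$2"
    awk -F'\t' -v id="$pdb_id" '
        index(tolower($2), tolower(id)) == 1 { print $1 "\t" $2 }
    ' "$tsv"
}

# Usage, with the file written by the createtsv command above:
#   find_cluster_rep 1isv pdb100_clu.tsv
```

If a chain is listed only as a member, a search without --cluster-search 1 would report its representative rather than the chain itself.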
Hi Martin. Thank you so much for helping. After checking the clustering info, I found no problems.
However, I then discovered that it is actually the prob reported by the standalone version that is underestimated.
I grepped 1isv from the results generated by your command with --cluster-search 1, executed with the latest foldseek version and the latest PDB version (both installed and downloaded on 20230703).
The prob for 1isv reported by the standalone version is 0.230.
However, the server reported 1.00.
You need to add prob to the output fields. What you see in the third column is the sequence identity. By default we show the following fields: query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits.
You can do this using the following command:
foldseek easy-search AF-P64850-F1-model_v4.pdb pdb aln tmp --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1 --cluster-search 1 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,prob
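With that field list, prob ends up in column 13 and bits in column 12, so ranking hits by probability could look like the sketch below. The helper name top_hits_by_prob is made up for illustration, and the column positions hold only for the exact --format-output list shown above.

```shell
# Print the n best hits from a Foldseek result file, ordered by prob
# (column 13 with the --format-output list above), ties broken by bits
# (column 12). Both keys sort numerically, highest first.
# top_hits_by_prob is a hypothetical helper name.
top_hits_by_prob() {
    results="$1"
    n="${2:-5}"
    # LC_ALL=C keeps "." as the decimal separator regardless of locale.
    LC_ALL=C sort -k13,13nr -k12,12nr "$results" | head -n "$n"
}

# Usage, with the output file name from the command above:
#   top_hits_by_prob aln 5
```

Sorting on an explicit column number avoids mistaking fident (column 3) for the probability.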
Oops! Sorry for mixing up the columns!
Unfortunately, I got an "Error: Convert Alignments died" error when I include --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,prob
20230724 update: same error obtained using conda environment python=3.11 with
- foldseek installed with
conda install -c conda-forge -c bioconda foldseek
- pdb downloaded with
foldseek databases PDB pdb tmp
- run foldseek
foldseek easy-search AF-P64850-F1-model_v4.pdb pdb aln tmp --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1 --cluster-search 1 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,prob
Okay, we now have a proper pdb100 with which everything should work. To use it, please update foldseek and re-download the PDB. Sorry for the delay; it took a bit of time to get it running.