Mismatch between results from server and standalone version?

Open johnnytam100 opened this issue 1 year ago • 16 comments

I wonder whether the standalone version needs special settings to match the results of the server. Please replicate the situation with this PDB file searched against PDB100.

What I did:

1) Standalone:

foldseek easy-search AF-P64850-F1-model_v4.pdb /data/cltam/script/foldseek/database/pdb_202305/pdb AF-P64850-F1-model_v4_test-foldseek tmpfolder

Top 5 hits sorted in descending order of probability:

image

2) Server: just upload the PDB file, submit, and go to PDB100.

image

johnnytam100 avatar Jun 24 '23 00:06 johnnytam100

There are currently multiple differences between the default Foldseek command line settings and the Foldseek webserver:

  1. Different versions of the database: Currently the webserver is still running an older version of the PDB. The latest available through the databases download module is from 2023-05-17 iirc.
  2. Different parameters: The Foldseek webserver uses the following command-line parameters: --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1. Of these, --prefilter-mode 1 is the most relevant. We introduced it very recently as a different kind of prefiltering algorithm that more closely resembles our earlier software, such as HHblits: each query structure is aligned against all target structures with an ungapped alignment. This is faster for single-structure searches than our normal prefiltering algorithm, which in turn is faster when searching many structures against many.
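
For scripted runs, the server-like invocation can be assembled from the parameters listed above. This is only a sketch: the helper name `server_like_command` and the file names passed to it are placeholders, and the command is built and printed here, not executed.

```python
import shlex

# Parameters the webserver is reported to use (copied from the list above).
SERVER_FLAGS = [
    "--tmscore-threshold", "0.3",
    "--max-seqs", "1000",
    "-e", "10",
    "-s", "9.5",
    "--prefilter-mode", "1",
]


def server_like_command(query, target_db, result, tmp_dir):
    """Return the argument list for a server-like easy-search run.

    Hypothetical helper: it only concatenates the positional arguments
    and the flags above; it does not check that any file exists.
    """
    return ["foldseek", "easy-search", query, target_db, result, tmp_dir,
            *SERVER_FLAGS]


cmd = server_like_command("AF-P64850-F1-model_v4.pdb", "pdb", "aln", "tmp")
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where foldseek is installed.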

milot-mirdita avatar Jun 25 '23 05:06 milot-mirdita

Hi @milot-mirdita , thanks for the explanation!

So there are two factors now:

  1. Different PDB version -> I checked the top hits from the web server (1ISV, 1MC9 and 1KNM), which are old structures deposited in 2002, so this does not seem to be an important factor in this case.

  2. Different parameters -> provided that the top hits are also present in the database of the standalone version, I wonder how it actually missed the top hits found by the web server. Does --prefilter-mode 1 just increase the speed? For the other flags you mentioned, I cannot see how they might have caused the standalone version to miss the top hits.

johnnytam100 avatar Jun 27 '23 01:06 johnnytam100

I have just tested these options with the following command, using a freshly downloaded foldseek PDB database.

foldseek easy-search --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1 AF-P64850-F1-model_v4.pdb ./pdb AF-P64850-F1-model_v4_test-foldseek_server-like tmpfolder

However, the top hits still do not include the top hits from the server.

Top 5 hits sorted in descending order of probability:

image

johnnytam100 avatar Jun 27 '23 02:06 johnnytam100

Um... please let us know what the difference is if you know; otherwise it will be hard to use the standalone version.

johnnytam100 avatar Jul 01 '23 05:07 johnnytam100

Please provide the pdb database and foldseek version used.

martin-steinegger avatar Jul 01 '23 06:07 martin-steinegger

The server PDB database is not updated, which probably explains the differences. However, to check whether your results are consistent with ours, please try the following. Here is what I produced with foldseek commit 522d883fedec3ab395edd87ebe002b2b37c6ba72 and the following PDB (current version, downloaded through the databases command):

f355c78925d62c8e0388ca6f594d5ec2  pdb100.tar.gz
2023-06-20      PDB_DATE
d6743f3252147ce039abc1375afed2d03766115d        FOLDSEEK_COMMIT

Command:

foldseek databases PDB pdb tmp
wget https://alphafold.ebi.ac.uk/files/AF-P64850-F1-model_v4.pdb
foldseek easy-search AF-P64850-F1-model_v4.pdb pdb aln tmp --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1

Top 10 results:

AF-P64850-F1-model_v4.pdb	1knl_A	0.246	123	89	0	38	156	4	126	3.109E-09	356
AF-P64850-F1-model_v4.pdb	5gqd_A	0.233	121	89	0	38	154	305	425	1.859E-08	330
AF-P64850-F1-model_v4.pdb	1v6v_B	0.230	136	99	0	27	156	301	436	1.172E-08	329
AF-P64850-F1-model_v4.pdb	5gqe_B	0.229	128	95	0	33	156	302	429	1.393E-08	327
AF-P64850-F1-model_v4.pdb	1xyf_A	0.233	121	89	0	38	154	305	425	8.330E-08	304
AF-P64850-F1-model_v4.pdb	3a23_B	0.176	127	99	0	36	156	486	612	1.571E-07	303
AF-P64850-F1-model_v4.pdb	2d1z_B	0.240	121	88	0	38	154	305	425	1.049E-07	295
AF-P64850-F1-model_v4.pdb	4owk_E	0.186	128	98	0	35	156	3	130	2.480E-08	282
AF-P64850-F1-model_v4.pdb	4g1r_C	0.188	121	91	0	35	155	1	113	4.189E-07	278
AF-P64850-F1-model_v4.pdb	2vlc_A	0.204	125	96	0	36	156	265	389	1.763E-07	272

Can you reproduce this on your side?

martin-steinegger avatar Jul 01 '23 07:07 martin-steinegger

Hi Martin. May I know how to check the foldseek and PDB version?

You said the web server database is not updated but how does it cause the standalone to miss the top hits captured by the web server?

My results using your options:

Screenshot_20230701_182903_Chrome

johnnytam100 avatar Jul 01 '23 09:07 johnnytam100

To add more detail about the situation: the top hits from the web server include 1ISV, 1MC9 and 1KNM. Visualizing their structural alignments with the query shows that they are high-quality true positive hits, yet the standalone version missed them. So this is not merely a mismatch of results caused by different versions or options: in this particular case, the web server outperformed the standalone version at recalling true positive hits from the database, which is the problem.

johnnytam100 avatar Jul 01 '23 15:07 johnnytam100

The differences in the database explain this. The PDB100 is not guaranteed to contain all the chains since it is clustered. Chains 1isv, 1mc9, and 1knm are not cluster representatives in the current PDB100.

To check that the stand-alone version is not broken, I ran the newest Foldseek binary against the PDB100 on the webserver and obtained the following result list. This result is consistent with the webserver. See the results here.

AF-P64850-F1-model_v4.pdb       1mc9_A  0.250   125     90      0       36      156     2       126     1.234E-08       345
AF-P64850-F1-model_v4.pdb       1isv_A  0.230   136     99      0       27      156     301     436     4.903E-09       338
AF-P64850-F1-model_v4.pdb       1knm_A  0.257   128     92      0       32      156     2       129     1.307E-08       336
...

❗Please use the --cluster-search 1 parameter to resolve all members from the detected PDB100 clusters.

foldseek easy-search AF-P64850-F1-model_v4.pdb pdb aln tmp --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1 --cluster-search 1

By using the --cluster-search parameter, I was able to find 1isv, 1mc9, 1knm, and many more.
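
A quick way to confirm that the expected hits are present in a result file is to scan the target column. The sketch below assumes target IDs have the `pdbid_chain` form shown in the listings above (for example `1mc9_A`); the function name `missing_hits` is made up for illustration, and the two example rows are copied from the result list above.

```python
def missing_hits(alignment_lines, expected_pdb_ids):
    """Return the expected PDB IDs that never appear as a target.

    Assumes whitespace-separated columns with the target in column 2
    and target IDs of the form "<pdbid>_<chain>".
    """
    found = set()
    for line in alignment_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        # Keep only the PDB ID in front of the chain suffix.
        found.add(fields[1].split("_")[0].lower())
    return [pdb_id for pdb_id in expected_pdb_ids
            if pdb_id.lower() not in found]


example = [
    "AF-P64850-F1-model_v4.pdb 1mc9_A 0.250 125 90 0 36 156 2 126 1.234E-08 345",
    "AF-P64850-F1-model_v4.pdb 1isv_A 0.230 136 99 0 27 156 301 436 4.903E-09 338",
]
# 1knm is deliberately left out of the example rows, so it is reported.
print(missing_hits(example, ["1isv", "1mc9", "1knm"]))
```

For a real run, the lines would come from the result file, e.g. `missing_hits(open("aln"), ["1isv", "1mc9", "1knm"])`.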

martin-steinegger avatar Jul 02 '23 03:07 martin-steinegger

I see. But then the problem moves to the clustering -> I wonder why PDB100 does not include a representative chain that gives an alignment quality as high as 1isv, 1mc9 or 1knm. Does foldseek provide the clustering info? I would like to check it.

johnnytam100 avatar Jul 02 '23 07:07 johnnytam100

Yes, please check pdb100_clu. You can convert it to a human-readable file using the following command.

foldseek createtsv pdb100 pdb100 pdb100_clu pdb100_clu.tsv

The first column is the representative id, and the second is the member id.
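
Once the TSV exists, looking up which cluster a chain landed in takes only a few lines. This is a minimal sketch that relies on the two-column layout described above (representative, then member) and assumes IDs of the form `pdbid_chain`; the function name `representatives_of` and the rows in `demo` are invented purely to show the layout.

```python
import csv
import io


def representatives_of(tsv_file, pdb_id):
    """Map every member chain of `pdb_id` to its cluster representative.

    `tsv_file` is any open text file with tab-separated columns:
    column 1 = representative ID, column 2 = member ID.
    """
    result = {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank lines
        representative, member = row[0], row[1]
        if member.split("_")[0].lower() == pdb_id.lower():
            result[member] = representative
    return result


# Invented rows: REP1_A and REP2_B are placeholders, not real cluster data.
demo = io.StringIO("REP1_A\t1isv_A\nREP1_A\tREP1_A\nREP2_B\tREP2_B\n")
print(representatives_of(demo, "1isv"))
```

On real data this would be called as `representatives_of(open("pdb100_clu.tsv"), "1isv")`.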

martin-steinegger avatar Jul 02 '23 10:07 martin-steinegger

Hi Martin. Thank you so much for helping. After checking the clustering info, I found no problems.

However, I then discovered that it is actually the prob reported by the standalone version that is underestimated. I grepped 1isv from the results generated by your command with --cluster-search 1, run with the latest foldseek version and the latest PDB version (both installed and downloaded on 20230703).

The prob for 1isv reported by the standalone version is 0.230:

image

However, the server reported 1.00:

image

johnnytam100 avatar Jul 03 '23 01:07 johnnytam100

You need to add prob to the output fields. What you see in the third column is the sequence identity. By default, we show the following fields: query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits.

You can do this using the following command:

foldseek easy-search AF-P64850-F1-model_v4.pdb pdb aln tmp --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1 --cluster-search 1 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,prob
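
To avoid misreading columns, each result line can be paired with its field names before inspecting it. The sketch below assumes the output is tab-separated and uses the default field list quoted above; `label_row` is a hypothetical helper, and the example row is the 1isv line from the earlier result list.

```python
# Default output fields, as quoted in the comment above.
DEFAULT_FIELDS = (
    "query,target,fident,alnlen,mismatch,gapopen,"
    "qstart,qend,tstart,tend,evalue,bits"
).split(",")


def label_row(line, fields):
    """Pair each tab-separated column of one result line with its field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} columns, got {len(values)}")
    return dict(zip(fields, values))


row = ("AF-P64850-F1-model_v4.pdb\t1isv_A\t0.230\t136\t99\t0\t"
       "27\t156\t301\t436\t4.903E-09\t338")
labelled = label_row(row, DEFAULT_FIELDS)
# The third column is fident (sequence identity), not prob.
print(labelled["fident"], labelled["evalue"], labelled["bits"])
```

For output produced with the --format-output list in the command above, pass `DEFAULT_FIELDS + ["prob"]` so that the last column is labelled prob.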

martin-steinegger avatar Jul 03 '23 02:07 martin-steinegger

Oops! Sorry for mixing up the columns! Unfortunately, I get an "Error: Convert Alignments died" error if I include --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,prob

image

johnnytam100 avatar Jul 03 '23 12:07 johnnytam100

20230724 update: same error obtained using conda environment python=3.11 with

  1. foldseek installed with conda install -c conda-forge -c bioconda foldseek

  2. pdb downloaded with foldseek databases PDB pdb tmp

  3. foldseek run with foldseek easy-search AF-P64850-F1-model_v4.pdb pdb aln tmp --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1 --cluster-search 1 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,prob

image

johnnytam100 avatar Jul 24 '23 10:07 johnnytam100

Okay, we should now have a proper pdb100 where everything should work. In order to use it please update foldseek and re-download the PDB. Sorry for the delay, it took a bit of time to get it running.

martin-steinegger avatar Aug 18 '23 16:08 martin-steinegger