hh-suite
conda hhsuite returns different results after every run
Expected Behavior
hhblits should return the same .hhr output on every run.
Current Behavior
Every time I run hhblits (with fixed parameters) on a set of 500 proteins, I get different results. The differences are qualitative rather than quantitative: some hits are missing for a few proteins (about 5-20 out of 500).
Steps to Reproduce (for bugs)
This is very hard to reproduce because the hits seem to go missing randomly. Every run is different. Is there any source of randomness in the code? Could it be that some work is missing from some CPUs? It may be a coincidence, but I seem to get more errors when using more CPUs.
My Environment
The issue persists on both Mac and Linux. I installed hhsuite from conda using:
conda create -n cov-env -c conda-forge -c bioconda hhsuite
Could you post the full command that you ran? Do you have a small example that we could try to reproduce?
Thank you @milot-mirdita for having a look at it. Here is the example. I hope you will be able to reproduce the issue if you run the script below twice and compare the outputs.
# Search parameters
e=0.0001; n=2; mact=0.35; p=50; z=0; b=0
INPUT_PATH=[path_to_split_sequences]/
OUTPUT_PATH=output_path
DATABASE_PATH=[path_to_database_attached]/all_proteins
CPU=4

echo 'input: ' "$INPUT_PATH"
echo 'output: ' "$OUTPUT_PATH"
echo 'database:' "$DATABASE_PATH"

mkdir -p "$OUTPUT_PATH"
# Run hhblits once per FASTA file and write one .hhr file per query
for file in "$INPUT_PATH"/*.fasta; do
    output_file=$(basename "$file" .fasta)
    output_file_hhr="${output_file}.hhr"
    output_path_hhr="${OUTPUT_PATH}/${output_file_hhr}"
    hhblits -v 0 -i "$file" -d "$DATABASE_PATH" -o "$output_path_hhr" -e $e -n $n -cpu $CPU -mact $mact -p $p -z $z -b $b
done

Attachments: database.zip, sequences.txt
After some investigation, I think the issue has to do with how work is distributed across CPUs. There are generally more differences when using more CPUs, and the results are the same (at least usually) when using CPU=1. I am wondering whether this is a bug in hhsuite or whether something else controls the coordination between CPUs.
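To narrow down which proteins are affected, the outputs of two runs can be compared by hit count. The following is a minimal sketch, not part of hhsuite. It assumes that every hit in an .hhr file has an alignment block containing a line that starts with ">", and the function names `count_hits` and `compare_runs` and the directory names are hypothetical.

```shell
#!/bin/sh
# Sketch: list .hhr files whose hit count differs between two run directories.
# Assumption: each hit contributes exactly one line starting with ">".

# Print the number of hits in one .hhr file.
count_hits() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Print "<file> <hits in run 1> <hits in run 2>" for every file that differs.
compare_runs() {
    for f in "$1"/*.hhr; do
        [ -e "$f" ] || continue
        name=$(basename "$f")
        other="$2/$name"
        if [ ! -e "$other" ]; then
            echo "$name missing_in_second_run"
            continue
        fi
        n1=$(count_hits "$f")
        n2=$(count_hits "$other")
        [ "$n1" -eq "$n2" ] || echo "$name $n1 $n2"
    done
}
```

For example, `compare_runs output_cpu1 output_cpu4` would list only the queries whose number of hits changed between a single-threaded and a multi-threaded run, which gives a short list of candidates to inspect by hand.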
We found an issue where the results could come out in a different order during multi-threaded execution. However, I couldn't observe a case where a result would be completely missing. Can you point me to a specific sequence in the set you sent me where this happens?
Could you check whether the issue still occurs with the commit I just pushed (f08506d)? I don't have a way to reproduce missing hits, but the order should stay consistent now.
Thank you, I will try the new version. I am also attaching two .hhr files that differed after two runs (they actually had different numbers of hits). I am not sure whether it will be reproducible, though, as those differences seem to happen randomly.
Attachment: protein48.zip
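For a single pair of files like this, it helps to separate a pure reordering of hits from hits that are actually missing. The sketch below is one way to do that under the assumption that the template name is the first word after ">" in each hit's alignment block. The function names `hhr_hits` and `classify_hhr` are hypothetical, and the comparison looks only at hit names, not at scores or alignments.

```shell
#!/bin/sh
# Sketch: classify the difference between two .hhr files from repeated runs.
# Assumption: each hit has a line starting with ">" followed by the template name.

# Print the template name of every hit, in file order.
hhr_hits() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Print "identical_hits", "reordered", or "different_hits".
classify_hhr() {
    a=$(hhr_hits "$1")
    b=$(hhr_hits "$2")
    if [ "$a" = "$b" ]; then
        echo "identical_hits"      # same names in the same order
    elif [ "$(printf '%s\n' "$a" | sort)" = "$(printf '%s\n' "$b" | sort)" ]; then
        echo "reordered"           # same names, different order
    else
        echo "different_hits"      # at least one hit missing or extra
    fi
}
```

Calling `classify_hhr run1/query.hhr run2/query.hhr` (placeholder paths) on the two copies of the same query would then show whether the pair differs only in order or in the set of hits.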
Is this the same database?
This GCF_000886155.1_ViralProj42781_genomic_phanotate_55_geneCal sequence seems to have only three hits when I try to run it locally.
That was a bigger database, sorry for the confusion. Attaching the bigger data set and database.
Attachments: single_seq.zip, database.zip
I think the database never finished uploading, could you check again?
Re-uploaded!
From the file sizes, this looks to be the same database as in the initial post.
I get the same number of hits with GCF_000886155.1_ViralProj42781_genomic_phanotate_55_geneCal.
I am triple-checking, and with both database files I actually get more than 3 hits (attaching hhr). Could this be due to a different hhblits version? I am using HHblits 3.1.0. This sounds like an important point, though: maybe there is something more fundamental that makes my results wrong.