conda hhsuite returns different results after every run

Open bognabognabogna opened this issue 4 years ago • 12 comments

Expected Behavior

hhblits returning the same .hhr outputs after every run

Current Behavior

After every run of hhblits (with fixed parameters) on a set of 500 proteins I get different results. The outputs differ qualitatively rather than quantitatively: some hits are missing for a few proteins (about 5-20 out of 500 proteins).

Steps to Reproduce (for bugs)

This is very hard to reproduce because the hits seem to go missing randomly. Every run is different. Is there any source of randomness in the code? Could it be that some work is missing from some CPUs? It may be a coincidence, but I seem to get more errors when using more CPUs.

My Environment

The issue persists on both Mac and Linux. I installed conda hhsuite using:

conda create -n cov-env -c conda-forge -c bioconda hhsuite

bognabognabogna avatar Apr 06 '20 09:04 bognabognabogna

Could you post the full command that you ran? Do you have a small example that we could try to reproduce?

milot-mirdita avatar Apr 06 '20 10:04 milot-mirdita

Thank you @milot-mirdita for having a look at it. Here is the example. I hope you will be able to reproduce the issue if you run the script below twice and compare the outputs.

e=0.0001;n=2;mact=0.35;p=50;z=0;b=0;
INPUT_PATH=[path_to_split_sequences]/
OUTPUT_PATH=output_path
DATABASE_PATH=[path_to_database_attached]/all_proteins

database.zip

CPU=4

echo 'input: ' $INPUT_PATH
echo 'output: ' $OUTPUT_PATH
echo 'database:' $DATABASE_PATH

mkdir $OUTPUT_PATH
for file in "$INPUT_PATH"/*.fasta; do
    output_file=$(basename $file .fasta)
    output_file_hhr="${output_file}.hhr"
    output_path_hhr="${OUTPUT_PATH}${output_file_hhr}"
    hhblits -v 0 -i $file -d $DATABASE_PATH -o $output_path_hhr -e $e -n $n -cpu $CPU -mact $mact -p $p -z $z -b $b;
done

sequences.txt
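Comparing the outputs of two runs by hand is tedious for 500 proteins, so here is a minimal Python sketch that counts the hits in each .hhr file of two output directories and reports the files whose counts differ. It assumes the summary table in an .hhr file starts after a header line whose first two fields are "No" and "Hit" and ends at the first blank line; the function names `parse_hit_table` and `compare_runs` are hypothetical helpers, not part of hh-suite.

```python
from pathlib import Path


def parse_hit_table(hhr_text):
    """Return the first name token of every hit in the .hhr summary table."""
    hits = []
    in_table = False
    for line in hhr_text.splitlines():
        if not in_table:
            # Assumption: the table follows a header line starting "No Hit".
            if line.split()[:2] == ["No", "Hit"]:
                in_table = True
            continue
        if not line.strip():
            break  # a blank line ends the summary table
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            hits.append(fields[1])  # first token of the hit name
    return hits


def compare_runs(dir_a, dir_b):
    """Map file name -> (hits in run A, hits in run B) where the counts differ.

    A file missing from run B is reported with None as its second count.
    """
    differences = {}
    for path_a in sorted(Path(dir_a).glob("*.hhr")):
        count_a = len(parse_hit_table(path_a.read_text()))
        path_b = Path(dir_b) / path_a.name
        if not path_b.exists():
            differences[path_a.name] = (count_a, None)
            continue
        count_b = len(parse_hit_table(path_b.read_text()))
        if count_a != count_b:
            differences[path_a.name] = (count_a, count_b)
    return differences
```

Calling `compare_runs("run1", "run2")` on two output directories (the directory names are placeholders) would return an empty dict when every file has the same number of hits in both runs.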

bognabognabogna avatar Apr 07 '20 16:04 bognabognabogna

After some investigation, I think the issue is related to how work is distributed across CPUs. There are generally more differences when using more CPUs, and the results are the same (at least usually) when using CPU=1. I am wondering whether this is a bug in hhsuite or whether something else controls the coordination between CPUs.
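One way to test the CPU hypothesis is to run the same query with several `-cpu` values while keeping every other parameter fixed. The sketch below builds the hhblits command from the flags and values used in the script above; the helper names `build_hhblits_command` and `run_cpu_sweep`, the CPU counts, and the output naming scheme are assumptions for illustration only.

```python
import subprocess


def build_hhblits_command(query, database, output, cpu):
    """Build an hhblits argument list in which only -cpu (and -o) vary."""
    return [
        "hhblits", "-v", "0",
        "-i", str(query), "-d", str(database), "-o", str(output),
        "-e", "0.0001", "-n", "2", "-cpu", str(cpu),
        "-mact", "0.35", "-p", "50", "-z", "0", "-b", "0",
    ]


def run_cpu_sweep(query, database, out_prefix, cpu_counts=(1, 2, 4, 8)):
    """Run the same search once per CPU count; return {cpu: output path}."""
    outputs = {}
    for cpu in cpu_counts:
        # Hypothetical naming scheme: one .hhr file per CPU count.
        output = f"{out_prefix}.cpu{cpu}.hhr"
        command = build_hhblits_command(query, database, output, cpu)
        subprocess.run(command, check=True)  # requires hhblits on PATH
        outputs[cpu] = output
    return outputs
```

If the outputs for `-cpu 1` stay identical across repeated runs while those for higher counts do not, that would point towards the multi-threaded code path rather than the input data.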

bognabognabogna avatar Apr 14 '20 09:04 bognabognabogna

We found an issue where the results could have a different order during multi-threaded execution. However, I couldn’t observe a case where a result would be completely missing. Can you point me to a specific sequence in the set you sent me where this happens?
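To tell a pure ordering difference apart from genuinely missing hits, the hit lists of two runs can be compared as multisets. This is a minimal sketch that takes two lists of hit names (however they were extracted from the .hhr files); `classify_difference` and `missing_hits` are hypothetical helper names and do not reflect how hh-suite itself handles result ordering.

```python
from collections import Counter


def classify_difference(hits_a, hits_b):
    """Classify two hit lists as identical, reordered, or truly different."""
    if hits_a == hits_b:
        return "identical"
    if Counter(hits_a) == Counter(hits_b):
        return "reordered"  # same hits, different order
    return "different hits"


def missing_hits(hits_a, hits_b):
    """Return hits present in run A but absent from run B, sorted by name."""
    return sorted((Counter(hits_a) - Counter(hits_b)).elements())
```

A result of "reordered" would match the ordering issue described above, while "different hits" together with a non-empty `missing_hits` list would identify a sequence worth reporting.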

milot-mirdita avatar Apr 15 '20 16:04 milot-mirdita

Could you check again with the commit (f08506d) I just pushed if the issue keeps happening? I don't have a way to reproduce missing hits, however the order should stay consistent now.

milot-mirdita avatar Apr 16 '20 15:04 milot-mirdita

Thank you, I will try the new version. I am also attaching two .hhr files that differed after two runs (they actually had different numbers of hits). I am not sure, though, if it will be reproducible, as those differences seem to happen randomly.

protein48.zip

bognabognabogna avatar Apr 17 '20 12:04 bognabognabogna

Is this the same database? This GCF_000886155.1_ViralProj42781_genomic_phanotate_55_geneCal sequence seems to have only three hits when I try to run it locally.

milot-mirdita avatar Apr 17 '20 13:04 milot-mirdita

That was a bigger database, sorry for the confusion. Attaching the bigger data set and the database:

single_seq.zip database.zip

bognabognabogna avatar Apr 17 '20 14:04 bognabognabogna

I think the database never finished uploading, could you check again?

milot-mirdita avatar Apr 18 '20 10:04 milot-mirdita

re-uploaded!

bognabognabogna avatar Apr 20 '20 13:04 bognabognabogna

From the file sizes, this looks to be the same database as in the initial post. I get the same number of hits with GCF_000886155.1_ViralProj42781_genomic_phanotate_55_geneCal.

milot-mirdita avatar Apr 20 '20 15:04 milot-mirdita

I am triple-checking, and with both database files I actually get more than 3 hits (attaching hhr). Could that be due to a different hhblits version? I am using HHblits 3.1.0. But this sounds like an important point. Maybe there is something more fundamental that makes my results wrong.

protein48.hhr.zip

bognabognabogna avatar Apr 21 '20 10:04 bognabognabogna