blast alignment time up to 5X slower from v2.0.11 to v2.0.12

Open • salvoc81 opened this issue 4 years ago • 4 comments

Hello @bbuchfink and thank you for developing Diamond.

I recently tested the latest release (v2.0.12) and I am seeing considerably longer execution times.

I perform alignments between complete proteomes, hence the query and target DBs are relatively small.

This is the command I am using: diamond blastp --query myproteome.faa --db myproteome.dmnd --out blast.myproteome.tsv -p 1 --mid-sensitive --comp-based-stats 1 --quiet -f 6 qseqid sseqid qstart qend sstart send bitscore --max-hsps 20

In this example, myproteome.faa contains 50190 proteins (it is a eukaryotic proteome). The execution times with the two versions of Diamond were:

  • v2.0.11: 400"
  • v2.0.12: 1800"
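
For anyone who wants to reproduce the comparison, here is a minimal timing sketch in Python. The helper names `time_command` and `slowdown` are made up for illustration, and the binary path in the comment is a hypothetical placeholder; the argument list is copied from the command in this report.

```python
import subprocess
import time

# Arguments copied from the command in this report.
BLASTP_ARGS = [
    "blastp", "--query", "myproteome.faa", "--db", "myproteome.dmnd",
    "--out", "blast.myproteome.tsv", "-p", "1", "--mid-sensitive",
    "--comp-based-stats", "1", "--quiet",
    "-f", "6", "qseqid", "sseqid", "qstart", "qend", "sstart", "send", "bitscore",
    "--max-hsps", "20",
]

def time_command(cmd):
    """Run cmd (a list of strings) and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def slowdown(slow_seconds, fast_seconds):
    """Return how many times slower the slow run is than the fast one."""
    return slow_seconds / fast_seconds

# Hypothetical usage, one call per installed version:
#   seconds = time_command(["/path/to/diamond"] + BLASTP_ARGS)

# Ratio of the two timings listed above (1800" and 400").
print(slowdown(1800, 400))  # 4.5
```

A ratio of 4.5 is consistent with the "up to 5X" in the title.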

NOTES:

  • Running with or without --max-hsps 20 does not make much difference

I am using a server with an AMD EPYC Milan CPU. I can confirm that the problem occurs with both the pre-compiled binaries and binaries compiled from source.

Thanks a lot for your help.

salvoc81, Oct 09 '21 09:10

I can't confirm this issue when testing your command line with an A. thaliana proteome. Could you send me your input file so I can check this?

bbuchfink, Oct 09 '21 10:10

Done!

salvoc81, Oct 09 '21 12:10

I'm seeing the same effect on your data. The difference is due to masking seeds based on complexity instead of frequency, which was introduced in this version. Your dataset seems very repetitive, which caused the frequency-based masking to throw out a lot more seeds.

You can get the old behaviour back using --freq-masking. You should note the difference in sensitivity though. When using -k0 (reporting all hits), v2.0.12 found 19683055 alignments while v2.0.11 only found 11462957.
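
To put the two counts in perspective, the short Python sketch below computes their difference and ratio, and builds the argument list for a rerun with --freq-masking. The helper name `with_options` and the abbreviated base argument list are illustrative assumptions; only the flags themselves (--freq-masking, -k0) and the file names come from this thread.

```python
# Alignment counts quoted above for the -k0 runs.
HITS_V2_0_12 = 19683055
HITS_V2_0_11 = 11462957

ratio = HITS_V2_0_12 / HITS_V2_0_11
extra = HITS_V2_0_12 - HITS_V2_0_11
print(f"v2.0.12 reports {extra} more alignments ({ratio:.2f}x)")

def with_options(base_args, *options):
    """Return a new argument list with extra options appended (hypothetical helper)."""
    return list(base_args) + list(options)

# Base arguments abbreviated for illustration; file names are from the original report.
base = ["diamond", "blastp", "--query", "myproteome.faa", "--db", "myproteome.dmnd"]
print(" ".join(with_options(base, "-k0", "--freq-masking")))
```

Comparing the hit counts of both runs on the same input shows how much sensitivity is traded for speed on a given dataset.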

Also, are you sure you want to use --max-hsps as opposed to --max-target-seqs?

bbuchfink, Oct 09 '21 14:10

Hello @bbuchfink, sorry for my late reply, but I wanted to run more extensive tests before replying.

> Also, are you sure you want to use --max-hsps as opposed to --max-target-seqs?

Basically, I use non-overlapping HSPs to increase a score that I use to infer orthologs.
When I first read the changelog last weekend, I suspected that performance might be affected when using --max-hsps. I will see what I can do to mitigate the effect (using a single HSP reduces the overall recall by ~10%).
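
As an illustration of why several HSPs per target matter here, the Python sketch below sums the bitscores of non-overlapping HSPs for each query/target pair. It reads rows in the column order of the -f 6 output used in this issue (qseqid sseqid qstart qend sstart send bitscore). The greedy selection, the function names, and the toy values are assumptions made for the example; the thread does not describe the scoring scheme actually used for ortholog inference.

```python
from collections import defaultdict

def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def pair_scores(rows):
    """Sum bitscores of greedily chosen non-overlapping HSPs per (query, target).

    rows: iterables of (qseqid, sseqid, qstart, qend, sstart, send, bitscore).
    """
    by_pair = defaultdict(list)
    for qseqid, sseqid, qstart, qend, sstart, send, bitscore in rows:
        by_pair[(qseqid, sseqid)].append(
            (float(bitscore), int(qstart), int(qend), int(sstart), int(send))
        )

    scores = {}
    for pair, hsps in by_pair.items():
        kept = []
        # Highest-scoring HSPs first; keep one only if it overlaps no kept HSP
        # on either the query or the target.
        for score, qs, qe, ss, se in sorted(hsps, reverse=True):
            if all(
                not overlaps(qs, qe, kqs, kqe) and not overlaps(ss, se, kss, kse)
                for _, kqs, kqe, kss, kse in kept
            ):
                kept.append((score, qs, qe, ss, se))
        scores[pair] = sum(h[0] for h in kept)
    return scores

# Toy rows (made-up values) in the same column order as the TSV output.
toy = [
    ("q1", "t1", 1, 100, 1, 100, 200.0),
    ("q1", "t1", 150, 250, 140, 240, 120.0),  # no overlap with the first: counted
    ("q1", "t1", 50, 120, 60, 130, 80.0),     # overlaps the first: skipped
]
print(pair_scores(toy))  # {('q1', 't1'): 320.0}
```

With a single HSP per target the toy pair would score 200.0 instead of 320.0, which is the kind of difference that can change which pairs pass a score threshold. Real rows can be obtained by splitting each line of blast.myproteome.tsv on tabs.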

Thanks a lot, your reply was very helpful.

salvoc81, Oct 11 '21 01:10