blast alignment time up to 5X slower from v2.0.11 to v2.0.12
Hello @bbuchfink and thank you for developing Diamond.
I have recently tested the latest update (v2.0.12) and I am obtaining considerably slower execution times.
I perform alignments between complete proteomes, so the query and target databases are relatively small.
This is the command I am using:
diamond blastp --query myproteome.faa --db myproteome.dmnd --out blast.myproteome.tsv -p 1 --mid-sensitive --comp-based-stats 1 --quiet -f 6 qseqid sseqid qstart qend sstart send bitscore --max-hsps 20
In this example, myproteome.faa contains 50190 proteins (it is a eukaryotic proteome).
The execution times with the two versions of Diamond were:
- v2.0.11: 400 s
- v2.0.12: 1800 s
NOTES:
- Using --max-hsps 20 or not does not make much difference.
- I am using a server with an AMD EPYC Milan CPU. I can confirm the problem happens with both the pre-compiled binaries and binaries I compiled myself.
Thanks a lot for your help.
I can't reproduce this issue when testing your command line with an A. thaliana proteome. Could you send me your input file so I can check?
Done!
I'm seeing the same effect on your data. The difference is due to masking seeds based on complexity instead of frequency, which was introduced in this version. Your dataset seems to be very repetitive, which caused the frequency-based masking to throw out a lot more seeds.
You can get the old behaviour back using --freq-masking. You should note the difference in sensitivity, though: when using -k0 (reporting all hits), version 2.0.12 found 19683055 alignments while v2.0.11 only found 11462957.
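For example, appending the flag to your original command (everything else left unchanged) should give you the 2.0.11-style seed masking:
diamond blastp --query myproteome.faa --db myproteome.dmnd --out blast.myproteome.tsv -p 1 --mid-sensitive --comp-based-stats 1 --quiet -f 6 qseqid sseqid qstart qend sstart send bitscore --max-hsps 20 --freq-masking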
Also, are you sure you want to use --max-hsps as opposed to --max-target-seqs?
Hello @bbuchfink, sorry for my late reply; I wanted to run more extensive tests before getting back to you.
> Also, are you sure you want to use --max-hsps as opposed to --max-target-seqs?
Basically, I use non-overlapping HSPs to increase a score that I use to infer orthologs.
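To give a rough idea, this is more or less what I do with the tabular output (a simplified sketch, not my actual code; the function names and the greedy selection of non-overlapping HSPs are just for illustration, and the columns are assumed to follow the -f 6 fields from my command above):

```python
# Sketch: per query/subject pair, keep a set of mutually non-overlapping HSPs
# (greedily, best bitscore first) and sum their bitscores.
# Assumed column order: qseqid sseqid qstart qend sstart send bitscore
import csv
from collections import defaultdict


def overlaps(a, b):
    """True if two (start, end) intervals overlap; handles reversed coordinates."""
    a = (min(a), max(a))
    b = (min(b), max(b))
    return a[0] <= b[1] and b[0] <= a[1]


def combined_scores(tsv_path):
    hsps = defaultdict(list)  # (qseqid, sseqid) -> [(query interval, subject interval, bitscore)]
    with open(tsv_path) as fh:
        for qseqid, sseqid, qstart, qend, sstart, send, bitscore in csv.reader(fh, delimiter="\t"):
            hsps[(qseqid, sseqid)].append(
                ((int(qstart), int(qend)), (int(sstart), int(send)), float(bitscore))
            )

    scores = {}
    for pair, rows in hsps.items():
        rows.sort(key=lambda r: r[2], reverse=True)  # consider the strongest HSP first
        kept = []
        for qiv, siv, score in rows:
            # Keep an HSP only if it overlaps none of the already kept HSPs
            # on either the query or the subject.
            if all(not overlaps(qiv, kq) and not overlaps(siv, ks) for kq, ks, _ in kept):
                kept.append((qiv, siv, score))
        scores[pair] = sum(s for _, _, s in kept)
    return scores
```

The greedy selection is just one simple way to pick a mutually non-overlapping set of HSPs.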
When I first read the changelog last weekend, I suspected that performance would be affected when using --max-hsps.
I will see what I can do to mitigate the effect (using a single HSP reduces the overall recall by ~10%).
Thanks a lot, your reply was very helpful.