trimal
trimal copied to clipboard
Improve performance of `Similarity::calculateVectors`
Hi there!
While working on althonos/pytrimal I did some thorough profiling of the code, and I identified some critical sections that could be improved. In particular, I noticed that the code in Similarity::calculateVectors
was sub-optimal, because it was repeatedly calling similarityMatrix::getDistance
with the same sequence characters, and the check for invalid/incorrect symbols seems to have a high performance impact.
To fix this, I added two buffers to store column data; the first one for the sequence itself, storing uppercase column characters to reduce the number of utils::toUpper
calls; the other one to store the indices of gapped/indeterminate characters. The sequence characters for a column are checked once when the column is copied; after that, the distance matrix is indexed directly, without checking character ranges.
I used valgrind
to count cycles on a run of trimAl in strict mode on example.073.AA.strNOG.ENOG411BFCW.fasta
, here are the results in number of cycles:
Object | cycles (before PR) | cycles (after PR) |
---|---|---|
Similarity::calculateVectors (self) |
3,132,810,840 | 1,652,854574 |
Similarity::calculateVectors (incl) |
8,281,641,990 | 3,240,879,580 |
trimal (total) |
13,504,155,928 | 8,463,690,891 |