Improve performance of `Similarity::calculateVectors`

Open althonos opened this issue 2 years ago • 0 comments

Hi there!

While working on althonos/pytrimal I did some thorough profiling of the code and identified some critical sections that could be improved. In particular, I noticed that the code in `Similarity::calculateVectors` was sub-optimal: it repeatedly calls `similarityMatrix::getDistance` with the same sequence characters, and the check for invalid/incorrect symbols seems to have a high performance impact.

To fix this, I added two buffers to store column data: the first holds the sequence itself, storing uppercase column characters to reduce the number of `utils::toUpper` calls; the second holds the indices of gapped/indeterminate characters. The sequence characters for a column are checked once, when the column is copied; after that, the distance matrix is indexed directly, without checking character ranges.
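To make the idea concrete, here is a minimal standalone sketch of the buffering scheme. It is not the code from the PR: `DistanceTable`, `ColumnBuffer`, `loadColumn` and `sumColumnDistances` are hypothetical names, the 26 x 26 table layout is an assumption, the gapped/indeterminate positions are stored as a per-sequence flag rather than in whatever form the PR uses, and the plain pairwise sum stands in for the actual computation done by `Similarity::calculateVectors`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the distance matrix: 26 x 26 entries indexed by
// (letter - 'A'). The real similarityMatrix has its own layout.
struct DistanceTable {
    std::vector<float> d = std::vector<float>(26 * 26, 0.0f);
    float at(int a, int b) const { return d[a * 26 + b]; }
};

// Per-column buffers, filled once per column and reused for every pair.
struct ColumnBuffer {
    std::vector<int>  index;  // residue index in [0, 26), or -1 when skipped
    std::vector<char> skip;   // 1 for gap / indeterminate / invalid symbols
};

// Copy column `col` out of the alignment, upper-casing and validating each
// character exactly once. `indeterminate` would be 'X' for amino acids.
void loadColumn(const std::vector<std::string>& seqs, std::size_t col,
                char indeterminate, ColumnBuffer& buf) {
    buf.index.assign(seqs.size(), -1);
    buf.skip.assign(seqs.size(), 1);
    for (std::size_t i = 0; i < seqs.size(); ++i) {
        const int c = std::toupper(static_cast<unsigned char>(seqs[i][col]));
        // Gaps ('-') and any other non-letter fail the range test here.
        if (c < 'A' || c > 'Z' || c == indeterminate) continue;
        buf.index[i] = c - 'A';
        buf.skip[i] = 0;
    }
}

// Sum pairwise distances over a buffered column. The inner loop only tests a
// flag and indexes the table; no case conversion or range check is repeated.
float sumColumnDistances(const DistanceTable& table, const ColumnBuffer& buf) {
    float total = 0.0f;
    const std::size_t n = buf.index.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (buf.skip[i]) continue;
        for (std::size_t j = i + 1; j < n; ++j) {
            if (buf.skip[j]) continue;
            total += table.at(buf.index[i], buf.index[j]);
        }
    }
    return total;
}
```

With N sequences, the case conversion and symbol check run N times per column instead of once per pair of sequences, which is where the saving comes from.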

I used valgrind to count cycles on a run of trimAl in strict mode on example.073.AA.strNOG.ENOG411BFCW.fasta. Here are the results, in number of cycles:

| Object | cycles (before PR) | cycles (after PR) |
| --- | --- | --- |
| `Similarity::calculateVectors` (self) | 3,132,810,840 | 1,652,854,574 |
| `Similarity::calculateVectors` (incl) | 8,281,641,990 | 3,240,879,580 |
| `trimal` (total) | 13,504,155,928 | 8,463,690,891 |

althonos, Jun 07 '22 21:06