trimal Improve performance of `Similarity::calculateVectors`

Open althonos opened this issue 2 years ago • 0 comments

Hi there!

While working on althonos/pytrimal I did some thorough profiling of the code and identified some critical sections that could be improved. In particular, I noticed that the code in `Similarity::calculateVectors` was sub-optimal: it repeatedly calls `similarityMatrix::getDistance` with the same sequence characters, and the check for invalid/incorrect symbols seems to have a high performance impact.

To fix this, I added two buffers to store column data: the first one holds the sequence itself, storing uppercase column characters to reduce the number of `utils::toUpper` calls; the other one stores the indices of gapped/indeterminate characters. The sequence characters for a column are checked once, when the column is copied; after that, the distance matrix is indexed directly, without checking character ranges.
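To illustrate the idea, here is a minimal, self-contained sketch of the column-buffer approach. It is not the code from the PR: `DistanceTable`, `ColumnBuffers`, `loadColumn` and `sumPairDistances` are hypothetical names, the 26x26 table is a stand-in for trimAl's similarity matrix, and a boolean flag per sequence replaces the buffer of gap/indeterminate indices for simplicity.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the similarity matrix: distances between the
// 26 uppercase letters, indexed by (letter - 'A').
using DistanceTable = std::array<std::array<float, 26>, 26>;

// Per-column scratch buffers, filled once and reused by the pairwise loop.
struct ColumnBuffers {
    std::vector<unsigned char> index;  // letter index (0..25) for each sequence
    std::vector<bool> skip;            // true for gaps, indeterminate or invalid symbols
};

// Copy column `col` out of the alignment, uppercasing and validating each
// character exactly once.
void loadColumn(const std::vector<std::string>& alignment, std::size_t col,
                char indeterminate, ColumnBuffers& buf) {
    buf.index.assign(alignment.size(), 0);
    buf.skip.assign(alignment.size(), false);
    for (std::size_t i = 0; i < alignment.size(); ++i) {
        assert(col < alignment[i].size());
        const int c = std::toupper(static_cast<unsigned char>(alignment[i][col]));
        if (c < 'A' || c > 'Z' || c == indeterminate) {
            buf.skip[i] = true;  // gap ('-'), indeterminate, or out-of-range symbol
        } else {
            buf.index[i] = static_cast<unsigned char>(c - 'A');
        }
    }
}

// Sum the pairwise distances of a loaded column. The table is indexed
// directly: no case conversion or range check happens in the inner loop.
float sumPairDistances(const DistanceTable& dist, const ColumnBuffers& buf) {
    float total = 0.0f;
    for (std::size_t i = 0; i < buf.index.size(); ++i) {
        if (buf.skip[i]) continue;
        for (std::size_t j = i + 1; j < buf.index.size(); ++j) {
            if (buf.skip[j]) continue;
            total += dist[buf.index[i]][buf.index[j]];
        }
    }
    return total;
}
```

With N sequences, the validation and uppercasing cost drops from once per pair (on the order of N² per column) to once per sequence (N per column), which is where the savings in this sketch would come from.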

I used valgrind to count cycles on a run of trimAl in strict mode on `example.073.AA.strNOG.ENOG411BFCW.fasta`. Here are the results, in number of cycles:

| Object | Cycles (before PR) | Cycles (after PR) |
| --- | ---: | ---: |
| `Similarity::calculateVectors` (self) | 3,132,810,840 | 1,652,854,574 |
| `Similarity::calculateVectors` (incl) | 8,281,641,990 | 3,240,879,580 |
| `trimal` (total) | 13,504,155,928 | 8,463,690,891 |

Jun 07 '22 21:06 althonos
