trimal
Implement SIMD code for faster statistics computation
Hi Nicolás, hi @scapella,
This PR is a draft implementation of SIMD support for computing some statistics (namely, similarity and identity). As we discussed briefly at ECCB, the goal is to keep all SIMD code paths optional, so that trimAl can still be built on any platform.
Build system
I updated the CMakeLists.txt to attempt to detect SSE2 support at compile time. SSE2-specific code will only be built if the compiler supports it. The SSE2 code is kept separate so that it can be compiled with different flags if needed, and is only linked into the executables at the end.
Forcing the build with or without SSE2 can be done with a single CMake flag:
$ cmake -DHAVE_SSE2=1
$ cmake -DHAVE_SSE2=0
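To illustrate what a compile-time-gated SSE2 kernel can look like, here is a minimal sketch. It is not the code in this PR: the function names are hypothetical, the "identity" is simplified to counting equal characters (gaps and indeterminate residues are not treated specially), and it relies on the `__SSE2__` macro that GCC and Clang define when SSE2 is enabled, rather than on the CMake variable above.

```cpp
#include <cstddef>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical scalar fallback: counts positions where both sequences
// hold the same character. Always available, on any platform.
std::size_t count_identical_scalar(const char* a, const char* b, std::size_t n) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        hits += (a[i] == b[i]) ? 1 : 0;
    }
    return hits;
}

#if defined(__SSE2__)
// Hypothetical SSE2 kernel: compares 16 characters per iteration.
std::size_t count_identical_sse2(const char* a, const char* b, std::size_t n) {
    std::size_t hits = 0;
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        // Unaligned loads, so the inputs need no special alignment.
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Each byte of eq is 0xFF where the characters match, 0x00 otherwise.
        __m128i eq = _mm_cmpeq_epi8(va, vb);
        // Collapse to a 16-bit mask and count its set bits.
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));
        while (mask != 0) {
            mask &= mask - 1;
            ++hits;
        }
    }
    // Handle the remaining (fewer than 16) characters with the scalar loop.
    return hits + count_identical_scalar(a + i, b + i, n - i);
}
#endif

// Entry point: the choice is made at compile time, so a build with SSE2
// enabled also requires SSE2 at runtime.
std::size_t count_identical(const char* a, const char* b, std::size_t n) {
#if defined(__SSE2__)
    return count_identical_sse2(a, b, n);
#else
    return count_identical_scalar(a, b, n);
#endif
}
```

Because the selection happens in the preprocessor, the same source builds everywhere, but the resulting binary is tied to the instruction set it was compiled for.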
Dynamic dispatch
At the moment, compiling trimAl with SSE2 support makes SSE2 required at runtime, which is not ideal for distributing the binary. Eventually, the goal would be to have dynamic dispatch: detect CPU features at runtime and select the best SIMD implementation. I've done that previously with the cpu_features library, which could be vendored and compiled statically.
To get a bit more encapsulation, I think it would be nice if the computation of identities and overlaps of alignments were moved to the Manager class, which would act as a proxy for every statistic. This would make it easier to implement the strategy design pattern for selecting the best SIMD implementation at runtime. If that sounds good to you, I'll also work on that before adding more code.
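As a rough sketch of the strategy idea, assuming nothing about the final design: all class and function names below are hypothetical, the SSE2 strategy reuses the scalar loop as a placeholder for a real kernel, and CPU detection uses the GCC/Clang builtin `__builtin_cpu_supports` as a stand-in for the cpu_features library.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical interface that a statistics manager could hold and call.
class IdentityStrategy {
public:
    virtual ~IdentityStrategy() = default;
    virtual const char* name() const = 0;
    virtual std::size_t count_identical(const std::string& a,
                                        const std::string& b) const = 0;
};

// Shared scalar loop over the common prefix of both sequences.
static std::size_t scalar_count(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    std::size_t hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        hits += (a[i] == b[i]) ? 1 : 0;
    }
    return hits;
}

class GenericIdentity : public IdentityStrategy {
public:
    const char* name() const override { return "generic"; }
    std::size_t count_identical(const std::string& a,
                                const std::string& b) const override {
        return scalar_count(a, b);
    }
};

class Sse2Identity : public IdentityStrategy {
public:
    const char* name() const override { return "sse2"; }
    std::size_t count_identical(const std::string& a,
                                const std::string& b) const override {
        // Placeholder: a real build would call an SSE2 kernel compiled in
        // its own translation unit with SSE2 flags. The scalar loop keeps
        // this sketch self-contained.
        return scalar_count(a, b);
    }
};

// Runtime CPU check. Assumption: GCC or Clang on x86; other platforms
// report no SSE2 and fall back to the generic strategy.
bool cpu_has_sse2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

// Factory taking the detection result as a parameter, so it can be tested.
std::unique_ptr<IdentityStrategy> make_identity_strategy(bool has_sse2) {
    if (has_sse2) {
        return std::make_unique<Sse2Identity>();
    }
    return std::make_unique<GenericIdentity>();
}

// Convenience wrapper that picks the strategy for the current CPU.
std::unique_ptr<IdentityStrategy> make_best_identity_strategy() {
    return make_identity_strategy(cpu_has_sse2());
}
```

With this shape, the strategy is chosen once at startup, and callers only see the `IdentityStrategy` interface, so adding an AVX2 or NEON variant later would mean adding one class and one branch in the factory.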
Threading
At the moment, I have disabled the OpenMP thread loops in the SIMD version of the stats until I find a way to share the buffers efficiently between threads. Nevertheless, in my benchmarks, single-threaded runs with SIMD enabled are faster than multi-threaded (8 threads) runs with SIMD disabled.
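One possible direction, sketched here only as an illustration and not as what this PR does, is to give each thread its own scratch buffer instead of sharing one. The function name and the pairwise-identity layout are hypothetical, and the inner loop is scalar to keep the example short. Without `-fopenmp` the pragmas are ignored and the code simply runs serially.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: fraction of identical positions for every pair of
// sequences, stored row-major in an n*n vector.
std::vector<double> pairwise_identity(const std::vector<std::string>& seqs) {
    const std::size_t n = seqs.size();
    std::vector<double> result(n * n, 0.0);

    #pragma omp parallel
    {
        // Declared inside the parallel region, so every thread owns a
        // private buffer and no synchronization is needed for it.
        std::vector<unsigned char> scratch;

        #pragma omp for schedule(dynamic)
        for (long i = 0; i < static_cast<long>(n); ++i) {
            // Stand-in for preparing row i once (e.g. copying it into a
            // buffer laid out for SIMD loads) before comparing it to all others.
            scratch.assign(seqs[i].begin(), seqs[i].end());

            for (std::size_t j = 0; j < n; ++j) {
                const std::string& other = seqs[j];
                const std::size_t len = std::min(scratch.size(), other.size());
                std::size_t hits = 0;
                for (std::size_t k = 0; k < len; ++k) {
                    hits += (scratch[k] == static_cast<unsigned char>(other[k])) ? 1 : 0;
                }
                // Each thread writes only to its own rows of the result.
                result[static_cast<std::size_t>(i) * n + j] =
                    len ? static_cast<double>(hits) / static_cast<double>(len) : 0.0;
            }
        }
    }
    return result;
}
```

The trade-off is memory: one buffer per thread rather than one in total, which is usually acceptable when the buffer holds a single sequence or row.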
Performance
I haven't written comprehensive benchmarks yet, but here is how the runtime improves with -strict and -clusters.
$ time ./trimal_generic -in ../dataset/example.014.AA.EggNOG.COG0591.fasta -strict
________________________________________________________
Executed in 118.20 secs fish external
usr time 270.31 secs 0.00 millis 270.31 secs
sys time 0.48 secs 2.44 millis 0.47 secs
$ time ./trimal_sse2 -in ../dataset/example.014.AA.EggNOG.COG0591.fasta -strict
________________________________________________________
Executed in 29.68 secs fish external
usr time 29.22 secs 928.00 micros 29.22 secs
sys time 0.21 secs 730.00 micros 0.21 secs
$ time ./trimal_generic -in ../dataset/example.014.AA.EggNOG.COG0591.fasta -clusters 5
________________________________________________________
Executed in 64.70 secs fish external
usr time 254.46 secs 0.00 millis 254.46 secs
sys time 0.67 secs 1.54 millis 0.67 secs
$ time ./trimal_sse2 -in ../dataset/example.014.AA.EggNOG.COG0591.fasta -clusters 5
________________________________________________________
Executed in 12.61 secs fish external
usr time 31.89 secs 0.00 micros 31.89 secs
sys time 0.13 secs 670.00 micros 0.12 secs