Implement SIMD code for faster statistics computation

Open althonos opened this issue 2 years ago • 0 comments

Hi Nicolás, hi @scapella,

This PR is a draft implementation of SIMD support for computing some statistics (namely, similarity and identity). As we discussed briefly at ECCB, the goal is to keep all of this optional, so that trimAl can still be built on any platform.

Build system

I updated CMakeLists.txt to try to detect SSE2 support at compile time. SSE2-specific code is only built if the compiler supports it. The SSE2 code is kept separate so that it can be compiled with different flags if needed, and is only linked into the executables at the end.

Forcing the build with or without SSE2 can be done with a single CMake flag:

$ cmake -DHAVE_SSE2=1
$ cmake -DHAVE_SSE2=0
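
To illustrate what an optional SSE2 code path can look like, here is a minimal sketch, not the code from this PR. It assumes the guard is the `__SSE2__` macro that GCC and Clang define when SSE2 is enabled; how the `HAVE_SSE2` CMake option maps to a preprocessor symbol is not shown here. The function names (`count_identical`, `count_identical_generic`, `count_identical_sse2`) are hypothetical and only count matching characters between two equal-length sequences.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Portable fallback: compare one residue at a time.
inline std::size_t count_identical_generic(const char* a, const char* b,
                                           std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) count += (a[i] == b[i]);
    return count;
}

#if defined(__SSE2__)
// SSE2 path: compare 16 residues per iteration.
inline std::size_t count_identical_sse2(const char* a, const char* b,
                                        std::size_t n) {
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        const __m128i va =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Each byte lane becomes 0xFF where the residues are equal.
        const __m128i eq = _mm_cmpeq_epi8(va, vb);
        // Collapse the 16 lanes into a 16-bit mask, one bit per lane.
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));
        // Count the set bits (clears the lowest set bit each step).
        for (; mask != 0; mask &= mask - 1) ++count;
    }
    // Handle the remaining (fewer than 16) residues with the scalar loop.
    return count + count_identical_generic(a + i, b + i, n - i);
}
#endif

// Entry point: the implementation is chosen when the file is compiled.
inline std::size_t count_identical(const char* a, const char* b,
                                   std::size_t n) {
    assert(a != nullptr && b != nullptr);
#if defined(__SSE2__)
    return count_identical_sse2(a, b, n);
#else
    return count_identical_generic(a, b, n);
#endif
}
```

Callers only ever see `count_identical`, so a build without SSE2 compiles the same call sites against the scalar loop.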

Dynamic dispatch

At the moment, compiling trimAl with SSE2 support will make SSE2 required at runtime, which is not ideal for distributing the binary. Eventually, the goal would be to have dynamic dispatch, and select the best SIMD implementation at runtime by detecting CPU features. I've done that previously with the cpu_features library, which could be vendored and compiled statically.

To get a bit more encapsulation, I think it would be nice if the computation of identities and overlaps of alignments were moved to be handled by the Manager class, which would act as a proxy for every statistic. This would make it easier to implement the strategy design pattern for selecting the best SIMD implementation at runtime. If that sounds good to you, I'll also work on that before adding more code.
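
The two ideas above (runtime CPU detection and a strategy object behind a proxy) could be combined roughly as in the following sketch. Everything here is an assumption for illustration: `IdentityStrategy`, `GenericIdentity`, `Sse2Identity`, `make_best_identity_strategy` and `StatsProxy` are hypothetical names, not trimAl's API, and CPU detection uses the GCC/Clang builtin `__builtin_cpu_supports` rather than the cpu_features library mentioned above. The `target("sse2")` attribute lets the SSE2 kernel be compiled without enabling SSE2 for the rest of the translation unit.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define SKETCH_X86_DISPATCH 1
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical strategy interface for the identity statistic.
class IdentityStrategy {
public:
    virtual ~IdentityStrategy() = default;
    virtual const char* name() const = 0;
    virtual std::size_t count_identical(const char* a, const char* b,
                                        std::size_t n) const = 0;
};

class GenericIdentity : public IdentityStrategy {
public:
    const char* name() const override { return "generic"; }
    std::size_t count_identical(const char* a, const char* b,
                                std::size_t n) const override {
        std::size_t count = 0;
        for (std::size_t i = 0; i < n; ++i) count += (a[i] == b[i]);
        return count;
    }
};

#ifdef SKETCH_X86_DISPATCH
// Only this function is compiled with SSE2 enabled.
__attribute__((target("sse2")))
static std::size_t sse2_kernel(const char* a, const char* b, std::size_t n) {
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        const __m128i va =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
        count += static_cast<std::size_t>(
            __builtin_popcount(static_cast<unsigned>(mask)));
    }
    for (; i < n; ++i) count += (a[i] == b[i]);  // scalar tail
    return count;
}

class Sse2Identity : public IdentityStrategy {
public:
    const char* name() const override { return "sse2"; }
    std::size_t count_identical(const char* a, const char* b,
                                std::size_t n) const override {
        return sse2_kernel(a, b, n);
    }
};
#endif

// Pick the best implementation the running CPU supports.
inline std::unique_ptr<IdentityStrategy> make_best_identity_strategy() {
#ifdef SKETCH_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return std::unique_ptr<IdentityStrategy>(new Sse2Identity());
#endif
    return std::unique_ptr<IdentityStrategy>(new GenericIdentity());
}

// Hypothetical proxy standing in for the role the Manager class would play.
class StatsProxy {
public:
    StatsProxy() : identity_(make_best_identity_strategy()) {}
    explicit StatsProxy(std::unique_ptr<IdentityStrategy> s)
        : identity_(std::move(s)) {
        assert(identity_ != nullptr);
    }
    std::size_t identical(const char* a, const char* b, std::size_t n) const {
        return identity_->count_identical(a, b, n);
    }
    const char* backend() const { return identity_->name(); }

private:
    std::unique_ptr<IdentityStrategy> identity_;
};
```

A default-constructed `StatsProxy` selects the backend once at startup, while the explicit constructor makes it possible to force the generic path, for example to compare results between implementations.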

Threading

At the moment, I have disabled the OpenMP thread loops in the SIMD version of the stats until I find a way to share the buffers efficiently between threads. Nevertheless, single-threaded runs with SIMD enabled are faster than multi-threaded (8 threads) runs with SIMD disabled in my benchmarks.

Performance

I haven't written comprehensive benchmarks yet, but here is how the runtime improves with -strict and -clusters.

$ time ./trimal_generic -in ../dataset/example.014.AA.EggNOG.COG0591.fasta -strict
________________________________________________________
Executed in  118.20 secs    fish           external
   usr time  270.31 secs    0.00 millis  270.31 secs
   sys time    0.48 secs    2.44 millis    0.47 secs

$ time ./trimal_sse2 -in ../dataset/example.014.AA.EggNOG.COG0591.fasta -strict
________________________________________________________
Executed in   29.68 secs    fish           external
   usr time   29.22 secs  928.00 micros   29.22 secs
   sys time    0.21 secs  730.00 micros    0.21 secs

$ time ./trimal_generic -in ../dataset/example.014.AA.EggNOG.COG0591.fasta -clusters 5
________________________________________________________
Executed in   64.70 secs    fish           external
   usr time  254.46 secs    0.00 millis  254.46 secs
   sys time    0.67 secs    1.54 millis    0.67 secs

$ time ./trimal_sse2 -in ../dataset/example.014.AA.EggNOG.COG0591.fasta -clusters 5
________________________________________________________
Executed in   12.61 secs    fish           external
   usr time   31.89 secs    0.00 micros   31.89 secs
   sys time    0.13 secs  670.00 micros    0.12 secs

althonos • Sep 27 '22 19:09