fastp icon indicating copy to clipboard operation
fastp copied to clipboard

Implement vectorisation to make this lib blazingly fast

Open chinwobble opened this issue 1 year ago • 11 comments

Vectorisation using SIMD should provide big performance improvements on large datasets. There are a few candidates for vectorisation such as in the merge function. https://github.com/OpenGene/fastp/blob/master/src/stats.cpp#L881-L939

I'm happy to have a go at implementing this if you can provide a sufficient fastq file for me to test with

References: https://chryswoods.com/vector_c++/portable.html https://github.com/google/highway/

chinwobble avatar Nov 26 '24 04:11 chinwobble

Thanks, but currently the merge function is not the bottleneck of performance.

sfchen avatar Nov 26 '24 11:11 sfchen

And I completely agree with you that SSE/AVX can make this tool even faster. You can take a look at fastplong, which is more time consuming and is being intensively developed.

sfchen avatar Nov 26 '24 11:11 sfchen

@chinwobble AdapterRemoval got a nice speed boost when SIMD instructions were introduced. It is a similar program to fastp so you might get the idea where the speed boost is needed.

https://github.com/MikkelSchubert/adapterremoval

teepean avatar Nov 26 '24 18:11 teepean

And I completely agree with you that SSE/AVX can make this tool even faster. You can take a look at fastplong, which is more time consuming and is being intensively developed.

Thanks I will have a go at implementing this! Can you provide some large test files for me to work with?

chinwobble avatar Nov 28 '24 01:11 chinwobble

I've had a quick attempt implementing SIMD on the Sequence::reverseComplement and its almost 10x faster on my machine.

(I understand this might not be the bottleneck).

The implementation looks like this:

string Sequence::reverseComplementHwy(string *origin)
{
    auto length = origin->length();
    const auto sequence = reinterpret_cast<const uint8_t*>(&origin[0]);
    uint8_t output[length];
    const auto transform = [](const auto d, auto output, const auto sequence) HWY_ATTR
    {
        const auto a = hn::Set(d, 65UL);
        const auto t = hn::Set(d, 84UL);
        const auto c = hn::Set(d, 67UL);
        const auto g = hn::Set(d, 71UL);
        output = hn::IfThenElse(hn::Eq(sequence, a), t, output);
        output = hn::IfThenElse(hn::Eq(sequence, t), a, output);
        output = hn::IfThenElse(hn::Eq(sequence, c), g, output);
        output = hn::IfThenElse(hn::Eq(sequence, g), c, output);
        return output;
    };

    const hn::ScalableTag<uint8_t> d;
    hn::Transform1(d, output, length, sequence, transform);

    auto retVal = reinterpret_cast<char *>(output);
    std::string reversed(retVal, length);
    return reversed;
}

Comparisons:

Run on (16 X 3693.06 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 0.22, 0.34, 0.25
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_SequenceReverseSerial        296 ns          304 ns      2141852
BM_SequenceReverseHwy          37.5 ns         38.6 ns     18777781

In order to implement this into the repo we would need to a few things:

  • change build system to cmake. This has a number of benefits including making cross platforms build easier. Currently the CI doesn't build properly on ARM64 properly.
  • Add vcpkg which depends on cmake. This make it much easier to install third party libraries. I used https://github.com/google/highway so that the code is portable (works on ARM64 and AMD64, etc).

chinwobble avatar Dec 02 '24 05:12 chinwobble

This seems very promising! Have you ever tried the arm-arch ?

sfchen avatar Dec 02 '24 06:12 sfchen

This seems very promising! Have you ever tried the arm-arch ?

Yes I have tested this in CI with mac-os ARM64. The output looks like this. (Not sure why the clock speed is so slow).

Running ./builds/ninja-multi-vcpkg/Release/bench
Run on (3 X 24 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB
  L1 Instruction 192 KiB
  L2 Unified 12288 KiB (x3)
Load Average: 2.49, 5.27, 5.95
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_SequenceReverseSerial        222 ns          220 ns      3102479
BM_SequenceReverseHwy          32.5 ns         32.1 ns     21874727

chinwobble avatar Dec 02 '24 08:12 chinwobble

@sfchen would you be open me changing this repo or fastplong use CMake so that we can manage external dependencies and make linux / mac builds easier?

I pushed a branch and the build looks like this. You can see samples of adding external dependencies and compiler flags. https://github.com/chinwobble/CppCMakeVcpkgTemplate/blob/fastplong/CMakeLists.txt#L19-L26

chinwobble avatar Dec 03 '24 00:12 chinwobble

@sfchen would you be open me changing this repo or fastplong use CMake so that we can manage external dependencies and make linux / mac builds easier?

I pushed a branch and the build looks like this. You can see samples of adding external dependencies and compiler flags. https://github.com/chinwobble/CppCMakeVcpkgTemplate/blob/fastplong/CMakeLists.txt#L19-L26

Hi, thanks for your suggestion. What kind of external dependencies you want to use? I was always trying to keep the simplicity of building such tools, so I didn't use CMake. You know, most of the users are with biological backgrounds, so simplicity is very important.

sfchen avatar Dec 03 '24 06:12 sfchen

@sfchen would you be open me changing this repo or fastplong use CMake so that we can manage external dependencies and make linux / mac builds easier? I pushed a branch and the build looks like this. You can see samples of adding external dependencies and compiler flags. https://github.com/chinwobble/CppCMakeVcpkgTemplate/blob/fastplong/CMakeLists.txt#L19-L26

Hi, thanks for your suggestion. What kind of external dependencies you want to use? I was always trying to keep the simplicity of building such tools, so I didn't use CMake. You know, most of the users are with biological backgrounds, so simplicity is very important.

Hey thanks for your response. I totally understand your emphasis on simplicity and potentially removing external dependencies for contributors.

Here are some problems and solutions to address your concerns:

  1. Problem: Existing projects that use SIMD use architecture specific (x86) intrinsics that are difficult to write and don't work on ARM64. For example https://github.com/MikkelSchubert/adapterremoval Solution: https://github.com/google/highway solves this problem by providing higher level algorithms that are easy to write and work across many architectures.
  2. Problem: Adding more tools such as cmake complicates the build process and makes it more difficult to contribute to this project. When I started looking into this project I couldn't get it to compile on ubuntu after cloning. The project took some time for me to setup, there was no debugging / IDE support. Solution: I can help you make this project easier to use by adding a CD pipeline to publish the binaries so that "users" can use this tool without compiling themselves. A more ambitious solution for making this tool user friendly is to build a GUI around this tool - that would be quite difficult to setup without cmake managing external dependencies such as imGUI. For developers, its much easier to debug issues if you can use the debugging, run run specific tests (see screenshot below).
  3. Does cmake provide any benefits that can't be achieved by just using the Makefile? Installing third party dependencies directly is difficult for some libraries that are more than header only libraries. The libraries that I think will be useful:
  • https://github.com/google/benchmark
  • https://github.com/google/googletest
  • https://github.com/google/highway Using cmake also makes it easier for developers to work on this project since you can easily debug, run, tests, run benchmarks etc Ideally users should be able to clone the repo, run and everything should "just work"
make

I think this can be achieved by putting all commands needed to setup cmake in the Makefile. This would be much easier to achieve if this repo contained everything we need to build including external dependencies rather than requiring the user to install it separately. This is how https://github.com/neovim/neovim is setup.

image

chinwobble avatar Dec 04 '24 09:12 chinwobble

@sfchen I've started adding SIMD and fixed the windows build on this PR. https://github.com/OpenGene/fastplong/pull/6 Can you have a look?

chinwobble avatar Dec 08 '24 07:12 chinwobble