seqan3 icon indicating copy to clipboard operation
seqan3 copied to clipboard

[PERF BUG] improve perfomance of dna4::complement

Open kloetzl opened this issue 4 years ago • 22 comments

Platform

  • SeqAn version: master (3.0.0-1571-g96216a19)
  • Operating system: Linux AlAnfa 5.7.7-arch1-1 #1 SMP PREEMPT Wed, 01 Jul 2020 14:53:16 +0000 x86_64 GNU/Linux
  • Compiler: g++ (GCC) 10.1.0

Description

Adding the following function to dna4 makes complementing it about 16% faster. Patch at https://github.com/kloetzl/seqan3/commit/c43427f5092b95623eb07477513bb3afc40ff083

  public:
    constexpr dna4 complement() const noexcept
    {
        return dna4{}.assign_rank(3 - to_rank());
    }

Now, you might stop here and think this is ugly and 16% is not worth it. Or you go all in for performance and merge this patch. However, there is more to this.

I am currently looking into fast ways to complement dna strings. Comparing my own implementation with BWA and Seqan3, I noticed that the instructions generated for seqan are suboptimal. (Note that I am not an expert on seqan, I just used what was recommended in the tutorial.) Digging into the code I found that it can be simplified and boils down to a subtraction. Here are the runtimes of computing the reverse complement of a megabase:

-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
dnax_revcomp          1086088 ns      1051968 ns          665
bwa_bench              190138 ns       184017 ns         3575
bench/dna4_revcomp     177823 ns       171157 ns         4092
bench_seqan3_dna4      621127 ns       606884 ns         1162

Even with the patch seqan is almost nine time slower than BWA which also uses a subtraction. However, the BWA code gets vectorized by the compiler, whereas seqan does not. So there is something in the template magic preventing that optimization. Here is the assembly that GCC generates:

seqan

Interestingly even this assembly is kinda weird. Note how the $0x3 is stored in a register %esi, and that is then moved on each iteration of the loop. There might be something fishy going on here because rank_type is unsigned char.

Just wanted to share what I found in the hope you can make sense of it. ☺ Best, Fabian

kloetzl avatar Jul 06 '20 20:07 kloetzl

Thank you for investigating this!

The templates shouldn't matter too much. The compiler is pretty good at resolving them into normal for loops, but saying that views are not completely cost free.

The real bottleneck is that we internally use look-up tables to convert char to seqan3::dna4 and the same for reverse complementing a seqan3::dna4. Look-up tables are inherently bad for SIMD and I have stated several time that arithmetic operations might be better for exactly this use case (I just talked about this today with @rrahn). Since we would benefit more with overall performance if we would use SIMD instead of single core performance.

I never had the time to validate those claims, because we are currently stabilizing the public interfaces.

So it is fascinating that you are validating my gut feeling :)

Our internal rank representation is currently 0: A, 1: C, 2: G, 3: T, so 3 - dna4 letter is indeed the correct way to calculate the reverse complement. It would be interesting to see if we could find a similar formula for seqan3::dna15 (i.e. IUPAC).

It would be nice if you could try this patch out:

diff --git a/include/seqan3/alphabet/nucleotide/dna4.hpp b/include/seqan3/alphabet/nucleotide/dna4.hpp
index ae86fc609..9f60668ca 100644
--- a/include/seqan3/alphabet/nucleotide/dna4.hpp
+++ b/include/seqan3/alphabet/nucleotide/dna4.hpp
@@ -82,6 +82,18 @@ public:
     }
     //!\}
 
+    constexpr dna4 & assign_char(char_type const c) noexcept
+    {
+        char_type upper_char = c & 0b0101'1111; // to_upper
+        assign_rank((upper_char == 'T') * 3 + (upper_char == 'G') * 2 + (upper_char == 'C'));
+        return *this;
+    }
+
+    constexpr dna4 complement() const noexcept
+    {
+        return dna4{}.assign_rank(to_rank() ^ 0b11);
+    }
+
 protected:
     //!\privatesection

(This won't do any look-ups and will only use arithmetic operations, which should be auto-vectorisable)

This is your benchmark, right?

	auto s = std::string{forward};
	auto s_as_dna = s | seqan3::views::char_to<seqan3::dna4>;

	for (auto _ : state) {
		auto revcomp =
			s_as_dna | std::views::reverse | seqan3::views::complement;

		// force evaluation of view
		for (auto foo : revcomp)
			benchmark::DoNotOptimize(foo);
	}

Can you try this variant for me (after you applied the patch)?:

	auto s = std::string{forward};

	for (auto _ : state) {
		auto revcomp =
			s | std::views::reverse | seqan3::views::char_to<seqan3::dna4> | seqan3::views::complement;

		// force evaluation of view
		for (auto foo : revcomp)
			benchmark::DoNotOptimize(foo);
	}

For cache behaviour I would first reverse and when do the char transformation. These are all lazy operations and the order shouldn't matter too much, but I would expect that doing it in this order should be tiny bit faster.

marehr avatar Jul 06 '20 21:07 marehr

The compiler is pretty good at resolving them into normal for loops.

That is the issue here. Even with my patch applied the code is compiled into a normal, scaler loop, not a (auto-)vectorized one despite full optimizations.

It would be interesting to see if we could find a similar formula for seqan3::dna15 (i.e. IUPAC).

On x86_64 you can use PSHUFB for parallel table lookups. Here is an example how this can be done. However, that makes it less portable.

It would be nice if you could try this patch out: …

That makes it much slower than my patch and your original version. (Changing the benchmarking code makes no big difference.)

-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
dnax_revcomp          1053474 ns      1023856 ns          679
bwa_bench              185624 ns       181061 ns         3902
bench/dna4_revcomp     186789 ns       176443 ns         3970
bench_seqan3_dna4     3610707 ns      3493823 ns          199

kloetzl avatar Jul 07 '20 07:07 kloetzl

Ooops. Turns out my benchmarking code was wrong. Because I DoNotOptimize(foo) I basically force byte-wise generation. Storing the result in a vector is properly vectorized.

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
dnax_revcomp                1066001 ns      1027934 ns          657
bwa_bench                    184196 ns       178005 ns         3922
bench/dna4_revcomp           180400 ns       174404 ns         4028
original (master):
bench_seqan3_dna4            734305 ns       713637 ns          976
bench_seqan3_dna4_vector     800214 ns       777704 ns          908
my patch:
bench_seqan3_dna4            679484 ns       663801 ns         1055
bench_seqan3_dna4_vector     847492 ns       832866 ns          844
your patch:
bench_seqan3_dna4           3791564 ns      3716924 ns          195
bench_seqan3_dna4_vector     130958 ns       127241 ns         5311

So your new patch is really really fast. In fact it is much faster than I would expect it to be as it does a lot per loop iteration.

kloetzl avatar Jul 07 '20 09:07 kloetzl

Phew. I thought I'm stupid xD

A lot in a loop does not mean much, if it can pipeline a lot. (Which should be the case here)

Thank you for bringing this to our attention and I think it is extremely valuable to have this kind of constructive feedback.

I was disconnected form internet today:

I wanted to try this patch next

diff --git a/include/seqan3/alphabet/nucleotide/dna4.hpp b/include/seqan3/alphabet/nucleotide/dna4.hpp
index ae86fc609..4c224419f 100644
--- a/include/seqan3/alphabet/nucleotide/dna4.hpp
+++ b/include/seqan3/alphabet/nucleotide/dna4.hpp
@@ -47,11 +47,11 @@ class rna4;
  * If the special char conversion of IUPAC characters is not your desired behavior, refer to our cookbook for an
  * example of \ref cookbook_custom_dna4_alphabet to change the conversion behavior.
  */
-class dna4 : public nucleotide_base<dna4, 4>
+class dna4 : public nucleotide_base<dna4, 256>
 {
 private:
     //!\brief The base class.
-    using base_t = nucleotide_base<dna4, 4>;
+    using base_t = nucleotide_base<dna4, 256>;
 
     //!\brief Befriend seqan3::nucleotide_base.
     friend base_t;
@@ -82,6 +82,44 @@ public:
     }
     //!\}
 
+    static uint8_t constexpr alphabet_size = 4;
+
+    constexpr dna4 & assign_rank(rank_type const c) noexcept
+    {
+        base_t::assign_rank(c == 0 ? 'A' : (c == 1 ? 'C' : (c == 2 ? 'G' : 'T')));
+        return *this;
+    }
+
+    constexpr dna4 & assign_char(char_type const c) noexcept
+    {
+        base_t::assign_rank(c);
+        return *this;
+    }
+
+    constexpr char_type to_char() const noexcept
+    {
+        return base_t::to_rank();
+    }
+
+    constexpr rank_type to_rank() const noexcept
+    {
+        char_type c{to_char()};
+        char_type upper_char = c & 0b0101'1111; // to_upper
+        rank_type rank{};
+        rank = (upper_char == 'T') * 3 + (upper_char == 'G') * 2 + (upper_char == 'C');
+        return rank;
+    }
+
+    constexpr dna4 complement() const noexcept
+    {
+        char_type rank{to_char()};
+        rank ^= rank % 2 ? ('C' ^ 'G') : ('A' ^ 'T');
+
+        dna4 ret{};
+        static_cast<base_t &>(ret).assign_rank(rank);
+        return ret;
+    }
+
 protected:
     //!\privatesection
 

which is basically the same as not converting a char when reading it in and doing your pretty nifty trick in the complement with the conditional xor. :)

marehr avatar Jul 07 '20 17:07 marehr

@kloetzl Can you tell me the tool which you used to visualise the assembler / opcode? Thank you! :)

marehr avatar Jul 07 '20 19:07 marehr

Those are the Linux perf tools. Just perf record ./command --flags and it takes a profile. With perf report you get an overview of all functions and how much time is spent in each. You can then enter each function and inspect individual instructions. If you supply -ggdb you even get the instructions interleaved with the source.

kloetzl avatar Jul 07 '20 19:07 kloetzl

Btw, next time you tag a release please add the -a flag. Otherwise git describe uses 3.0.0 as reference.

(EDIT marehr: This will be tracked at https://github.com/seqan/product_backlog/issues/159)

kloetzl avatar Jul 08 '20 10:07 kloetzl

Found a bug in my code. Here are some new numbers using this patch.

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
dnax_revcomp                 1060246 ns      1008337 ns          685
bwa_bench                     135160 ns       134486 ns         6061
bench/dna4_revcomp            138323 ns       137878 ns         5031
bench/dna4_revcomp_inline     108384 ns       106274 ns         5825
bench_seqan3_dna4            3618904 ns      3523777 ns          198
bench_seqan3_dna4_vector      134825 ns       129857 ns         5182

kloetzl avatar Aug 17 '20 12:08 kloetzl

I recomputed the table above with the latest master on my old laptop.

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
dnax_revcomp                 1022730 ns      1020968 ns          572
bwa_bench                      99187 ns        98739 ns         7299
bench/dna4_revcomp             92767 ns        92545 ns         7401
bench/twiddle                 101125 ns       100873 ns         7023
bench_seqan3_dna4             701982 ns       701122 ns         1013
bench_seqan3_dna4_vector      804092 ns       772922 ns          922

kloetzl avatar Apr 05 '21 18:04 kloetzl

I recomputed the table above with the latest master on my old laptop.

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
dnax_revcomp                 1022730 ns      1020968 ns          572
bwa_bench                      99187 ns        98739 ns         7299
bench/dna4_revcomp             92767 ns        92545 ns         7401
bench/twiddle                 101125 ns       100873 ns         7023
bench_seqan3_dna4             701982 ns       701122 ns         1013
bench_seqan3_dna4_vector      804092 ns       772922 ns          922

Thank you for reminding :) I guess those numbers are without the proposed patch.

Maybe I add some progress report:

We started to change our alphabet_base implementation to allow any kind of char to rank and rank to char conversion method. The same will be done for the complement implementation until the next release. This allows implementing a dna4_simd alphabet which uses auto-vectorisable arithmetic operations, like the one I proposed in this Patch.

One of the major problems we are facing right now, is how to provide a sane interface that provides the best implementation in both the generic and simd-aware use cases.

Some benchmarks showed that switching from look-up tables to arithmetic operations can have a drastic performance penality with SIMD disabled. Also use cases that only call seqan3::assign_char_to occasionally, will have slower performance.

It seems like enabling SIMD computation will need some special care, in the past I thought that adding a special view

std::vector<char> sequence{};
auto view = sequence | std::views::reverse | seqan3::views::simdify(seqan3::views::char_to<seqan3::dna4> | seqan3::views::complement);

might be a viable solution, but I'm not completely sure if that is still a good idea.

Right now I think changing the way how the API is called might be a good first step, e.g.

// single conversion, non-SIMDifiable
// always uses lookup-tables
dna4::char_to_rank('A') == 0;

using rank_type = seqan3::alphabet_rank_t<seqan3::dna4>;

std::vector<char> sequence{'A', 'C', 'G', 'T'};
std::vector<rank_type> ranks(4, 0); // same size as sequence

// std::span is contiguous memory, multiple conversions, SIMDifiable
// uses lookup-tables without any optimisations, uses SIMD-operations when using SSE4, ... etc.
dna4::char_to_rank(std::span{ranks} /*out-param*/, std::span{sequence} /*in-param*/);

std::vector<rank_type> complement_ranks(4, 0);
dna4::rank_complement(std::span{complement_ranks} /*out-param*/, std::span{ranks});

std::vector<char> complement_sequence(4, 'A');
dna4::char_complement(std::span{complement_sequence} /*out-param*/, std::span{sequence});

This would allow to provide a more "C" like interface to work directly on contiguous, raw-pointers and I guess would be a good building-block to provide a high-level interface such as seqan3::views::simdify.

marehr avatar Apr 06 '21 13:04 marehr

2022 update, numbers still look the same:

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
dnax_revcomp                  778646 ns       774001 ns          887
bwa_bench                      97834 ns        97409 ns         6986
bench/dna4_revcomp             92363 ns        91299 ns         7631
bench_seqan3_dna4             749998 ns       711614 ns          974
bench_seqan3_dna4_vector      810274 ns       782115 ns          889

kloetzl avatar Mar 12 '22 15:03 kloetzl

If the small patch by @marehr up there (https://github.com/seqan/seqan3/issues/1970#issuecomment-654479994) already improves performance so much, can we just apply that, even if we don't have a more general solution for the other alphabets, yet?

edit: ah, I see, it slows things down for the non-vectorised use-case, right? May be just `#ifdef' around it?

h-2 avatar Apr 11 '22 18:04 h-2

Although from @marehr's last comment it doesn't feel like a good solution to just patch it, I would have to look deeper into this, which #ifdef would you propose? An extra macro that you can enable/disable with Cmake? This would be a lot of infrastructure for such a small thing.

smehringer avatar Apr 13 '22 07:04 smehringer

which #ifdef would you propose?

I would just check for AVX2:

#ifdef __AVX2__

But before adding, it would probably be good to have a benchmark included, as well. What was the latest benchmark code you were using, @kloetzl ?

h-2 avatar Apr 13 '22 14:04 h-2

I wouldn't add this as a #ifdef. Why do we have the alphabet concept?

My patch is just good for a certain use case and without carefully crafting your code that makes use of this patch (mainly auto-vectorizable loops on contiguous memory), you will have a performance degeneration for all other use cases (most user code will NOT be in this form).

I think the more appropriate option would be to add special alphabets for exactly this use case:

seqan3::simd::dna4_rank dna4_1; // (rank based implementation, my patch)
seqan3::simd::dna4_char dna4_2; // (char based implementation, kloetzl implementation)

This would just be one part, the second part would be the question of how to efficiently have sequence operations/algorithms, like e.g. "reverse_complement".

But, lacking an application that uses this, I don't know what the best API would be.

As I wrote in https://github.com/seqan/seqan3/issues/1970#issuecomment-814113239, my gut feeling is that having "algorithmic" functions (like in #include <algorithm>) would be the best way to encapsulate "simd" functionality. This is a bit contrary to our views-only approach, as it is more low-level.

marehr avatar Apr 13 '22 16:04 marehr

I agree with @marehr that we should not aim for an easy-patch solution when we do not have a use case at hand. (I don't count a specific benchmark as use case).

@h-2 An #ifdef checking merly for the availability of SIMD will most probably decrease performance in a lot of existing code.

Regarding the extra alphabet, we could maybe add a cookbook entry for now that basically provides an efficient dna4 simd alphabet with @marehr s patch applied. Or maybe a documentation entry in the alphabt module page or even an how-to if someone has the time. With a note that the user interested in it should report back to us if he/she has a specific use case.

smehringer avatar Apr 13 '22 17:04 smehringer

Re-reading this thread, kloetzl proposed a "better" complement implementation for dna4 / rna4.

He stated that

Adding the following function to dna4 makes complementing it about 16% faster. https://github.com/seqan/seqan3/issues/1970#issue-651799129

Currently, we use lookup tables for reverse complement,

https://github.com/seqan/seqan3/blob/e5c63f95004c67649ea0a6918348718852346e59/include/seqan3/alphabet/nucleotide/dna4.hpp#L160-L176

and kloetzl's patch

  public:
    constexpr dna4 complement() const noexcept
    {
        return dna4{}.assign_rank(3 - to_rank());
    }

or my patch

diff --git a/include/seqan3/alphabet/nucleotide/dna4.hpp b/include/seqan3/alphabet/nucleotide/dna4.hpp
index ae86fc609..9f60668ca 100644
--- a/include/seqan3/alphabet/nucleotide/dna4.hpp
+++ b/include/seqan3/alphabet/nucleotide/dna4.hpp
@@ -82,6 +82,18 @@ public:
     }
     //!\}
 
+    constexpr dna4 complement() const noexcept
+    {
+        return dna4{}.assign_rank(to_rank() ^ 0b11);
+    }

would use arithmetic operations instead. As both are "single" instruction solutions, they should always be at least as good as the lookup table (for all use cases).

See https://godbolt.org/z/TKKoEv1bc for generated instructions.

So I would suggest changing dna4's complement implementation from lookup to xor. (We/I already did something similar for the phred implementation, https://github.com/seqan/seqan3/commit/fdf35e500a9a0c57ba7ff7fab48ba52e82b29b24)

Regarding the extra alphabet, we could maybe add a cookbook entry for now that basically provides an efficient dna4 simd alphabet with @marehr s patch applied. Or maybe a documentation entry in the alphabt module page or even an how-to if someone has the time. With a note that the user interested in it should report back to us if he/she has a specific use case.

Adding a cookbook entry how to write an alphabet that works in auto-vectorized loops would be fine for me with that remark.

marehr avatar Apr 13 '22 17:04 marehr

But before adding, it would probably be good to have a benchmark included, as well. What was the latest benchmark code you were using, @kloetzl ?

Here you go: https://github.com/kloetzl/libdna/blob/master/bench2/revcomp.cxx

This would just be one part, the second part would be the question of how to efficiently have sequence operations/algorithms, like e.g. "reverse_complement".

But, lacking an application that uses this, I don't know what the best API would be.

I would argue that computing the reverse_complement is more common than just the complement on its own.

I wouldn't add this as a #ifdef.

One more problem with an ifdef is portability. If an application using seqan would be compiled as a debian package on a machine supporting AVX2 it would not run on a users machine without AVX2 support.

As both are "single" instruction solutions, they should always be at least as good as the lookup table (for all use cases).

Yes, xor and subtraction are near-optimal solutions. That should be good enough for most use cases.

kloetzl avatar Apr 13 '22 21:04 kloetzl

Alright, to me it looks like the following tasks would be open then:

  • [x] Add a benchmark for computing the complement (and reverse complement). See https://github.com/kloetzl/libdna/blob/master/bench2/revcomp.cxx
  • [x] Replace the look-up table with an xor+subtraction when computing the complement in dna4 as proposed by @marehr see comment
  • [ ] Do the above for all other nucleotide alphabets?
  • [x] Add a cookbook/documentation/how-to entry (depending on only-code/code-with-some-explanation/code-with-a-lot-of-explanation respectively) of how to write an alphabet that works in auto-vectorized loops.

What I don't see happening but is part of the discussion:

  • [ ] Add an #ifdef for SIMD available and apply this patch to dna4.

I'll propose this in the next core meeting on Monday :)

smehringer avatar Apr 14 '22 07:04 smehringer

I think for dna5 some potential arithmetic versions could be:

uint8_t dna5_complement_subtract(uint8_t rank)
{
    return min(3 - rank, 4); // using underflow
}

uint8_t dna5_complement_xor_is_dna5(uint8_t rank)
{
    bool is_n = rank > 3;
    return is_n ? rank : (rank ^ 0b11); // using conditional assignment
}

uint8_t dna5_complement_xor_min(uint8_t rank)
{
    rank = rank ^ 0b11;
    return min(rank, 4); // correcting reverse complement of N to be N
}

Assembler suggests that dna5_complement_xor_is_dna5 might be the best variant. See https://godbolt.org/z/qqq78fvb7

marehr avatar Apr 14 '22 09:04 marehr

Assembler suggests that dna5_complement_xor_is_dna5 might be the best variant. See https://godbolt.org/z/qqq78fvb7

That is very similar to how bwa does it.

kloetzl avatar Apr 15 '22 10:04 kloetzl

Core Meeting 2022-04-25

We agreed on the tasks outlined in comment https://github.com/seqan/seqan3/issues/1970#issuecomment-1098788971.

Writing the benchmark has highest priority.

smehringer avatar Apr 25 '22 11:04 smehringer

We merged https://github.com/seqan/seqan3/pull/3026 which adds an example for an auto-vectorizable dna4 alphabet and improves the performance of the dna4-complement and also adds a benchmark.

Thanks again a lot for your work, @kloetzl !

You are welcome to open another issue if you have other suggestions, we are excited to hear about them :)

eseiler avatar Oct 21 '22 10:10 eseiler

That's great to hear! Pleased I could help and we managed to produce such a nice speed up. :+1:

kloetzl avatar Oct 21 '22 19:10 kloetzl