seqan3 icon indicating copy to clipboard operation
seqan3 copied to clipboard

[DOC, FEATURE] Improve dna4 complement performance and add documentation.

Open smehringer opened this issue 3 years ago • 4 comments

(Aims to) Resolves #1970

  • [x] Add a benchmark for computing the complement (and reverse complement). See https://github.com/kloetzl/libdna/blob/master/bench2/revcomp.cxx
  • [x] Replace the look-up table with an xor+subtraction when computing the complement in dna4 as proposed by @marehr see https://github.com/seqan/seqan3/issues/1970#issuecomment-1098330797
  • [x] Add a cookbook/documentation/how-to entry (depending on only-code/code-with-some-explanation/code-with-a-lot-of-explanation respectively) of how to write an alphabet that works in auto-vectorized loops.

smehringer avatar Jun 21 '22 07:06 smehringer

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
seqan3 ✅ Ready (Inspect) Visit Preview Oct 21, 2022 at 8:18AM (UTC)

vercel[bot] avatar Jun 21 '22 07:06 vercel[bot]

Codecov Report

Base: 98.24% // Head: 98.25% // Increases project coverage by +0.00% :tada:

Coverage data is based on head (68c7bd4) compared to base (e723338). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3026   +/-   ##
=======================================
  Coverage   98.24%   98.25%           
=======================================
  Files         275      276    +1     
  Lines       12318    12344   +26     
=======================================
+ Hits        12102    12128   +26     
  Misses        216      216           
Impacted Files Coverage Δ
include/seqan3/alphabet/nucleotide/dna4.hpp 100.00% <100.00%> (ø)
test/include/seqan3/test/performance/simd_dna4.hpp 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Jun 21 '22 08:06 codecov[bot]

I played around a bit:

https://quick-bench.com/q/RtiW9XqNDG_O9BzGo8gUzKAZQNo

constexpr uint8_t rank(char c) noexcept
{
    uint8_t rank = static_cast<uint8_t>(c);
    rank |= (rank & 0b0001'0000) >> 1;
    rank += 1u;
    rank >>= 2;
    rank &= 0b0000'0011;
    return rank;
}
A = 0100'0001 
C = 0100'0011
G = 0100'0011
T = 0101'1011
a = 0110'0001
c = 0110'0011
g = 0110'0111
t = 0111'0100
rank |= (rank & 0b0001'0000) >> 1; // Set 4th bit to one if 5th bit is set (t and T)
rank += 1u; // Luckily flips the bits we need to flip
rank >>= 2; // Drop lowest 2 bits
rank &= 0b0000'0011; // Mask other bits
return rank;

Works for both lower and upper case chars. Microbenchmark looks promising.

I still need to test it with this PR and auto-vectorization.

eseiler avatar Jun 21 '22 09:06 eseiler

assign_char<seqan3::dna4>                             80.6 ns         7183289
assign_char<seqan3::rna4>                             79.1 ns         8674274
assign_char<seqan3::simdifyable_dna4>                  184 ns         3756637

to_char<seqan3::dna4>                                 66.0 ns         10194749
to_char<seqan3::rna4>                                 65.8 ns         10030062
to_char<seqan3::simdifyable_dna4>                      100 ns          6781066

====

complement<tag::revcomp_dna4_inline>                 24971 ns         26504
complement<tag::seqan3_dna4>                        303294 ns          2297
complement<tag::seqan3_dna4_vector>                 450026 ns          1557
complement<tag::seqan3_dna4_simdifyable>           2025158 ns           347
complement<tag::seqan3_dna4_simdifyable_vector>      31200 ns         22344

^^^ current assign_char
vvv other assign_char

complement<tag::revcomp_dna4_inline>                 24949 ns         27054
complement<tag::seqan3_dna4>                        302653 ns          2299
complement<tag::seqan3_dna4_vector>                 466759 ns          1490
complement<tag::seqan3_dna4_simdifyable>            710485 ns           961
complement<tag::seqan3_dna4_simdifyable_vector>      38617 ns         18205

Still needs a few things to be fully dna4-compatible.

As a side note, if we were to do the complement in assign_char, we would get a bit closer:

    constexpr simdifyable_dna4 & assign_char(char_type const c) noexcept
    {
        char_type const upper_char = c & 0b0101'1111; // to_upper
        rank_type rank = (upper_char == 'A') * 3 + (upper_char == 'C') * 2 + (upper_char == 'G');
        return assign_rank(rank);
    }
complement<tag::revcomp_dna4_inline>                 25064 ns        27042
complement<tag::seqan3_dna4>                        303093 ns         2147
complement<tag::seqan3_dna4_vector>                 451256 ns         1552
complement<tag::seqan3_dna4_simdifyable>           2029439 ns          340
complement<tag::seqan3_dna4_simdifyable_vector>      28431 ns        24158

eseiler avatar Jun 21 '22 17:06 eseiler