
Running sequence alignment with alphabets with more than 256 characters

Open andrestiraboschieclypsium opened this issue 10 months ago • 3 comments

Platform

  • SeqAn version: 3.4.0
  • Operating system: Linux
  • Compiler: g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

Question

Hi, is it possible to compute sequence alignments using alphabets larger than 256 characters?

For instance, I tried running one of the tutorial examples with an alphabet of more than 256 characters, which I defined like this:

class example_alphabet : public seqan3::alphabet_base<example_alphabet, 1333, char16_t>
{
.
.
.
};

and when building this code:

    // Invoke the pairwise alignment which returns a lazy range over alignment results.
    auto results_example = seqan3::align_pairwise(std::tie(example_alphabet_vector_1, example_alphabet_vector_2), config);
    auto & res_example = *results_example.begin();
    seqan3::debug_stream << "Score: " << res_example.score() << '\n';
    return 0;

I get errors like this:

/home/eclypsium/Workspace/v2d/seqan3/tutorial/seqan3/include/seqan3/alphabet/composite/alphabet_variant.hpp:134:25: error: static assertion failed: The alphabet_variant is currently only tested for alphabets with char_type char. Contact us on GitHub if you have a different use case: https://github.com/seqan/seqan3 .
  134 |     static_assert((std::is_same_v<alphabet_char_t<alternative_types>, char> && ...),
      | 

Looking at the code, it seems that char is hardcoded as the char_type in many places. Is there any way to circumvent this?

Best regards, Andrés Tiraboschi

Hey there,

We had a similar issue in #3271.

Since your alphabet type would fit in a char16_t, it should be possible to modify alphabet_variant to allow for this.

I will try it out tomorrow.
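To make the "fits in a char16_t" argument concrete, here is a minimal standalone sketch. The helper `fits_in_char_type` is hypothetical and not part of SeqAn3; it only checks whether every rank of an alphabet of a given size can be represented in a candidate character type.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper (not a SeqAn3 API): returns true if every rank in
// [0, alphabet_size) can be stored in char_t without truncation.
template <typename char_t>
constexpr bool fits_in_char_type(std::size_t const alphabet_size)
{
    return alphabet_size == 0
        || alphabet_size - 1 <= static_cast<std::size_t>(std::numeric_limits<char_t>::max());
}

// The alphabet from the question has 1333 letters.
static_assert(!fits_in_char_type<char>(1333));    // the largest rank, 1332, exceeds the range of char
static_assert(fits_in_char_type<char16_t>(1333)); // char16_t can hold values up to 65535
```

With this check, any alphabet of up to 65536 letters would pass for char16_t, which is why a char16_t-based alphabet_variant looks sufficient for this use case.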

eseiler, Feb 19 '25 18:02

When I apply this patch

diff --git a/include/seqan3/alphabet/composite/alphabet_variant.hpp b/include/seqan3/alphabet/composite/alphabet_variant.hpp
index 82b035a99..df411d921 100644
--- a/include/seqan3/alphabet/composite/alphabet_variant.hpp
+++ b/include/seqan3/alphabet/composite/alphabet_variant.hpp
@@ -121,18 +121,22 @@ template <typename... alternative_types>
     requires (detail::writable_constexpr_alphabet<alternative_types> && ...) && (std::regular<alternative_types> && ...)
           && (sizeof...(alternative_types) >= 2)
 class alphabet_variant :
-    public alphabet_base<alphabet_variant<alternative_types...>,
-                         (static_cast<size_t>(alphabet_size<alternative_types>) + ...),
-                         char>
+    public alphabet_base<
+        alphabet_variant<alternative_types...>,
+        (static_cast<size_t>(alphabet_size<alternative_types>) + ...),
+        std::conditional_t<(std::same_as<alphabet_char_t<alternative_types>, char> && ...), char, char16_t>>
 {
 private:
     //!\brief The base type.
-    using base_t = alphabet_base<alphabet_variant<alternative_types...>,
-                                 (static_cast<size_t>(alphabet_size<alternative_types>) + ...),
-                                 char>;
-
-    static_assert((std::is_same_v<alphabet_char_t<alternative_types>, char> && ...),
-                  "The alphabet_variant is currently only tested for alphabets with char_type char. "
+    using base_t = alphabet_base<
+        alphabet_variant<alternative_types...>,
+        (static_cast<size_t>(alphabet_size<alternative_types>) + ...),
+        std::conditional_t<(std::same_as<alphabet_char_t<alternative_types>, char> && ...), char, char16_t>>;
+
+    static_assert(((std::is_same_v<alphabet_char_t<alternative_types>, char>
+                    || std::is_same_v<alphabet_char_t<alternative_types>, char16_t>)
+                   && ...),
+                  "The alphabet_variant is currently only tested for alphabets with char_type char or char16_t. "
                   "Contact us on GitHub if you have a different use case: https://github.com/seqan/seqan3 .");
 
     //!\brief Befriend the base type.

it seems to work just fine:

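#include <ranges> // std::views::transform
#include <tuple>  // std::tie
#include <vector> // std::vector

#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alphabet/alphabet_base.hpp>
#include <seqan3/core/debug_stream.hpp>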

namespace example
{

class example_alphabet : public seqan3::alphabet_base<example_alphabet, 1333, char16_t>
{
    using base_t = seqan3::alphabet_base<example_alphabet, 1333, char16_t>;

public:
    using base_t::base_t;

    static constexpr char16_t rank_to_char(rank_type const rank)
    {
        return static_cast<char16_t>(rank);
    }

    static constexpr rank_type char_to_rank(char16_t const chr)
    {
        return static_cast<rank_type>(chr);
    }
};

inline namespace literals
{

constexpr example_alphabet operator""_example(char const c) noexcept
{
    return example_alphabet{}.assign_char(c);
}

constexpr std::vector<example_alphabet> operator""_example(char const * const s, size_t const n)
{
    std::vector<example_alphabet> r;
    r.resize(n);

    for (size_t i = 0; i < n; ++i)
        r[i].assign_char(s[i]);

    return r;
}

} // namespace literals

} // namespace example

int main()
{
    using namespace example::literals;

    std::vector<example::example_alphabet> seq1 = "ACGTGATG!!@@++"_example;
    std::vector<example::example_alphabet> seq2 = "AGTGATACT!!@@++"_example;

    seqan3::configuration cfg = seqan3::align_cfg::method_global{} | seqan3::align_cfg::edit_scheme;
    auto results_example = seqan3::align_pairwise(std::tie(seq1, seq2), cfg);

    auto & res_example = *results_example.begin();
    seqan3::debug_stream << "Score: " << res_example.score() << '\n';

    // char16_t cannot be printed directly, so we need to convert it to char.
    auto adaptor = std::views::transform(
        [](auto const & in)
        {
            auto letter = seqan3::to_char(in);
            return static_cast<char>(letter);
        });

    auto && [p1, p2] = res_example.alignment();
    seqan3::debug_stream << adaptor(p1) << '\n';
    seqan3::debug_stream << adaptor(p2) << '\n';

    // Score: -4
    // ACGTGATG--!!@@++
    // A-GTGATACT!!@@++
}

Seems like alphabet_variant is the only gatekeeper. All other parts are generic and use the rank/char type of the alphabet.

eseiler, Feb 20 '25 14:02

Cool! Thanks, I'll give it a try.