seqan3 Global MSA with free end-gaps for a subset of sequences

Global MSA with free end-gaps for a subset of sequences

Open Robin-Rounthwaite opened this issue 3 years ago • 9 comments

Platform

SeqAn version: 2.4.0 (plan to update to seqan3 soon)
Operating system: Linux robin-Surface-Book 5.12.2-surface #1 SMP Tue May 11 18:20:34 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Compiler: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Question

Is there a way to perform a global MSA that gives some sequences free end-gaps? I have a use case that requires that most of the sequences to be aligned globally, but permits a few sequences to cease alignment in the middle of other sequences.

My Use Case: I'm building a tool for vg that uses Seqan's MSA to simplify subgraphs of a genome variation graph. To do this, I:

find complicated subgraphs within a graph,
extract the sequences used to make that subgraph,
realign them using Seqan's MSA
Transform the MSA into a new subgraph, which allows me to
Replace the old, complex subgraph with the new, hopefully simpler, subgraph.

Depending on how the graph is constructed, this approach can simplify the way the graph represents regions with complex genetic variation.

The problem is, some of the subgraphs I'm simplifying have sequences that either start or end in the middle of the complex subgraph. I don't want to penalize a gap for those sequences on their "free end," because they're not supposed to stretch across the subgraph from beginning end.

If, for example, I have a sequence within the subgraph that started outside the subgraph, but ends inside the subgraph, I'd want to penalize an end-gap for the left side of the sequence, but not the right. Is there a way to do that, while still performing a global MSA for the rest of the sequences in the subgraph?

A second sub-question: Is there a way to guarantee that a given sequence will align both its first and last character? For my particular use-case, it wouldn't make sense to allow the string to end on a gap-alignment, since that would mean that the entire graph could plausibly end at the string's last character, which isn't the case.

Hopefully I've described my question, sub-question, and use case in a way that makes sense. Please let me know if there's anything I could further clarify.

Thanks,

Robin

May 20 '21 19:05 Robin-Rounthwaite

Hi Robin,

thank you for your excellent and helpful write-up of your use case.

I'm unfortunately not too familiar with MSA and my answer might lack understanding. Furthermore, two of our key-person who understand MSA extensively are not available right now (one left the group and the other one is on parental leave).

Maybe the most important information right at the start: SeqAn3 does not have any MSA functionality, yet. So you might not need to upgrade, yet. If you think that SeqAn2 is deprecated, and you feel the urge to upgrade. That isn't the case, it is in a "feature"-freeze mode, and it is perfectly valid to use SeqAn2.

We have some user feature requests for MSA (see https://github.com/seqan/seqan3/issues/1763), a pending and stale pull request (https://github.com/seqan/seqan3/pull/1990) that uses the seqan2 implementation as the backend and puts that into a seqan3-style frontend, and a general design issue (https://github.com/seqan/product_backlog/issues/104).

If, for example, I have a sequence within the subgraph that started outside the subgraph, but ends inside the subgraph, I'd want to penalize an end-gap for the left side of the sequence, but not the right. Is there a way to do that, while still performing a global MSA for the rest of the sequences in the subgraph?

As I stated, I'm not an expert and haven't used the MSA part of SeqAn2. From what I caught on in our meetings, I think MSA had a concept of "Libraries". If I remember correctly, those represented the Alignments, and from what I heard, you could individually have different alignment methods. (Just from Memory, I'm not sure if I misremembered something). For example one alignment could be computed by using the edit distance (Levenshtein distance), one could just compute the Hamming distance and another one the longest common substring. And those alignments could be put into constructing the MSA.

If my memory is right, I think you could have different alignment computations (each with a different "free-end" configuration) depending on whether your sequence intersect your subgraph or is completely contained.

Is there a way to guarantee that a given sequence will align both its first and last character? For my particular use-case, it wouldn't make sense to allow the string to end on a gap-alignment, since that would mean that the entire graph could plausibly end at the string's last character, which isn't the case.

You mean ACGT and ACGTA should align to

ACG-T   instead of ACGT-
|||                ||||-
ACGTA              ACGTA

I think via a scoring scheme that will be hard. You could penalise gaps (gap-opening for example) more than mismatches altogether, but I think you could always think of sequences that still puts a gap at the end. You could just align a sub-sequence that excludes your first and last character and add the score of that first and last character after the computation. I need to think about this more, maybe there is a more elegant solution to this. What beneficial property would such a forced first and last character match have?

Or do you mean that the complete first and second sequence must be matched (aka global alignment)?

May 21 '21 14:05 marehr

Thank you for your thoughtful response, @marehr! I just got back from some important travels, and I was glad to see your thoughts. They are helpful.

That's good to hear about SeqAn2 - I hadn't realize that.

I will look into libraries for seqan2, and see if I can use them to input already-complete alignments into the MSA, the way you suggest. That would work well for my use-case.

With regards to your last question, the problem may actually be less odd than that. As I think you've understood quite well, there are two types of alignments I'd like to have:

when I have strings that stretch across both beginning and end of my subgraph, I'd want global alignments where the edge-characters are all paired together. E.g.:

ACTTT
| | |
A-T-T 

instead of 
ACTTT
| ||
A-TT-

This way, any old sequence-backed path through the subgraph is conserved in the new subgraph. Note that the first character and last character of the strings will always be identical between these strings, because in the graph they were already collapsed into a single, shared node at the beginning/end of my subgraph.

I've tried to guarantee alignment between these first & last characters by using special characters to represent the first and last characters of the string:

YCTTX
| | |
Y-T-X

The idea is that the aligner will match up the Y's and X's because there isn't any better place to put them. But I still risk having the aligner put gaps on those ends in some edge cases. So, my first follow-up question: Is there a way to ensure that special characters like Y's and X's always align? Perhaps by assigning extremely high penalties for gaps and mismatches for those characters? Alternatively, Is there a way to perform global alignments where free end-gaps are extremely expensive, to force the end-characters to align?

The other type of alignment I'd like to perform is when a substrings start and/or end doesn't stretch across the subgraph's start and/or end nodes. In those cases, I want to prevent the algorithm from introducing gaps at the ends of those sequences. I'll use an example string that stretches across the start node of the graph, but ends before the end node.

An invalid alignment solution, where the final T in my chosen string is associated with a gap in the other string:

(my chosen string)
AC-GT
|| |
ACTG-ACGGGATC
(another string that stretches across start and end)

In the graph, that would make the chosen string's T an "end node," which wouldn't reconnect to the rest of the graph. A dead end in the middle of the subgraph. If the subgraph was in the middle of a chromosome, that would indicate that some individuals had a chromosome that was truncated in the middle, which wasn't the case.

A couple valid solutions, where the chosen string's final T is aligned to another T:

AC-G------T
|| |      |
ACTGACGGGATC

or

ACGT
|| |
AC-TGACGGGATC

Essentially, I want to give one string no end-gap penalty on one side, which allows it to finish alignment long before the rest of the string is finished. But I want to prevent final alignments to a gap. So, my second question: Is it possible to force the final character of my chosen string to align as a match somewhere in the graph?

Thanks for all your help. I really appreciate it.

May 26 '21 17:05 Robin-Rounthwaite

Sorry for the late response, I'm terrible at keeping conversations alive 🙈

I sketched for (1) a solution in seqan3, the basic idea is to align the subsequences without the first and last char and extend the aligned sequences by 1 char on both sides afterwards. Unfortunately, I need to copy the complete gap information of the alignment, I think we could "optimise" such a use case to reduce the number of copy operations.

I think I saw the concept of soft-clips (which your use case reminds me of) in seqan2, maybe we can do the same thing more efficient in seqan2.

#include <seqan3/std/span>

#include <seqan3/alignment/configuration/align_config_edit.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>

template <seqan3::alphabet alphabet_t, seqan3::aligned_sequence aligned_subsequence_t>
auto softclip_aligned_sequence(std::span<alphabet_t> sequence,
                               size_t softclip_begin,
                               size_t softclip_end,
                               aligned_subsequence_t aligned_subsequence)
{
    seqan3::gap_decorator<std::span<alphabet_t>> aligned_sequence{sequence};

    auto aligned_sequence_it = aligned_sequence.begin();

    // skip first softclip_begin many chars; as it is fixed to be not a gap
    aligned_sequence_it = std::ranges::next(aligned_sequence_it, softclip_begin);

    auto maybe_insert_gaps = [&aligned_sequence](auto aligned_sequence_it, size_t gaps)
    {
        if (gaps == 0)
            return aligned_sequence_it;

        aligned_sequence_it = seqan3::insert_gap(aligned_sequence, aligned_sequence_it, gaps);

        // skip last inserted gap
        return ++aligned_sequence_it;
    };

    size_t gaps = 0;
    for (auto && chr: aligned_subsequence)
    {
        if (chr == seqan3::gap{}) // gap char
        {
            // collect "runs" of gaps
            ++gaps;
            ++softclip_end;
            continue;
        }

        // non-gap char
        aligned_sequence_it = maybe_insert_gaps(aligned_sequence_it, gaps);
        assert(chr == *aligned_sequence_it); // should have same char

        ++aligned_sequence_it; // advance read non-gap char

        gaps = 0; // reset "run" of gaps
    }

    aligned_sequence_it = maybe_insert_gaps(aligned_sequence_it, gaps);

    assert(std::ranges::equal(seqan3::views::slice(aligned_sequence, softclip_begin, softclip_end),
                              aligned_subsequence));

    return aligned_sequence;
}

int main()
{
    using namespace seqan3::literals;

    seqan3::dna4_vector sequence1{"AGTAGACTACG"_dna4};
    seqan3::dna4_vector sequence2{"ACGTACGACACG"_dna4};

    seqan3::debug_stream << "sequence1: " << sequence1 << '\n';
    seqan3::debug_stream << "sequence2: " << sequence2 << '\n';

    size_t softclip1_begin = 1;
    size_t softclip2_begin = 1;
    size_t softclip1_end = sequence1.size() - 1;
    size_t softclip2_end = sequence2.size() - 1;

    // create sub-sequences
    std::span subsequence1 = std::span{sequence1}.subspan(softclip1_begin, softclip1_end - softclip1_begin);
    std::span subsequence2 = std::span{sequence2}.subspan(softclip2_begin, softclip2_end - softclip2_begin);

    seqan3::debug_stream << "subsequence1: " << subsequence1 << '\n';
    seqan3::debug_stream << "subsequence2: " << subsequence2 << '\n';

    auto configuration = seqan3::align_cfg::method_global{} | seqan3::align_cfg::edit_scheme;
    for (auto && result: seqan3::align_pairwise(std::tie(subsequence1, subsequence2), configuration))
    {
        // use indices `result.sequence1/2_id()`` to get original sequence // skipped here for simplicity
        std::span unaligned_sequence1{sequence1};
        std::span unaligned_sequence2{sequence2};

        auto [aligned_subsequence1, aligned_subsequence2] = result.alignment();
        seqan3::debug_stream << "aligned_subsequence1: " << aligned_subsequence1 << std::endl;
        seqan3::debug_stream << "aligned_subsequence2: " << aligned_subsequence2 << std::endl;

        // rebuild aligned_sequences
        seqan3::gap_decorator aligned_sequence1
            = softclip_aligned_sequence(unaligned_sequence1, softclip1_begin, softclip1_end, aligned_subsequence1);
        seqan3::gap_decorator aligned_sequence2
            = softclip_aligned_sequence(unaligned_sequence2, softclip2_begin, softclip2_end, aligned_subsequence2);

        seqan3::debug_stream << "aligned_sequence1: " << aligned_sequence1 << std::endl;
        seqan3::debug_stream << "aligned_sequence2: " << aligned_sequence2 << std::endl;
    }
}

outputs:

sequence1: AGTAGACTACG
sequence2: ACGTACGACACG
subsequence1: GTAGACTAC
subsequence2: CGTACGACAC
aligned_subsequence1: -GTA-GACTAC
aligned_subsequence2: CGTACGAC-AC
aligned_sequence1: A-GTA-GACTACG
aligned_sequence2: ACGTACGAC-ACG

I'll try your second approach of adding a custom scoring scheme to heavily penalise start and end chars.

Jun 08 '21 13:06 marehr

Hi Marehr! Thanks for the detailed example solution for guaranteeing flanking matches between two sequences! I currently have an okay-working piece of code for the same problem, with the special start and end chars. I think the last thing I'd need to guarantee that it works is the heavy penalty for mismatches - which, once I find out how to make a custom scoring matrix, should be doable.

I'm especially curious about your thoughts with the second part of my question: Is it possible to force the final character of my chosen string to align as a match somewhere in the graph?

If you have any insights for that, I'd greatly appreciate it. Wishing you well, - Robin

Jun 08 '21 17:06 Robin-Rounthwaite

Is it possible to force the final character of my chosen string to align as a match somewhere in the graph?

For (2) I think this is not possible with our current standard alignment algorithm (seqan2 or seqan3). We can penalise char-specific mismatches, but we can't penalise char-specific gaps.

https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm defines the recursion as:

F[i][j] = max(F[i-1][j-1] + S(A[i], B[i]), F[i][j-1] + d, F[i-1][j] + d)

as image:

Note that d is fixed and completely independent of the current char. That means if we want to have a "standard" alignment (in the sense of cost configuration), we can't differentiate between a special start / end char and a normal middle char. The only way to force a match at the start / end would be to configure a rather high gap opening / extension cost. (Which in turn has no 100% guarantee!)

The recursion above is how we implement it right now for the linear gap scheme. (We actually implement the affine gap scheme, a bit more generalised version, but that does not change the underlying argument).

Funnily, the German wiki has a variant for the recursion https://de.wikipedia.org/wiki/Smith-Waterman-Algorithmus#Matrix-Rekurrenzen (I'm not sure why this is in the Smith-Waterman article, at first glance this seems to be different from the usual recursion):

F[i][j] = max(F[i-1][j-1] + S(A[i], B[i]), F[i][j-1] + S(A[i], '-'), F[i-1][j] + S('-', B[i]))

as image:

This is interesting, because this would allow to associate a cost between a character and a gap symbol.

In your case we could associate "infinite" cost between the special start character and all other chars (including the in/del-gap) and zero cost with itself.

I don't know if this more formal way is easier to understand:

A new alphabet Σ' = Σ x {S, M, E} (each char gets a Start, Middle and End position)

A new score matrix for a,b in Σ ∪ {'-'}

S'((a, M), (b, X)) := S(a, b) if X == S, ∞ otherwise; same scoring as before
S'((a, S), (b, X)) := ∞ if X in {M, E}, 0 otherwise ; start char can only match start chars
S'((a, E), (b, X)) := ∞ if X in {S, M}, 0 otherwise ; end char can only match end chars

This effectively would allow associating different cost on segments of the alignment. I.e. forcing the start char to match, computing a normal alignment in the middle and forcing the end to match; all in one alignment computation.

EDIT:// It is actually a solution to your (1) problem and not exactly a solution for your (2) problem, but I have a feeling that this might lead to a solution.

Jun 08 '21 18:06 marehr

No, this should definitely work! The second example is a semi-global alignment with the twist that an "T" end char in the first sequence has zero cost to any other "T" middle/end char in the second sequence and ∞ otherwise.

Jun 08 '21 18:06 marehr

That would work, indeed! That would be great! But these S/M/E score matrices aren't possible in current versions of Seqan, is that right? How difficult would it be to add this to the library?

Jun 08 '21 20:06 Robin-Rounthwaite

Yeah, this is not implemented, yet. But, I think as this functionality is "just" ™️ a swap of the gap extension definition within the recursion formula, this should be definitely prototype-able.

I guess

https://github.com/seqan/seqan3/blob/b5b35262521fbfa5735d1819b06cc6f70018063b/include/seqan3/alignment/pairwise/detail/policy_affine_gap_recursion.hpp#L119-L140

would be a good starting point. (maybe passing the extension score sequence1/sequence2 in, as we do it for the sequence_score?)

Jun 08 '21 23:06 marehr

Cool! I haven't quite reached the point where this is the most urgent next step for my project. But I oughta give it a shot soon. Thanks for the all the help.

Jun 10 '21 17:06 Robin-Rounthwaite

seqan3 seqan3 copied to clipboard

Global MSA with free end-gaps for a subset of sequences

Platform

Question

seqan3
seqan3 copied to clipboard