seqan3 icon indicating copy to clipboard operation
seqan3 copied to clipboard

several same-score pairwise alignments

Open notestaff opened this issue 8 months ago • 6 comments

Platform

  • SeqAn version: 3.3.0
  • Operating system:

Linux bpb23-acc 5.4.0-136-generic #153-Ubuntu SMP Thu Nov 24 15:56:58 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Compiler: x86_64-conda-linux-gnu-c++ (conda-forge gcc 12.3.0-2) 12.3.0

Question

In pairwise_align(), when several possible alignments have the same score, how does the code pick which one to return? Can it be asked to return all of them (lazily), like BioPython's Bio.Align.PairwiseAligner does? Thanks!

@eseiler

notestaff avatar Oct 11 '23 16:10 notestaff

Hey @notestaff,

if there are multiple valid alignments, the traceback will choose in this order:

  1. Diagonal (match)
  2. Vertical (gap in the first sequence)
  3. Horizontal (gap in the second sequence)

The alignments are always evaluated lazily, i.e., they are only computed if you access them.

As far as I am aware, there is no way to force an output of all possible (alternative) alignments.

Click to show example
#include <vector>

#include <seqan3/alignment/aligned_sequence/debug_stream_alignment.hpp>
#include <seqan3/alignment/configuration/align_config_edit.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>

int main()
{
    using namespace seqan3::literals;

    std::vector<seqan3::dna4> const sequence_1{"ACGT"_dna4};
    std::vector<seqan3::dna4> const sequence_2{"AGCT"_dna4};

    auto config = seqan3::align_cfg::method_global{} | seqan3::align_cfg::edit_scheme;

    for (auto const & result : seqan3::align_pairwise(std::tie(sequence_1, sequence_2), config))
        seqan3::debug_stream << result << '\n';
}

Will output:

{sequence1 id: 0, sequence2 id: 0, score: -2, begin: (0,0), end: (4,4), 
alignment:
      0     .
        A-CGT
        | | |
        AGC-T
}

An alternative is:

ACG-T
| | |
A-GCT

but the algorithm will prefer putting the gap in the first sequence first.


@rrahn will also have a look at your questions and correct me/expand on my answers if necessary :)

What would the use case be for iterating over all alternative alignments? Please don't hesitate to ask more questions if something is unclear.

eseiler avatar Oct 15 '23 10:10 eseiler

Hi @eseiler and @rrahn,

Thanks for the response (and for seqan generally). My current use case involves converting some code from python/biopython to C++/seqan3, and trying to ensure that the new code exactly reproduces the old results. I've realized now that the code path which used alignment enumeration is inactive, so don't have an immediate need for that function. However, it would help a lot if there was a systematic way to exactly reproduce, in seqan3, biopython's choice of alignment from among the top-scoring alignments. Maybe, the order in which the traceback chooses directions could be explicitly parameterized (with compile-time choices)? Then I could choose the order corresponding to biopython's implementation.

notestaff avatar Oct 17 '23 16:10 notestaff

Hi @notestaff,

thanks for writing us. If I understand you correctly, then you are interested in all cooptimal alignments that yield an optimal alignment score? The short answer is: It is indirectly possible by forcing align_pairwise(...) to output the trace matrix:

#include <seqan3/alignment/configuration/align_config_debug.hpp>
#include <seqan3/alignment/configuration/align_config_gap_cost_affine.hpp>
#include <seqan3/alignment/configuration/align_config_method.hpp>
#include <seqan3/alignment/configuration/align_config_scoring_scheme.hpp>
#include <seqan3/alignment/matrix/detail/debug_matrix.hpp>
#include <seqan3/alignment/matrix/detail/trace_directions.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>

auto align_config = seqan3::align_cfg::method_global{} | 
                    seqan3::align_cfg::scoring_scheme{
                        seqan3::nucleotide_scoring_scheme{seqan3::match_score{4},
                                                          seqan3::mismatch_score{-5}}
                    } |
                    seqan3::align_cfg::gap_cost_affine{seqan3::align_cfg::open_score{-10},
                                                       seqan3::align_cfg::extension_score{-1}
                    } |
                    seqan3::align_cfg::detail::debug{};

auto source = "AACCGGTTAACCGGTT"_dna4;
auto target = "ACGTACGTA"_dna4;

using trace_matrix_t = seqan3::detail::two_dimensional_matrix<std::optional<seqan3::detail::trace_directions>>; 
auto alignments = seqan3::align_pairwise(std::tie(source, target), align_config);

for (auto && alignment : alignments) {
    trace_matrix_t trace_matrix{alignment.trace_matrix()};
    // in case you are using local alignments you can use 
    // alignment.sequence1_end_position()/alignment.sequence2_end_position() to find the column and row index where the alignment starts in the trace matrix and 
    // alignment.sequence1_begin_position()/alignment.sequence2_begin_position() to find the column and respectively row index where the alignment ends in the trace matrix.
    // you can then implement a traversal over the trace matrix to enumerate all paths between the end and begin point. 
}

Note that this soulution, however, is not implemented with efficiency in mind but thought as a debug feature to evaluate the correctness of the computed alignments. Still it would give you an entry point to replicate the desired BioPython functionality. You can find the corresponding API documentation of the trace_matrix in the developer API: http://docs.seqan.de/seqan3/3-master-dev/classseqan3_1_1detail_1_1two__dimensional__matrix.html.

If anything is unclear or does not work as expected, just contact us.

rrahn avatar Oct 19 '23 11:10 rrahn