ksw2 icon indicating copy to clipboard operation
ksw2 copied to clipboard

Unexpected behavior of ksw_extd when tlen > qlen + w

Open mimaric opened this issue 3 years ago • 0 comments

I am comparing the behavior of ksw_extd() and ksw_extd2_sse(). One difference I noticed is the case when tlen > qlen + w, i.e. when target is long enough for some rows of the matrix to not contain a band while when doing a "full" alignment (KSW_EZ_EXTZ_ONLY not specified). In that case backtracking starts at position (tlen-1, qlen-1) which is outside of band.

For ksw_extd2_sse() in that case this condition is true: https://github.com/lh3/ksw2/blob/4e0a1ccba8c6ccc87e0342c9712531bde783bf90/ksw2.h#L129 As force_state == 1 this condition is true: https://github.com/lh3/ksw2/blob/4e0a1ccba8c6ccc87e0342c9712531bde783bf90/ksw2.h#L141 so i (target index) keeps being reduced until tlen == qlen + w, at which point we are within the band and a good (although likely not optimal) alignment is found.

For ksw_extd() in this case this condition is true: https://github.com/lh3/ksw2/blob/4e0a1ccba8c6ccc87e0342c9712531bde783bf90/ksw2.h#L132 followed by this one: https://github.com/lh3/ksw2/blob/4e0a1ccba8c6ccc87e0342c9712531bde783bf90/ksw2.h#L143 so in this case j (query index) gets reduced and we are effectively moving away from the band until we reach qlen == 0, after which we move upwards, thus only generating a lot of insertions followed by a lot of deletions.

I would say that in this case ksw_backtrack() should be edited so that for is_rot == false instead of

if (j < off[i]) force_state = 2;
if (off_end && j > off_end[i]) force_state = 1;

there should be

if (j < off[i]) force_state = 1;
if (off_end && j > off_end[i]) force_state = 2;

In this case we would properly move upwards (reducing target index) until reaching the band, like for ksw_extd2_sse().

The same argument holds for cases when qlen > tlen + w, i.e. starting element is right of the band. In that case ksw_extd2_sse() performs insertions (reduces query index) until it reaches the band, whereas ksw_extd() decreases target index index until reaching tlen == 0, after which it just decreases query index, again effectively doing a lot of deletions follow by insertions, completely avoiding the band. The change I proposed above would also solve this issue.

The case when qlen > tlen + w is actually even worse for ksw_extd() as it does not pass off_end: https://github.com/lh3/ksw2/blob/4e0a1ccba8c6ccc87e0342c9712531bde783bf90/ksw2_extd.c#L170 This means that this condition https://github.com/lh3/ksw2/blob/4e0a1ccba8c6ccc87e0342c9712531bde783bf90/ksw2.h#L133 is never true so on this line https://github.com/lh3/ksw2/blob/4e0a1ccba8c6ccc87e0342c9712531bde783bf90/ksw2.h#L134 accesses a wrong element (never set or even out of bound) of p is accessed. To fix this issue one would have to generate off_end and pass it, which is a relatively easy task.

To test this I created a reproducer. It uses w==3 and ​qlen==6 and both query and target have all the same basepairs. For tlen==9 CIGAR has 3 deletions and 6 matches, as expected. For tlen==10 it generates 10 deletions and 6 insertions, in accordance with what I explained above, but significantly worse than 4 deletions and 6 matches it could generate. For ksw_extd2_sse() I did not manage to get the expected result because this check https://github.com/lh3/ksw2/blob/4e0a1ccba8c6ccc87e0342c9712531bde783bf90/ksw2_extd2_sse.c#L134 leads to a zdrop even when zdrop is disabled (zdrop < 0) (not sure if this is by design, but I am planning to file a separate issue). If this check is deleted the generated CIGAR has 4 deletions and 6 matches, as expected.

To run the reproducer copy the code from below into main.cpp and copy ksw2.h, ksw2_extd.c and ksw2_extd2_sse.c to the same folder. Then build with g++ --std=c++17 main.cpp ksw2_extd.c ksw2_extd2_sse.c -o ksw2_repro and execute with ./ksw2_repro

#include <iostream>
#include <vector>

#include "ksw2.h"

void print_cigar(const ksw_extz_t& ez)
{
    if (ez.zdropped == 0) std::cerr << "Not zdropped" << std::endl;
    else std::cout << "ZDROPPED!" << std::endl;

    std::cout << "CIGAR string (operation, length):" << std::endl;
    for(uint32_t cigar_idx = 0; cigar_idx < ez.n_cigar; ++cigar_idx)
    {
        const uint32_t operation = ez.cigar[cigar_idx] & 0xF;
        const uint32_t length    = ez.cigar[cigar_idx] >> 4;
        switch (operation)
        {
            case 0: std::cout << "match:     "; break;
            case 1: std::cout << "insertion: "; break;
            case 2: std::cout << "deletion:  "; break;
            case 3: std::cout << "intron:    "; break;
            default: std::cout << "UNKNOWN OPERATION! ";
        }
        std::cout << length << std::endl;
    }
}

int main()
{
    void* km = nullptr;

    const std::vector<uint8_t> query(6, 0);
    const std::vector<uint8_t> target(10, 0);

    std::cout << "Query length:  " << query.size() << std::endl;
    std::cout << "Target length: " << target.size() << std::endl << std::endl;

    constexpr int8_t m            = 5;
    const std::vector<int8_t> mat = {+2, -4, -4, -4, -1,
                                     -4, +2, -4, -4, -1,
                                     -4, -4, +2, -4, -1,
                                     -4, -4, -4, +2, -1,
                                     -1, -1, -1, -1, -1};

    constexpr int8_t gapo  = 4;
    constexpr int8_t gape  = 2;
    constexpr int8_t gapo2 = 24;
    constexpr int8_t gape2 = 1;

    constexpr int32_t w     = 3;
    constexpr int32_t zdrop = -1;
    constexpr int32_t flag  = 0;

    std::cout << "** Green's formulation (ksw_extd) **" << std::endl;

    ksw_extz_t ez_green;
    ksw_reset_extz(&ez_green);
    ez_green.m_cigar = 2;
    ez_green.cigar = (uint32_t*) malloc(ez_green.m_cigar * sizeof(uint32_t));
    ksw_extd(km,
            query.size(),
            query.data(),
            target.size(),
            target.data(),
            m,
            mat.data(),
            gapo,
            gape,
            gapo2,
            gape2,
            w,
            zdrop,
            flag,
            &ez_green);
    print_cigar(ez_green);

    std::cout << std::endl <<  "** Suzuki's formulation (ksw_extd2_sse) **" << std::endl;
    ksw_extz_t ez_suzuki;
    ksw_reset_extz(&ez_suzuki);
    ez_suzuki.m_cigar = 2;
    ez_suzuki.cigar = (uint32_t*) malloc(ez_suzuki.m_cigar * sizeof(uint32_t));
    ksw_extd2_sse(km,
                  query.size(),
                  query.data(),
                  target.size(),
                  target.data(),
                  m,
                  mat.data(),
                  gapo,
                  gape,
                  gapo2,
                  gape2,
                  w,
                  zdrop,
                  0,
                  flag,
                  &ez_suzuki);
    print_cigar(ez_suzuki);

    free(ez_green.cigar);
    free(ez_suzuki.cigar);

    return 0;
}

mimaric avatar May 12 '21 20:05 mimaric