pufferfish icon indicating copy to clipboard operation
pufferfish copied to clipboard

Implementation of soft-clipping and CIGAR computation

Open haghshenas opened this issue 3 years ago • 2 comments

Here is a short summary of the changes:

  • [x] Enabling soft-clipping
    • Proper use of KSW2 alignment results: Track the best scoring prefix alignment (during KSW2 extension) in addition to full alignment of query or target
    • Reducing end_bonus value from 10 to 5
    • Setting the alignment start coordinate accordingly
  • [x] Computing CIGAR strings:
    • Setting KSW_EZ_EXTZ_ONLY flag for KSW2 extension alignments (otherwise it always backtrack from the end of both sequences)
    • Modifying CIGARGenerator's variable type to enable computation of CIGAR for cases with overlapping MEMs (uint32_t to int32_t)
    • Updating get_cigar function and removing unnecessary codes
  • [x] Logging more organized debug information for alignment validation
  • [x] Improving the command-line interface
    • replacing --allowSoftclip and --allowOverhangSoftclip with --computeCIGAR
    • adding a --debug; if not used, no debug information will be printed

With these changes I compared the output of PuffAligner with Minimap2 for single-end reads. This comparison showed that generally, the alignment is computed correctly in most of the cases.

The next essential step: In some cases the alignment is wrong. The reason is that the extension is stopped (ez->stopped==1) and therefore, the score for reaching the end of query is not calculated properly. As a result, it might be the case that score for reaching end of query is incorrectly higher than the maximum scoring prefix alignment. Here is an example:

pufferfish alignment
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCN
||||||||||||||||||||||||||||||||||||x|||||||||||||||||||||||||||||||||||||||||x|x|x|||x|||x||||||||x
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTCTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGGTAGCATCCCCAGGAATGGGCA

minimap alignment
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAG----------------------
||||||||||||||||||||||||||||||||||||x|||||||||||||||||||||||||||||||||||||||||
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTCTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGGTAGCATCCCCAGGAATGGGCA

It seems that extension stopping was not written with soft-clipping and computation of CIGAR strings in mind, so this requires a fix. I will check if the original KSW2 extension fixes this issue and if it does, we can discuss about whether it's worth to have the stopping feature or not.

So next items:

  • [x] Checking if using original KSW2 extension fixes edge case wrong alignments
  • [x] Testing and verifying alignment coordinate, soft-clipping and CIGAR string for overhanging reads
  • [x] Testing for PE datasets
  • [x] Adding an option for passing end_bonus value
  • [x] Code cleaning (currently there is a lot of commented code in this branch)

haghshenas avatar Apr 07 '21 00:04 haghshenas

I looked at the four commits and all the changes look great!

I tested the soft-clipping with a couple of small examples and I found a case with the end2end mode where we don't expect to see any soft-clipped alignment, but we do. Here is the example:

query:
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCTGTGTGTGTGTGTCCCCCCC
target:
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCTGTGTGTGTGTGT
or
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCTGTGTGTGTGTGTT

For these targets, when running the command pufferfish align -i new.index/ --read read.fa -o out.sam --end2end results in the aligning the read with soft-clipped bases: 113M6S and 112M7S Maybe, in this case, we wanna report insertions (I) instead of (S).

Also in the case of running without the end2end flag, the alignment scores seem wrong.

I have some fixes which I can commit, but we can also have a discussion before going forward.

mohsenzakeri avatar Jun 11 '21 04:06 mohsenzakeri

Based on my review and the conversations we had about the changes before, everything looks good to me.

mohsenzakeri avatar Aug 09 '21 20:08 mohsenzakeri