
Very short soft-clipped ends

Open marcelm opened this issue 1 year ago • 1 comments

All the reads in the phiX test dataset happen to start with a single N base (an artifact of picking the first 100 reads from the run and not random ones). For a read that otherwise matches without errors, StrobeAlign reports the alignment as 1S300=.

I found this to be unexpected because BWA-MEM reports this as 301M (with the N considered to be a mismatch as one can see from the MD tag). BWA-MEM penalizes soft-clipping (option -L) with a default penalty of 5, so it’ll prefer (at most) one mismatch (penalty 4) over soft clipping.
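The trade-off can be sketched as a small score comparison. This is a simplified model and not BWA-MEM's actual code: the function names are hypothetical, and the decision rule (clip only if the end-to-end score is not better than the local score minus the clipping penalty) is an assumption about how such a penalty is typically applied. The scores use the defaults mentioned above (match 1, mismatch 4, clipping penalty 5).

```python
def alignment_score(n_match, n_mismatch, match=1, mismatch=4):
    """Score of an ungapped alignment (illustrative scoring only)."""
    return n_match * match - n_mismatch * mismatch


def prefer_soft_clip(end_to_end_score, local_score, clip_penalty=5):
    """Return True if soft clipping wins under the assumed rule.

    Clipping is chosen when extending to the end of the read does not
    score better than the best local alignment minus the clip penalty.
    """
    return end_to_end_score <= 0 or end_to_end_score <= local_score - clip_penalty


# 301 bp read whose first base is an N, the remaining 300 bases match.
local = alignment_score(300, 0)        # clip the N: 300 matches
end_to_end = alignment_score(300, 1)   # keep the N as one mismatch

print(prefer_soft_clip(end_to_end, local))                  # penalty 5
print(prefer_soft_clip(end_to_end, local, clip_penalty=0))  # no penalty
```

With a penalty of 5, a single mismatch at the end is cheaper than clipping, so the model keeps the full-length alignment; with no penalty it clips. Two mismatches within the first bases (total penalty 8) would tip the model back towards clipping.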

On the other hand, minimap2 also soft clips and reports 1S300M.
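For comparing the output of the three tools, a small helper that extracts the soft-clipped lengths from a CIGAR string can be handy. This is a minimal sketch using only the standard library; the function name `soft_clipped_ends` is hypothetical, and it assumes well-formed CIGAR strings.

```python
import re

# One CIGAR operation: a length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_ends(cigar):
    """Return (left, right) soft-clip lengths of a CIGAR string."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard clips may precede or follow soft clips; ignore them here.
    ops = [(length, op) for length, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


for cigar in ("1S300=", "301M", "1S300M"):
    print(cigar, soft_clipped_ends(cigar))
```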

I think that penalizing soft clipping is beneficial when aligning short reads. It is not important for shotgun sequencing, but for targeted sequencing (amplicons), soft clipping single bases introduces a bias: because all reads start at the same reference position, any variation at that position is never part of an alignment and therefore cannot be observed. For minimap2, it’s not so important because it is primarily (AFAIK) intended for longer reads.

This is probably not a high-priority issue, but I wanted to at least write it down because I was surprised when inspecting the test BAM output.

marcelm avatar Sep 07 '22 13:09 marcelm

I agree with this. It is probably an artefact of using SSW in local alignment mode. I used ksw2 before, but found its extension mode to occasionally yield some strange alignments over indels, hence the switch (anecdata; unfortunately I didn't log the event anywhere). Finding the right third-party extension alignment tool is left as future work.

ksahlin avatar Sep 09 '22 07:09 ksahlin