samclip: Underestimation of alignment end position for long reads

Samclip calculates the alignment end position as the alignment start position plus the length of the read sequence, minus one:

my $end = $start + length($sam[SAM_SEQ]) - 1;

This works fine for Illumina data, but it often falls short of the true alignment end when dealing with long reads that may contain many deletions relative to the reference, because deleted bases span reference positions without appearing in the read sequence. I expect that this will cause samclip to falsely exclude some long-read alignments that are actually soft-clipped at the 3' end of contigs.

I fixed this in teloclip by calculating the alignment length (on the reference) directly from the CIGAR string.
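
For anyone reading along, here is a minimal Python sketch of the general idea. It is not the actual teloclip code: the names `reference_length` and `alignment_end` and the example read are made up for illustration. It assumes the SAM specification's rule that only the M, D, N, = and X operations consume reference bases.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification:
# M (alignment match), D (deletion), N (skipped region), = (match), X (mismatch).
REF_CONSUMING_OPS = set("MDN=X")

# One CIGAR token is a length followed by a single operation character.
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_length(cigar):
    """Return the number of reference bases spanned by a CIGAR string.

    An unavailable CIGAR ("*") contains no tokens and yields 0.
    """
    return sum(
        int(length)
        for length, op in CIGAR_TOKEN.findall(cigar)
        if op in REF_CONSUMING_OPS
    )


def alignment_end(start, cigar):
    """Return the 1-based, inclusive reference end of an alignment."""
    return start + reference_length(cigar) - 1


# Hypothetical read: 100 bases in SEQ, aligned with one 10 bp deletion.
start, cigar, seq_len = 1000, "50M10D50M", 100
seq_based_end = start + seq_len - 1            # start + length(SEQ) - 1
cigar_based_end = alignment_end(start, cigar)  # start + reference span - 1
print(seq_based_end, cigar_based_end)          # prints "1099 1109"
```

In this made-up example the sequence-based estimate stops 10 bases short of the CIGAR-based end, which is the size of the deletion.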

Aug 23 '18 19:08 Adamtaranto

This tool was really only designed for short reads - and I hadn't considered your use case. But you are exactly right! I will have a look at teloclip - I assume you just correct for the I and D operators.

Aug 25 '18 04:08 tseemann

Yep, and also potential splices (N) and mismatches (X). See the lenCIGAR function.

Aug 25 '18 08:08 Adamtaranto

Ah yes, the infamous X and = operators. I've never seen them used in practice. Do any of the nanopore tools use them? For short reads they would make the SAM files way too big.

Aug 25 '18 22:08 tseemann

I haven't seen them in any of my data but figured I should support them just in case.

Aug 26 '18 06:08 Adamtaranto