Underestimation of alignment end position for long reads
Samclip calculates alignment end position as alignment start position + length of read.
my $end = $start + length($sam[SAM_SEQ]) - 1;
This works fine for Illumina data, but it often falls short of the true alignment end for long reads, which may contain many deletions relative to the reference. I expect this will cause samclip to falsely exclude some long-read alignments that are actually soft-clipped at the 3' end of contigs.
I fixed this in teloclip by calculating the alignment length on the reference directly from the CIGAR string.
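For illustration, here is a minimal Python sketch of the idea. It is not teloclip's actual code: the names `ref_length` and `alignment_end` are hypothetical, and it assumes the SAM convention that the M, D, N, = and X operators consume reference bases while I, S, H and P do not.

```python
import re

# CIGAR operators that consume reference bases, per the SAM specification.
REF_CONSUMING = set("MDN=X")
# One CIGAR element: a run length followed by a single operator character.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def ref_length(cigar):
    """Return the number of reference bases spanned by a CIGAR string.

    Hypothetical helper for illustration only.
    """
    if cigar == "*":  # CIGAR unavailable
        return 0
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)


def alignment_end(start, cigar):
    """Return the 1-based inclusive end position of the alignment on the reference."""
    return start + ref_length(cigar) - 1


# A 45 bp read with a 30 bp deletion, starting at position 100.
cigar = "20M30D20M5S"
print(alignment_end(100, cigar))  # 169
print(100 + 45 - 1)               # 144, the end given by start + read length - 1
```

In this example the read-length calculation puts the end 25 bp short of where the alignment actually ends on the reference.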
This tool was really only designed for short reads, and I hadn't considered your use case. But you are exactly right! I will have a look at teloclip. I assume you just correct for I and D tags.
Yep, also potential splices and mismatches. See lenCIGAR function.
Ah yes, the infamous X and = operators. I've never seen them used in practice. Do any of the nanopore tools use them? For short reads they would make the SAM files way too big.
I haven't seen them in any of my data but figured I should support them just in case.