Underestimation of alignment end position for long-reads

Open Adamtaranto opened this issue 7 years ago • 4 comments

Samclip calculates the alignment end position as the alignment start position plus the length of the read sequence, minus one:

my $end = $start + length($sam[SAM_SEQ]) - 1;

This works fine for Illumina data, but often falls short of the true alignment end when dealing with long reads, which may contain many deletions relative to the reference. I expect that this will cause samclip to falsely exclude some long-read alignments which are actually soft-clipped at the 3' end of contigs.

I fixed this in teloclip by calculating the alignment length (on the reference) directly from the CIGAR string.

Adamtaranto avatar Aug 23 '18 19:08 Adamtaranto
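
To make the approach described above concrete, here is a minimal Python sketch of deriving the reference span from a CIGAR string. It is not teloclip's actual code: the names `reference_span` and `alignment_end` are hypothetical, and the set of reference-consuming operations (M, D, N, =, X) is taken from the SAM specification.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
# I, S, H and P do not advance the position on the reference.
REF_CONSUMING_OPS = set("MDN=X")
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Return the number of reference bases covered by a CIGAR string."""
    if cigar == "*":
        raise ValueError("CIGAR is unavailable ('*')")
    tokens = CIGAR_TOKEN.findall(cigar)
    # Reject strings containing anything the token pattern did not consume.
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return sum(int(length) for length, op in tokens if op in REF_CONSUMING_OPS)


def alignment_end(start, cigar):
    """Return the 1-based inclusive end position on the reference."""
    return start + reference_span(cigar) - 1


# A 100-base read whose alignment contains a 20 bp deletion.
cigar = "50M20D50M"
naive_end = 1000 + 100 - 1               # start + length(SEQ) - 1
cigar_end = alignment_end(1000, cigar)   # start + reference span - 1
print(naive_end, cigar_end)
```

In this example the read-length formula stops 20 bases short of the CIGAR-derived end, which is the underestimate described in this issue.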

This tool was only designed for short reads really - and I hadn't considered your use case. But you are exactly right! I will have a look at teloclip - I assume you just correct for the I and D operations in the CIGAR.

tseemann avatar Aug 25 '18 04:08 tseemann

Yep, and also potential splices and mismatches. See the lenCIGAR function in teloclip.

Adamtaranto avatar Aug 25 '18 08:08 Adamtaranto

Ah yes, the infamous X and = operators. I've never seen them used in practice. Do any of the nanopore tools use them? For short reads they would make the SAM files way too big.

tseemann avatar Aug 25 '18 22:08 tseemann

I haven't seen them in any of my data but figured I should support them just in case.

Adamtaranto avatar Aug 26 '18 06:08 Adamtaranto