seqan icon indicating copy to clipboard operation
seqan copied to clipboard

Add soft/hard clipping support to FragmentStore

Open holtgrewe opened this issue 11 years ago • 4 comments

The FragmentStore does neither support soft nor hard clipping.

For soft-clipping, the full sequence is in the SEQ field of SAM files and the anchor gaps data structure supports clipping with anchors. This should not be a problem.

For hard clipping, one would have to assume that the full sequence can be loaded into RAM (FragmentStore is not meant for genome alignment). When loading, one could fill the hard clipped region with N characters and when another region is found for the same sequence one could replace N by the actual sequence characters.

holtgrewe avatar Oct 04 '14 12:10 holtgrewe

@esiragusa We should talk about this somewhat more and propose an improvement.

holtgrewe avatar Oct 04 '14 12:10 holtgrewe

Beware, this is not just an enhancement but there is a hidden bug in this issue. In compliance with the SAM format specification (http://samtools.github.io/hts-specs/SAMv1.pdf) the 4th column in a SAM file is the position of the first MAPPED position. Which means that when you perform local alignments with Bowtie2, BWA, etc... you will get most of the time soft-clipped alignments like this: read1 0 ref1 42 0 4S90M6S ... which means that the read1 aligned against ref1 with the 5th nucleotide from read1 starting at position 42 of ref1. However, when such a sam alignment is read into the FragmentStore, you will get the read1 aligned against ref1 with the 1st nucleotide from read1 starting at position 42 of ref1. This is a bug... Probably that some quick and dirty fix could be applyied to resolve this bug, before thinking of a complete support for soft and hard clipping.

ppericard avatar Oct 15 '14 13:10 ppericard

The quick and dirty fix could be to convert/treat e.g. 6S as 6M.

esiragusa avatar Oct 15 '14 13:10 esiragusa

Yes, that would work when your alignment is nearly global and you dont care about the few soft-clipped nucleotides. It's usually what you get with low-error reads like Illumina's. But, when you work with reads with more errors (like in 454), or you align against distant ref, you get CIGAR like this: 300S40M1D30M200S In this case, considering the soft-clipped nucleotides as mismatch doesn't make sense, because you completely loose the alignement information about the 71 aligned nucleotides in this case. I was more thinking of something like, when you get an alignment from the FragmentStore and that this alignment was soft-clipped, the read sequence you get, and the gap object that goes with it, would be trimmed to the start of the alignment and go until the end of the alignment. Would it be feasible ?

ppericard avatar Oct 15 '14 13:10 ppericard