Sniffles icon indicating copy to clipboard operation
Sniffles copied to clipboard

Overlapping deletions/insertions

Open SAMtoBAM opened this issue 2 years ago • 8 comments

Hi there,

I have noticed several times that there will be a cluster of small-ish deletions or insertions that appear to frequently overlap one another. What I can say with some certainty, based on assemblies, is that within these region there is generally just a single event covering the entire region labelled by all these small-ish calls.

For example:

chr1 | 160176 | Sniffles2.DEL.9AS0 | N | <DEL> | 60 | GT | PRECISE;SVTYPE=DEL;SVLEN=-174;END=160350;SUPPORT=3;COVERAGE=20,13,13,13,13;STRAND=+-;AF=0.231;STDEV_LEN=6.557;STDEV_POS=6.351 | GT:GQ:DR:DV | 0/1:2:10:3
chr1 | 160246 | Sniffles2.DEL.99S0 | N | <DEL> | 60 | PASS | IMPRECISE;SVTYPE=DEL;SVLEN=-105;END=160351;SUPPORT=5;COVERAGE=20,13,13,13,13;STRAND=+-;AF=0.385;STDEV_LEN=8.173;STDEV_POS=89.196 | GT:GQ:DR:DV | 0/1:27:8:5
chr1 | 160589 | Sniffles2.DEL.9CS0 | N | <DEL> | 60 | PASS | IMPRECISE;SVTYPE=DEL;SVLEN=-572;END=161161;SUPPORT=7;COVERAGE=13,13,13,13,13;STRAND=+-;AF=0.538;STDEV_LEN=31.525;STDEV_POS=29.533 | GT:GQ:DR:DV | 0/1:40:6:7
chr1 | 161141 | Sniffles2.DEL.A6S0 | N | <DEL> | 60 | GT | IMPRECISE;SVTYPE=DEL;SVLEN=-2810;END=163951;SUPPORT=3;COVERAGE=13,13,13,13,13;STRAND=-;AF=0.231;STDEV_LEN=286.728;STDEV_POS=133.910 | GT:GQ:DR:DV | 0/1:2:10:3
chr1 | 161242 | Sniffles2.DEL.ACS0 | N | <DEL> | 60 | PASS | PRECISE;SVTYPE=DEL;SVLEN=-4628;END=165870;SUPPORT=4;COVERAGE=13,13,13,13,13;STRAND=+;AF=0.308;STDEV_LEN=0.000;STDEV_POS=0.000 | GT:GQ:DR:DV | 0/1:14:9:4
chr1 | 163495 | Sniffles2.DEL.ABS0 | N | <DEL> | 60 | GT | IMPRECISE;SVTYPE=DEL;SVLEN=-2419;END=165914;SUPPORT=3;COVERAGE=13,13,13,13,18;STRAND=+-;AF=0.231;STDEV_LEN=369.793;STDEV_POS=99.594 | GT:GQ:DR:DV | 0/1:2:10:3
chr1 | 165058 | Sniffles2.DEL.AAS0 | N | <DEL> | 60 | PASS | IMPRECISE;SVTYPE=DEL;SVLEN=-869;END=165927;SUPPORT=4;COVERAGE=13,13,13,13,18;STRAND=+-;AF=0.308;STDEV_LEN=111.723;STDEV_POS=9.192 | GT:GQ:DR:DV | 0/1:14:9:4
chr1 | 166033 | Sniffles2.DEL.ADS0 | N | <DEL> | 60 | PASS | PRECISE;SVTYPE=DEL;SVLEN=-115;END=166148;SUPPORT=9;COVERAGE=13,13,13,13,18;STRAND=+-;AF=0.692;STDEV_LEN=0.000;STDEV_POS=0.000 | GT:GQ:DR:DV | 0/1:14:4:9

Above there are several deletions covering the regions 160176-165927 on chromosome 1, which is more simply just a single deletion of that entire region.

Are all these small calls and their overlaps informative or an issue?

Thanks a lot

SAMtoBAM avatar Feb 10 '22 07:02 SAMtoBAM

Hello,

this may be caused by deletions being split up within an alignment across highly repetitive regions. We recommend running Sniffles2 with a tandem repeat annotations .bed file (which can be downloaded from the Sniffles GitHub, and supplied to --tandem-repeats), which will allow differentiated clustering across these regions.

Hope this helps! Moritz

smolkmo avatar Feb 10 '22 10:02 smolkmo

Although it is a repetitive region, it doesn't technically contain tandem repeats. The region is a isolated transposon which contains inverted repeats on either end but they are only 300bp and separated by ~5.5kb. Do you think providing a bed file highlighting these regions would still help?

SAMtoBAM avatar Feb 10 '22 23:02 SAMtoBAM

Sniffles uses different clustering parameters in the repeat regions to account for SVs being split up into multiple smaller events by read aligners - so yes, even though you are not technically dealing with tandem repeats, providing the regions as .bed file may help. Whether they will be detected as single SV in the end will however depend on multiple factors, including the standard deviation of the start positions of the individual SV signals. Our benchmarks are of course only for tandem repeats. Just as a tip in case you want to provide your own .bed file: I would recommend a single entry for the full region and to make sure it is coordinate sorted before supplying it to Sniffles.

Best, Moritz

smolkmo avatar Feb 11 '22 11:02 smolkmo

Hi Moritz, Thanks for the help here

So I tried what you suggested, added a bed file (below) to the calling process encompassing the entire region containing the transposon. Some results below if you are interested.

What happened is a bit strange. A large deletion now is called prior to the actual deletion site and only the last deletion call remains within the deleted region from the non-bed file call set.

bed:

chr1 160176 165927

vcf:

chr1	154952	Sniffles2.DEL.78S0	N	<DEL>	60	PASS	IMPRECISE;SVTYPE=DEL;SVLEN=-5399;END=160351;SUPPORT=12;COVERAGE=19,19,20,13,13;STRAND=+-;AF=0.706;STDEV_LEN=13.248;STDEV_POS=57.913	GT:GQ:DR:DV	0/1:16:5:12
chr1	166033	Sniffles2.DEL.79S0	N	<DEL>	60	PASS	PRECISE;SVTYPE=DEL;SVLEN=-115;END=166148;SUPPORT=12;COVERAGE=13,13,13,13,18;STRAND=+-;AF=0.923;STDEV_LEN=4.806;STDEV_POS=0.000	GT:GQ:DR:DV	1/1:23:1:12

I then tried adjusting the region within the bed file. Increasing the borders +-2kb didn't change anything However decreasing the size +-2kb did. The deletion prior to the actual deletion is now smaller. The last deletion remains. Plus there are a few overlapping calls as before although fewer.

bed:

chr1 162000 164000

vcf:

chr1	158730	Sniffles2.DEL.7AS0	N	<DEL>	60	PASS	IMPRECISE;SVTYPE=DEL;SVLEN=-2497;END=161227;SUPPORT=12;COVERAGE=19,19,13,13,13;STRAND=+-;AF=0.800;STDEV_LEN=1102.503;STDEV_POS=203.876	GT:GQ:DR:DV	1/1:3:3:12
chr1	160233	Sniffles2.DEL.78S0	N	<DEL>	60	PASS	IMPRECISE;SVTYPE=DEL;SVLEN=-118;END=160351;SUPPORT=8;COVERAGE=20,13,13,13,13;STRAND=+-;AF=0.615;STDEV_LEN=14.716;STDEV_POS=63.604	GT:GQ:DR:DV	0/1:27:5:8
chr1	161242	Sniffles2.DEL.80S0	N	<DEL>	60	PASS	PRECISE;SVTYPE=DEL;SVLEN=-4628;END=165870;SUPPORT=4;COVERAGE=13,13,13,13,13;STRAND=+;AF=0.308;STDEV_LEN=0.000;STDEV_POS=0.000	GT:GQ:DR:DV	0/1:14:9:4
chr1	163495	Sniffles2.DEL.7FS0	N	<DEL>	60	GT	IMPRECISE;SVTYPE=DEL;SVLEN=-2419;END=165914;SUPPORT=3;COVERAGE=13,13,13,13,18;STRAND=+-;AF=0.231;STDEV_LEN=369.793;STDEV_POS=99.594	GT:GQ:DR:DV	0/1:2:10:3
chr1	165058	Sniffles2.DEL.7ES0	N	<DEL>	60	PASS	IMPRECISE;SVTYPE=DEL;SVLEN=-869;END=165927;SUPPORT=4;COVERAGE=13,13,13,13,18;STRAND=+-;AF=0.308;STDEV_LEN=111.723;STDEV_POS=9.192	GT:GQ:DR:DV	0/1:14:9:4
chr1	166033	Sniffles2.DEL.81S0	N	<DEL>	60	PASS	PRECISE;SVTYPE=DEL;SVLEN=-115;END=166148;SUPPORT=9;COVERAGE=13,13,13,13,18;STRAND=+-;AF=0.692;STDEV_LEN=0.000;STDEV_POS=0.000	GT:GQ:DR:DV	0/1:14:4:9

Anything I could adjust? most importantly to get rid of this false positive deletion at the beginning of the call? Alternatively, I think I'll just merge overlapping deletions/insertion call without using a bed file. Anything potentially problematic with this? I am testing this on homozygous samples so there should not be an issue with haplotypes.

Thanks again

SAMtoBAM avatar Feb 11 '22 22:02 SAMtoBAM

Hello again!

it seems from your results that your region may not benefit from this mode, which is designed to handle simple repeats, as you noted. In some cases such as the one you are seeing, the alignments will not allow Sniffles to merge SVs - because there is no (or not enough) signal to distinguish between multiple SVs versus a single large one. This can be either due to the aligner used, or more generally due to the structure of the region.

If you are certain that this is one SV biologically, I would also suggest manually merging it. We are however planning to introduce postprocessing scripts for different biological entities (such as transposons) that may lead to complex SVs in the future.

Thanks and best regards, Moritz

smolkmo avatar Feb 14 '22 11:02 smolkmo

Awesome, I'll keep an eye out for this post-processing script For now, at least in these simple homozygous cases, all the overlapping chunks appear to contain simple, single, large events so it looks safe to merge them.

Definitely could be aligner specific. In my case I have used minimap2...so perhaps it would be worthwhile testing NGMLR (if I had the time), especially considering if my memory serves me correct that it was performing better at split read alignment, perhaps better allowing reads to span large deletions/insertions.

Thanks for the help (i'll leave you to close the comment if you are happy to).

SAMtoBAM avatar Feb 15 '22 00:02 SAMtoBAM

Me again, Just to note, I just tried it with NGMLR, as the overlap aggregation of the minimap2 calls didn't work as well as I hoped, and sadly using NGMLR did not solve the problem. In fact now with NGMLR there are a lot of enormous inversion calls which appear around these same regions now. Thanks again

SAMtoBAM avatar Feb 15 '22 02:02 SAMtoBAM

Sounds like your region also leads to disagreement between mappers in terms of how they split up read alignments - this can happen in regions containing complex / fragmented SVs. If you know the biological ground truth, a specific postprocessing script for these cases is probably the best solution, but also requires understanding what exactly happens with the read alignments in those regions (why they get fragmented) and may not deliver the best results, as you noted. If your goal is to find a way to detect other transposons of this kind, what also could be worth a try is identifying a shared signature that they leave in the read alignments (and successive SV calls).

smolkmo avatar Feb 16 '22 11:02 smolkmo