Add support for svim and sniffles vcf as input

Open oneillkza opened this issue 4 years ago • 5 comments

SVIM and Sniffles are two SV callers designed specifically for long-read sequence data. It would be highly beneficial to be able to use their output as input to MAVIS: to cluster calls with each other, to do somatic calling, and to integrate them with short-read sequence data.

My initial tests suggest that the 'vcf' input in MAVIS crashes for vcfs from both tools. This is somewhat unsurprising given the lack of standardisation for representing SVs in vcf format. So I'll undertake to create load scripts for the vcfs from these two tools.

oneillkza avatar May 15 '20 21:05 oneillkza

OK, so the first issue, looking at sniffles vcfs, is that sniffles has an svtype "DEL/INV", e.g.:

2	232443764	3319_0	N	<DEL/INV>	.	PASS	PRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=2;END=232443951;STD_quant_start=3.911521;STD_quant_stop=4.207137;Kurtosis_quant_start=4.480132;Kurtosis_quant_stop=-1.544728;SVTYPE=DEL/INV;SUPTYPE=NR;SVLEN=-187;STRANDS=+-;RE=16;REF_strand=7,5;AF=0.571429	GT:DR:DV	0/1:12:16

However, there are only two of those, and they both look like false positives/artifacts. I've also noticed that the same variants seem to get called in other samples. I also checked a case where I know there is a combined inv/del event: that event is not called as DEL/INV, but the artifactual ones are.

It's also not clear how these events would fit into the vcf format, since the combined event has three breakpoints, and I believe vcf only allows for specifying two.

I think the correct behaviour would be to ignore lines with SVTYPE=DEL/INV.
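A minimal sketch of that filter, assuming the INFO column is parsed by hand rather than through MAVIS's own loader. The helper names `parse_info` and `is_compound_svtype` are hypothetical, and the record is a shortened copy of the one quoted above:

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict; bare flags (e.g. PRECISE) map to True."""
    info = {}
    for entry in info_field.split(';'):
        if not entry:
            continue
        key, sep, value = entry.partition('=')
        info[key] = value if sep else True
    return info


def is_compound_svtype(info):
    """True when SVTYPE combines two event types with a slash, e.g. DEL/INV."""
    return '/' in str(info.get('SVTYPE', ''))


# Shortened version of the Sniffles record above (tab-separated VCF columns).
record = (
    '2\t232443764\t3319_0\tN\t<DEL/INV>\t.\tPASS\t'
    'PRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=2;END=232443951;SVTYPE=DEL/INV;SVLEN=-187'
)
info = parse_info(record.split('\t')[7])
skip = is_compound_svtype(info)  # rows where this is True would be dropped
```

Checking for the slash rather than listing each compound type keeps the filter independent of which combinations Sniffles emits.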

oneillkza avatar May 16 '20 01:05 oneillkza

There's also a DUP/INS:

4	186894524	6309_0	N	<DUP/INS>	.	PASS	IMPRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=4;END=186895121;STD_quant_start=25.670995;STD_quant_stop=316.374778;Kurtosis_quant_start=6.435367;Kurtosis_quant_stop=-1.961370;SVTYPE=DUP/INS;SUPTYPE=AL,SR;SVLEN=913;STRANDS=+-;RE=5;REF_strand=4,4;AF=0.384615	GT:DR:DV	0/1:8:5

This is a bit of a mess when I look in IGV: there does appear to be a real insertion, and maybe a real duplication, but they're in the middle of a poly(T) region. The breakpoints of the reported variant seem to correspond to the insertion. Also, when I BLASTed the insertion, it seemed to be a real germline variant reported in this paper:

https://www.ncbi.nlm.nih.gov/pubmed/28250455

However, there is only one of these, and again it may be easiest to ignore them, since it isn't going to be clear which of the variants the breakpoints are referring to. I'll also make a ticket over at the Sniffles repo about this.

This is also a note to myself: for insertions, Sniffles reports bp2 = bp1 + svlen. I guess this looks nicer in genome browsers, but will likely need correcting when loading in MAVIS.
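One possible shape for that correction, as a sketch only. It assumes that an insertion should be anchored at a single reference position, so the reported END is replaced by POS. Whether MAVIS wants exactly that position (or POS + 1) is not confirmed here, and `corrected_end` is a hypothetical helper name:

```python
def corrected_end(svtype, pos, end):
    """Return the second breakpoint position to use for a Sniffles call.

    Assumption: for insertions the reported END is POS + SVLEN, which does not
    describe a reference span, so the breakpoint is collapsed back onto POS.
    All other SV types keep their reported END.
    """
    if svtype == 'INS':
        return pos
    return end


# An insertion reported with END = POS + SVLEN (illustrative numbers).
ins_end = corrected_end('INS', 1000, 1000 + 250)
del_end = corrected_end('DEL', 1000, 1250)
```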

oneillkza avatar May 16 '20 01:05 oneillkza

Lastly, sniffles uses the SVTYPE INVDUP for an inverted duplication. This one at least should only have two breakpoints, and it may be best to treat it like a translocation.

Or ignored. The only places these are called in the COLO829 test data are in MT and GL000225.1/GL000220.1. It might be safe to assume that they are always artifactual.

oneillkza avatar May 16 '20 01:05 oneillkza

OK, got it to load in the rows by ignoring any unrecognised SVTYPEs.

Next is to add some checks/fixes for breakpoints being in the wrong order:

AttributeError: ('interval start > end is not allowed', 1, 0)
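A rough sketch of such a fix, assuming the two breakpoints can simply be swapped when they lie on the same chromosome. `order_breakpoints` is a hypothetical helper, not part of MAVIS, and any per-breakpoint orientation or strand information would need to be swapped along with the positions:

```python
def order_breakpoints(chrom1, pos1, chrom2, pos2):
    """Return the breakpoints so that pos1 <= pos2 when both share a chromosome.

    Breakpoints on different chromosomes are returned unchanged, since their
    positions are not comparable.
    """
    if chrom1 == chrom2 and pos1 > pos2:
        return chrom1, pos2, chrom2, pos1
    return chrom1, pos1, chrom2, pos2


# Same-chromosome call reported in the wrong order (illustrative numbers).
fixed = order_breakpoints('2', 232443951, '2', 232443764)
# Inter-chromosomal call: left as-is.
unchanged = order_breakpoints('4', 500, '2', 100)
```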

oneillkza avatar May 20 '20 01:05 oneillkza

Sniffles vcf conversion seems to all be working, and now has test coverage. I've merged that into the long read branch in #211

oneillkza avatar May 27 '20 18:05 oneillkza