svaba icon indicating copy to clipboard operation
svaba copied to clipboard

Duplicated breakend IDs

Open kateeasoncrukci opened this issue 4 years ago • 5 comments

Hi there,

Thanks for the great software!

I have found that some entries in the output VCF have duplicate IDs. This is one example but it is happening in most of our samples:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	normal.bam	tumour.bam
chr10	38825015	129432969:1	G	]chr17:26809867]G	0	LOWMAPQDISC	DISC_MAPQ=255;EVDNC=DSCRD;IMPRECISE;MAPQ=-1;MATEID=129432969:2;MATENM=-1;NM=-1;NUMPARTS=0;SCTG=chr10:38825015(-)-chr17:26809867(+)__17_26803001_26828001D;SPAN=-1;SVTYPE=BND	GT:AD:DP:GQ:PL:SR:DR:LR:LO	0/0:0:64:17.7:0,17.7,194.7:0:0:17.98:0	0/0:3:171:37.9:0,37.9,505:0:3:38.14:3.468
chr15	17082264	129432969:1	A	A[chr17:21933248[	3	LOWMAPQDISC	DISC_MAPQ=255;EVDNC=DSCRD;IMPRECISE;MAPQ=-1;MATEID=129432969:2;MATENM=-1;NM=-1;NUMPARTS=0;SCTG=chr15:17082264(+)-chr17:21933248(-)__17_21927501_21952501D;SPAN=-1;SVTYPE=BND	GT:AD:DP:GQ:PL:SR:DR:LR:LO	0/0:0:15:4.2:0,4.2,46.2:0:0:4.214:0	0/1:4:35:3.5:3.5,0,82.7:0:4:-3.37:7.935
chr17	21933248	129432969:2	A	]chr15:17082264]A	3	LOWMAPQDISC	DISC_MAPQ=13;EVDNC=DSCRD;IMPRECISE;MAPQ=13;MATEID=129432969:1;MATENM=-1;NM=-1;NUMPARTS=0;SCTG=chr15:17082264(+)-chr17:21933248(-)__17_21927501_21952501D;SPAN=-1;SVTYPE=BND	GT:AD:DP:GQ:PL:SR:DR:LR:LO	0/0:0:15:4.2:0,4.2,46.2:0:0:4.214:0	0/1:4:35:3.5:3.5,0,82.7:0:4:-3.37:7.935
chr17	26809867	129432969:2	T	T[chr10:38825015[	0	LOWMAPQDISC	DISC_MAPQ=5;EVDNC=DSCRD;IMPRECISE;MAPQ=5;MATEID=129432969:1;MATENM=-1;NM=-1;NUMPARTS=0;SCTG=chr10:38825015(-)-chr17:26809867(+)__17_26803001_26828001D;SPAN=-1;SVTYPE=BND	GT:AD:DP:GQ:PL:SR:DR:LR:LO	0/0:0:64:17.7:0,17.7,194.7:0:0:17.98:0	0/0:3:171:37.9:0,37.9,505:0:3:38.14:3.468

Would this be expected behaviour? If so, please can you explain what the interpretation of these duplicated IDs would be? It is causing a few problems in our downstream analysis.

Thanks very much!

kateeasoncrukci avatar Nov 09 '20 16:11 kateeasoncrukci

Oy this is a bug... I think what is happening is that I am dedicating 30 bits of space for that ID but then assigning it a full 32 bits, so my hash-collision-check is failing because the 32 bits that get checked actually roll over into who-knows-what in the 30 bit storage.

Anyway, I just made a branch "issue92" which should fix this. Could you clone and recompile and run that branch? I think this will work.

walaj avatar Nov 11 '20 23:11 walaj

Thanks so much for putting in a fix so quickly. We will try it out and report back.

I presume it would be safe to take the VCF output from the samples we have already run, and replace one of the duplicate IDs with a new one? The reason I ask is because we have already run hundreds of WGS samples that have had this issue, so it would be nice to avoid re-running them all from scratch.

kateeasoncrukci avatar Nov 12 '20 09:11 kateeasoncrukci

You could try the infrequently used “svaba refilter” which takes the breakpoint file and converts it to a VCF. This is what svaba does internally at the end of a run. The breakpoint file contains all of the information from a svaba run and wouldn’t be changed by this fix. It’s the conversion to vcf, which is fast, that is where the bug surfaced.

On Nov 11, 2020, at 11:46 PM, kateeasoncrukci [email protected] wrote:

 Thanks so much for putting in a fix so quickly. We will try it out and report back.

I presume it would be safe to take the VCF output from the samples we have already run, and replace one of the duplicate IDs with a new one? The reason I ask is because we have already run hundreds of WGS samples that have had this issue, so it would be nice to avoid re-running them all from scratch.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

walaj avatar Nov 13 '20 18:11 walaj

Pleased to report we have tested the new branch on one sample and now have no duplicate IDs. Thanks again for the quick fix.

Haven't had a chance to try svaba refilter in isolation yet but presumably should work - will let you know any further issues.

kateeasoncrukci avatar Nov 23 '20 13:11 kateeasoncrukci

Hi again,

I've just managed to test using svaba refilter to remake the VCF files previously affected by the ID issue.

Unfortunately it results in this error:

svaba refilter -i sample.svaba.bps.txt.gz -b sample.bam -g genome.fa
Input bps file:  sample.svaba.bps.txt.gz 
Output bps file: refilter.filtered.bps.txt.gz
Panel of normals file: 
Indel mask BED:      
Analysis id: refilter
    LOD cutoff (non-REF):            8
    LOD cutoff (non-REF, at DBSNP):  6
    LOD somatic cutoff:              6
    LOD somatic cutoff (at DBSNP):   10
    DBSNP Database file: 
...read input bps / write output bps file at line 0
svaba: refilter.cpp:206: void runRefilterBreakpoints(int, char**): Assertion `val.at(0) == 't' || val.at(0) == 'n'' failed.
Aborted

It looks like maybe the columns of the bps file are not in the position expected?

1 chr1
2 pos1
3 strand1
4 chr2
5 pos2
6 strand2
7 ref
8 alt
9 span
10 mapq1
11 mapq2
12 nm1
13 nm2
14 disc_mapq1
15 disc_mapq2
16 split
17 cigar
18 alt
19 cov
20 sub_n1
21 sub_n2
22 homology
23 insertion
24 contig
25 numalign
26 confidence
27 evidence
28 quality
29 secondary_alignment
30 somatic_score
31 somatic_lod
32 tlod
33 pon_samples
34 repeat_seq
35 graylist
36 DBSNP
37 reads
38 bxtags
39 n000_PBCP_0005_N.recal.bam
40 t001_PBCP_0005_T.recal.bam

Thanks again for your help.

kateeasoncrukci avatar Jan 07 '21 15:01 kateeasoncrukci

I am getting the same error. Is there any solution that you would recommend ? Thanks so much !

svaba refilter -i $file -b EA5040566.EA5040598.svaba.contigs.bam -g hg38.fa Input bps file: EA5040545.EA5040598.svaba.r2.bps.txt Output bps file: refilter.filtered.bps.txt.gz Panel of normals file: Indel mask BED: Analysis id: refilter LOD cutoff (non-REF): 8 LOD cutoff (non-REF, at DBSNP): 6 LOD somatic cutoff: 6 LOD somatic cutoff (at DBSNP): 10 DBSNP Database file: [W::bam_hdr_read] EOF marker is absent. The input is probably truncated ...read input bps / write output bps file at line 0 svaba: refilter.cpp:206: void runRefilterBreakpoints(int, char**): Assertion `val.at(0) == 't' || val.at(0) == 'n'' failed

tanasa avatar Apr 30 '23 22:04 tanasa

I think the issues are now related just to refilter, which was a sort of back-end convenience function I used at one point. To keep the maintenance simpler I've decided not to support this given time constraints -- but would welcome a PR. The original bug with the duplicated breakend IDs should be solved for svaba run.

walaj avatar May 15 '23 16:05 walaj