svaba
Duplicated breakend IDs
Hi there,
Thanks for the great software!
I have found that some entries in the output VCF have duplicate IDs. This is one example, but it is happening in most of our samples:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal.bam tumour.bam
chr10 38825015 129432969:1 G ]chr17:26809867]G 0 LOWMAPQDISC DISC_MAPQ=255;EVDNC=DSCRD;IMPRECISE;MAPQ=-1;MATEID=129432969:2;MATENM=-1;NM=-1;NUMPARTS=0;SCTG=chr10:38825015(-)-chr17:26809867(+)__17_26803001_26828001D;SPAN=-1;SVTYPE=BND GT:AD:DP:GQ:PL:SR:DR:LR:LO 0/0:0:64:17.7:0,17.7,194.7:0:0:17.98:0 0/0:3:171:37.9:0,37.9,505:0:3:38.14:3.468
chr15 17082264 129432969:1 A A[chr17:21933248[ 3 LOWMAPQDISC DISC_MAPQ=255;EVDNC=DSCRD;IMPRECISE;MAPQ=-1;MATEID=129432969:2;MATENM=-1;NM=-1;NUMPARTS=0;SCTG=chr15:17082264(+)-chr17:21933248(-)__17_21927501_21952501D;SPAN=-1;SVTYPE=BND GT:AD:DP:GQ:PL:SR:DR:LR:LO 0/0:0:15:4.2:0,4.2,46.2:0:0:4.214:0 0/1:4:35:3.5:3.5,0,82.7:0:4:-3.37:7.935
chr17 21933248 129432969:2 A ]chr15:17082264]A 3 LOWMAPQDISC DISC_MAPQ=13;EVDNC=DSCRD;IMPRECISE;MAPQ=13;MATEID=129432969:1;MATENM=-1;NM=-1;NUMPARTS=0;SCTG=chr15:17082264(+)-chr17:21933248(-)__17_21927501_21952501D;SPAN=-1;SVTYPE=BND GT:AD:DP:GQ:PL:SR:DR:LR:LO 0/0:0:15:4.2:0,4.2,46.2:0:0:4.214:0 0/1:4:35:3.5:3.5,0,82.7:0:4:-3.37:7.935
chr17 26809867 129432969:2 T T[chr10:38825015[ 0 LOWMAPQDISC DISC_MAPQ=5;EVDNC=DSCRD;IMPRECISE;MAPQ=5;MATEID=129432969:1;MATENM=-1;NM=-1;NUMPARTS=0;SCTG=chr10:38825015(-)-chr17:26809867(+)__17_26803001_26828001D;SPAN=-1;SVTYPE=BND GT:AD:DP:GQ:PL:SR:DR:LR:LO 0/0:0:64:17.7:0,17.7,194.7:0:0:17.98:0 0/0:3:171:37.9:0,37.9,505:0:3:38.14:3.468
Would this be expected behaviour? If so, could you please explain how these duplicated IDs should be interpreted? They are causing a few problems in our downstream analysis.
Thanks very much!
Oy, this is a bug... I think what is happening is that I am dedicating 30 bits of space for that ID but then assigning it a full 32 bits, so my hash-collision check is failing: the 32 bits that get checked actually roll over into who-knows-what in the 30-bit storage.
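To picture the failure mode, here is a minimal sketch (illustrative Python, not svaba's actual C++): if IDs are stored in 30 bits but uniqueness is checked against the full 32-bit hash, two hashes that differ only in the top bits pass the collision check yet truncate to the same stored ID.

```python
# Illustrative only -- not svaba's actual code. IDs are stored in a
# 30-bit field, but the collision check (incorrectly) compares the
# full 32-bit hash values.
MASK_30 = (1 << 30) - 1  # keep only the low 30 bits

def store_id(h32):
    """Truncate a 32-bit hash into the 30-bit ID field."""
    return h32 & MASK_30

a = 0x00000001  # some 32-bit hash
b = 0x40000001  # differs from `a` only in bit 30

# The 32-bit collision check sees two distinct values...
assert a != b
# ...but after truncation both land on the same stored ID,
# producing a duplicated breakend ID in the output.
assert store_id(a) == store_id(b)
```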
Anyway, I just made a branch "issue92" which should fix this. Could you clone, recompile, and run that branch? I think this will work.
Thanks so much for putting in a fix so quickly. We will try it out and report back.
I presume it would be safe to take the VCF output from the samples we have already run, and replace one of the duplicate IDs with a new one? The reason I ask is because we have already run hundreds of WGS samples that have had this issue, so it would be nice to avoid re-running them all from scratch.
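If it helps with triage, the affected records are easy to find before deciding how to patch them. A hypothetical helper (not part of svaba) that counts repeated ID fields; note that if you rename one copy of a duplicated breakend ID, you must also update the MATEID INFO tag of its mate record to match.

```python
# Hypothetical helper: report VCF record IDs that occur more than once.
# Assumes standard tab-delimited VCF with ID in column 3.
import collections

def duplicated_ids(vcf_lines):
    counts = collections.Counter(
        line.split("\t")[2]
        for line in vcf_lines
        if line and not line.startswith("#")
    )
    return {vid: n for vid, n in counts.items() if n > 1}
```

For the example records above, this would report both 129432969:1 and 129432969:2 as appearing twice.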
You could try the infrequently used "svaba refilter", which takes the breakpoint file and converts it to a VCF. This is what svaba does internally at the end of a run. The breakpoint file contains all of the information from a svaba run and wouldn't be changed by this fix. The conversion to VCF, which is fast, is where the bug surfaced.
Pleased to report we have tested the new branch on one sample and now have no duplicate IDs. Thanks again for the quick fix.
Haven't had a chance to try svaba refilter in isolation yet, but presumably it should work; we'll let you know of any further issues.
Hi again,
I've just managed to test using svaba refilter to remake the VCF files previously affected by the ID issue.
Unfortunately it results in this error:
svaba refilter -i sample.svaba.bps.txt.gz -b sample.bam -g genome.fa
Input bps file: sample.svaba.bps.txt.gz
Output bps file: refilter.filtered.bps.txt.gz
Panel of normals file:
Indel mask BED:
Analysis id: refilter
LOD cutoff (non-REF): 8
LOD cutoff (non-REF, at DBSNP): 6
LOD somatic cutoff: 6
LOD somatic cutoff (at DBSNP): 10
DBSNP Database file:
...read input bps / write output bps file at line 0
svaba: refilter.cpp:206: void runRefilterBreakpoints(int, char**): Assertion `val.at(0) == 't' || val.at(0) == 'n'' failed.
Aborted
It looks like maybe the columns of the bps file are not in the position expected?
1 chr1
2 pos1
3 strand1
4 chr2
5 pos2
6 strand2
7 ref
8 alt
9 span
10 mapq1
11 mapq2
12 nm1
13 nm2
14 disc_mapq1
15 disc_mapq2
16 split
17 cigar
18 alt
19 cov
20 sub_n1
21 sub_n2
22 homology
23 insertion
24 contig
25 numalign
26 confidence
27 evidence
28 quality
29 secondary_alignment
30 somatic_score
31 somatic_lod
32 tlod
33 pon_samples
34 repeat_seq
35 graylist
36 DBSNP
37 reads
38 bxtags
39 n000_PBCP_0005_N.recal.bam
40 t001_PBCP_0005_T.recal.bam
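The assertion message suggests refilter expects each per-sample field to start with 't' or 'n'. As a rough pre-flight check on a bps header (an assumption based only on the assertion text and the column list above, not on svaba's actual parsing), something like:

```python
# Assumption: after the 38 fixed bps.txt columns, the remaining header
# fields are per-sample BAM names that refilter expects to begin with
# 't' (tumour) or 'n' (normal), as the failed assertion suggests.
N_FIXED_COLUMNS = 38  # per the column list above

def sample_columns_ok(header_fields):
    samples = header_fields[N_FIXED_COLUMNS:]
    return bool(samples) and all(s[0] in ("t", "n") for s in samples)
```

The sample columns shown above (n000_..., t001_...) do start with 'n' and 't', so the failure may instead come from a row whose sample field is empty or shifted.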
Thanks again for your help.
I am getting the same error. Is there any solution that you would recommend? Thanks so much!
svaba refilter -i $file -b EA5040566.EA5040598.svaba.contigs.bam -g hg38.fa
Input bps file: EA5040545.EA5040598.svaba.r2.bps.txt
Output bps file: refilter.filtered.bps.txt.gz
Panel of normals file:
Indel mask BED:
Analysis id: refilter
LOD cutoff (non-REF): 8
LOD cutoff (non-REF, at DBSNP): 6
LOD somatic cutoff: 6
LOD somatic cutoff (at DBSNP): 10
DBSNP Database file:
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
...read input bps / write output bps file at line 0
svaba: refilter.cpp:206: void runRefilterBreakpoints(int, char**): Assertion `val.at(0) == 't' || val.at(0) == 'n'' failed
I think the issues are now related just to refilter, which was a sort of back-end convenience function I used at one point. To keep the maintenance simpler I've decided not to support this given time constraints -- but would welcome a PR. The original bug with the duplicated breakend IDs should be solved for svaba run.