dipcall
dipcall copied to clipboard
Interpretation of N's in the VCF
I'm seeing N's in my dipcall v0.3 vcf where there are not Ns in the reference.
The variants lie within the high confidence bedfile.
Any idea what could be going wrong?
tabix prefix.dip.vcf.gz chr1:62784-64408
chr1 62784 . G A 30 . . GT:AD 1|1:0,2
chr1 62915 . G A 30 . . GT:AD 1|1:0,2
chr1 63003 . C T 30 . . GT:AD 1|1:0,2
chr1 63015 . A AAAT 30 . . GT:AD 1|1:0,2
chr1 64401 . N A 30 . . GT:AD 1|1:0,2
chr1 64402 . N C 30 . . GT:AD 1|1:0,2
chr1 64403 . N C 30 . . GT:AD 1|1:0,2
chr1 64404 . N A 30 . . GT:AD 1|1:0,2
chr1 64405 . N A 30 . . GT:AD 1|1:0,2
chr1 64406 . N T 30 . . GT:AD 1|1:0,2
chr1 64407 . N G 30 . . GT:AD 1|1:0,2
chr1 64408 . N C 30 . . GT:AD 1|1:0,2
samtools faidx ref.fa.gz chr1:62784-64408
>chr1:62784-64408
GACTGTTACAGGTTCCAGCAGGATAACTGGGTGGAAATGAGTTTGGTTTCACTTAGTCTC
TCTAAAGAGAAAGCAAGTTGGTAGACTAATACCTAATAAAAGCAAAGCTGCCAACAATTG
AAATTGCCTGGGCTGCTCTGTGTGTCCCACATGCATGGGTGTGGGTGCCAGTGTGTGTGC
GTGTGTGCATGCATGTGCATGTGTGTTGGGATAGAGTGGCAAGAAAATGGGAAATAAGAA
TGTTCAGTCCATAGCCCTTCATTATAAAAAGGTGAGCTGTAATAAATACTAGTGCCACAT
TTAGCCAAAACTTTACTCCAGCCAAAGGTGATATTTTCATGATAACATCCTGTGATTGCT
TTGTTCTTCGTCTTTTATGTTCTTCCTAGATGGGCTCAGAACATACAAGAATTAAGTACA
CATCTTATTTTCCAGTGATAATGCTACCGGCAAATTCTGTTGTTTGTATAAACATCAGCC
ATGTTTATATAACTAAACTAGTGTTTTGTTTTGTCAATTCAGCAAGAAATTAGACCACAT
GGTGGCTTAATGCTGCATTGATTTGGCTATCAATTTGTTTTCACTTTTCTGCAAAATATT
TAATACATTATTAAATTGAATTATGCTGATGCCACAGTTGTTCTTATCTCAATTGTCTTA
AAATTCATTTAATTTTTTTTTCCTTTGGTTTCATTATTCAAATTTTAACTTCAGTTCTCA
ACATTTTATCTGATGGAAGAGATGGAGTCCATTACTAAGGACTCCATTGTGCTCCATCAT
GCCAGAGTTGTAAAATAGATCTTTTAAAGGAAATTTACTGTGATTTTTTTCTATTTAAGA
GCTTCCTCTCCAGTTGAGCATGTAAGAAAATTATACCAGGAGAATACAGTAAACTCTATG
AGGCAAGCTATAAACATGTAGCATTGTGATTAGGGCTGGTTCTCCTTCTAGAGACATGGT
AGGATTGCAATTTCATACCATCCTTGAAGTTAGAGAGAGCCACGTGACTCATTTAGCCAA
TGAACTGTGAGCAGAATGACATGTCACTTCCAGCAGAAGCTTTAAGAATCTGAGAGACAT
TCATACGTTTTCCATGTGCTGTAGCCTTATACCCAAAGCCTGGGTCCCAAGTGACCATGA
CAGGCAGAGCTCCCTGTTGAGCCACAGAGATTTAGAGAATGGCTGTTAACACAGCATAAT
CCAGCCCATCCTGACTAATCTGATATTAACATGTATAATAAAGAATTCTATCAATGCTGA
GGGAAGATGATTAGTTAAGGTCCTAGGTTGCAAGTCTCAAAACCTCTTCTAAGGATTGTA
GACAGGAAATTAAATGACTTCTAGTCCCTAGAGTTCCCAATCTCCTACCATCCCATCCTA
ATATGACAGAAGTAATTCCTGAGTTGCTTCTGAAACCAGAGCTTCCCTCAGAACCCTTAG
CCTGCCAGATGGCTTCTTGGAGAGCCCTCACTCACTTTTCTCCTTCTGCTATTGCTGCTC
ATTCATTCCAGCTTTTAAAAATTCATCTTTATCCAGGAACCTCGCTTCTAGAAAAGTCAT
ACAGGTGCTTCCAGGAGGCTACATGGGCACCCATATTTTTCTAGCCACTTTCATTAGACC
AATGC
printf "chr1\t62784\t64408\n" | bedtools intersect -a prefix.dip.bed -wa -b -
chr1 2810 216558
Looking in IGV, the transition to the dense block of variants with ref Ns is not due to the start of a new alignment - and the variants with N as reference are not supported by the haplotype specific assembly to ref alignments.
Update on this - I just used an alternative tool to decode variants from prefix.hap1.bam to vcf and the resulting vcf looks fine (no Ns) and consistent with the alignments (see IGV screenshot).
medaka tools consensus2vcf hap1.fa.gz ref.fa.gz --bam prefix.hap1.bam --out_prefix medaka_consensus2vcf.hap1
So it suppose this might point to an issue with htsbox pileup?