graphtyper
graphtyper copied to clipboard
better explanation of graphtyper output VCF
Hello,
Can you please provide some more details on the format of graphtyper output VCF? I'm interested essentially in 2 points:
- Variants definitions. Take the following example: Given my merged sites input from SVIMMER containing these variants
chr2 122612415 . T [chr2:122612659[T 0 . SVTYPE=BND;MATEID=MantaBND:101955:0:0:5:0:0:1;CIPOS=0,4;HOMLEN=4;HOMSEQ=TGGC;BND_DEPTH=15;MATE_BND_DEPTH=12;NUM_MERGED_SVS=1;STDDEV_POS=0,0
chr2 122612415 . TTGGCAATAGCGAGGTAGTGGTAACAGTGGGCTTTGGGCAAGACCCAGTTCTGTGCTGGCCTCTGGTCTGACCCAGCACAGTCCCACTGGTGGTAGCCACAGGGGTGCTTCTGTCACTTTATCCTCAGCTCCAGGCAGCCCAGCAGAGAGAAGGAGATTCCATTTGTTTGGGAAAAAGTAAGAGAACCAGTGTCTCTGCCAGGTAATCTAGAGAATTTTTCCAGATTTTATCCAAGACCACCAAGGTCTTGGGCAAGAACCACAGCATTACTGGGCTTGGAGTGCTCCCTAATGCAGAAATAACTTAGATCACAACATCCAAGACTTTTTTTTTTTTTTTTTTTTGCTATC T 0 . END=122612765;SVTYPE=DEL;SVLEN=-350;CIGAR=1M350D;CIPOS=0,3;HOMLEN=3;HOMSEQ=TGG;NUM_MERGED_SVS=1;STDDEV_POS=0,0
The graphtyper output VCF reports
chr2 122612415 chr2:122612415:DG N <DEL:SVSIZE=350:AGGREGATED> 1419 PASS ABHet=0.4462;ABHom=1;AC=11;AF=0.1667;AN=66;END=122612765;MaxAAS=14;MaxAASR=0.5417;MaxAltPP=14;NHet=11;NHomAlt=0;NHomRef=22;NUM_MERGED_SVS=1;PASS_AC=4;PASS_AN=42;PASS_ratio=0.6364;QD=18.13;RefLen=1;SVLEN=350;SVMODEL=AGGREGATED;SVSIZE=350;SVTYPE=DEL;SV_ID=2;SeqDepth=622;VarType=DG;GCF=0.467236
chr2 122612415 chr2:122612415:DG.0 N <DEL:SVSIZE=350:BREAKPOINT> 1497 PASS ABHet=0.4293;ABHom=1;AC=12;AF=0.1818;AN=66;END=122612765;MaxAAS=14;MaxAASR=0.7143;MaxAltPP=14;NHet=12;NHomAlt=0;NHomRef=21;NUM_MERGED_SVS=1;PASS_AC=9;PASS_AN=56;PASS_ratio=0.8485;QD=18.89;RefLen=1;SVLEN=350;SVMODEL=BREAKPOINT;SVSIZE=350;SVTYPE=DEL;SV_ID=2;SeqDepth=604;VarType=DG;GCF=0.467236
chr2 122612415 chr2:122612415:DG.1 N <DEL:SVSIZE=350:COVERAGE> 228 LowQD ABHet=0.375;ABHom=0.9427;AC=8;AF=0.1212;AN=66;END=122612765;MaxAAS=10;MaxAASR=0.5714;MaxAltPP=0;NHet=8;NHomAlt=0;NHomRef=25;NUM_MERGED_SVS=1;PASS_AC=4;PASS_AN=46;PASS_ratio=0.697;QD=4;RefLen=1;SVLEN=350;SVMODEL=COVERAGE;SVSIZE=350;SVTYPE=DEL;SV_ID=2;SeqDepth=588;VarType=DG;GCF=0.467236
chr2 122612415 chr2:122612415:OG T [chr2:122612659[T 678 PASS ABHet=0.2952;ABHom=1;AC=12;AF=0.1818;AN=66;END=122612415;MaxAAS=10;MaxAASR=0.7143;NHet=12;NHomAlt=0;NHomRef=21;NUM_MERGED_SVS=1;PASS_AC=7;PASS_AN=52;PASS_ratio=0.7879;QD=13.84;RefLen=1;SVMODEL=AGGREGATED;SVTYPE=BND;SV_ID=1;SeqDepth=565;VarType=OG;GCF=0.506173
Can you please explain the meaning of the 3 DEL lines? I think they represent the same call, but using 3 different models for genotyping: AGGREGATED
, BREAKPOINT
and COVERAGE
. This create issues downstream since I have essentially redundant calls. Can you provide any suggestion on how to use / combine these 3 outputs? If I want to obtain a clean list of variants, which of the 3 do you suggest to use?
- FILTER tags in the FORMAT fields For filters added to the FORMAT field, the VCF header do not describe the meaning of the various FILTERN tags. Can you provide more details on the meaning of FILTER1, FILTER2, etc. ?
I have the same question regarding the different "FAILN" tags. So I did a quick check on my VCF file, It seems "FAILN" were assigned based on the GQ value, here is what I found:
PASS: 99 >= GQ > 40 FAIL1: 40 >= GQ > 21 FAIL2: 21 >= GQ > 0 FAIL3: GQ = 0
The author also mentioned in their paper that the high confidence filtration used in graphtyper would bring a big loss of true positives, so they didn't set this filtration as default. It might be more reasonable to filter genotypes based on a certain GQ threshold, instead of filtering out all "failed' records.
I am not sure if the GQ value is the only rule to assign different FAILN tags.
Hope that helps.
$ bcftools concat test.lst.genotype.vcf/chr21/*.vcf.gz | grep -v ^# | grep -v "LowQUAL" |grep -v "LowQD" |less -SN ... 235 chr21 46237437 chr21:46237437:DG N DEL:SVSIZE=1322:AGGREGATED 255 PASS ABHet=-1;ABHom=1;AC=2;AF=0.5;AN=4;END=46238759;MaxAAS=22;MaxAASR=1;MaxAltPP=0;NHet=0;NHomAlt=1;NHomRef=1;NUM_MERGED_SVS=1;PASS_AC=2;PASS_AN=2;PASS_ratio= 236 chr21 46237437 chr21:46237437:DG.0 N DEL:SVSIZE=1322:BREAKPOINT 255 PASS ABHet=-1;ABHom=1;AC=2;AF=0.5;AN=4;END=46238759;MaxAAS=25;MaxAASR=1;MaxAltPP=25;NHet=0;NHomAlt=1;NHomRef=1;NUM_MERGED_SVS=1;PASS_AC=2;PASS_AN=4;PASS_ratio 237 chr21 46237437 chr21:46237437:DG.1 N DEL:SVSIZE=1322:COVERAGE 273 PASS ABHet=0.3125;ABHom=1;AC=3;AF=0.75;AN=4;END=46238759;MaxAAS=22;MaxAASR=1;MaxAltPP=0;NHet=1;NHomAlt=1;NHomRef=0;NUM_MERGED_SVS=1;PASS_AC=2;PASS_AN=2;PASS_r
What if all three lines are passes?