gatk
gatk copied to clipboard
Audit MuTect2 headers to describe fragment vs read annotations
This request was created from a contribution made by Brian Wiley on March 18, 2022 22:33 UTC.
--
394224746911, do you think it would be possible for Broad to start documenting things like this and most of the tags in your VCF files. More like simple things not like the algorithm for clustered events. For instance I have variant from which I can get the depth from the bam file with samtools with a variant called by Mutect at position chr17:60663118 C>T. The depth of my bam file at this position is 170 (after removing duplicates; 222 before removing duplicates which is what shows in IGV).
$ samtools depth C484.TCGA-19-2620-10A-01D-1495-08.5_gdc_realn.bam -r chr17:60663117-60663119
chr17 60663117 172
chr17 60663118 170
chr17 60663119 173
This coincides exactly with IGV (also shows 170 for position 60663118 when remove duplicates is turned on).
$ samtools view C484.TCGA-19-2620-10A-01D-1495-08.5_gdc_realn.bam chr17:60663118-60663118 | wc -l
222
However Mutect has 2 depth DP tags, an INFO/DP and a FMT/DP. The format DP shows a depth of 163 at this position which makes sense that it's at least lower. But the INFO/DP shows a depth of DP=318 which makes almost zero sense (that's almost double!!) and in the VCF file it even indicates some reads are filtered...
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
Additionally the AS_SB_TABLE shows "164,135|3,2" in which the total is 304 which again makes almost zero sense. So from a researcher perspective it is absolutely necessary to know how Mutect2 is calculating these numbers else the strand bias filter cannot be trusted at all. I am using version 4.2.1.0 of GATK and below are the variant and the call information:
Variant
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TCGA-19-2620-10A-01D-1495-08
chr17 60663118 . C T . weak_evidence AS_FilterStatus=weak_evidence;AS_SB_TABLE=164,135|3,2;DP=318;ECNT=1;GERMQ=93;MBQ=33,27;
MFRL=182,128;MMQ=60,60;MPOS=17;NALOD=1.99;NLOD=28.89;POPAF=6;ROQ=32;TLOD=6.59;AC=1;AN=2;MQ0=0;samtools_DP=170;SAMPLE=TCGA-19-2620-10A-01D-1495-08;PON_RefDepth=
1532;PON_AltDepth=0;PON_FISHER=0;CSQ=T|stop_gained|HIGH|PPM1D|ENSG00000170836|Transcript|ENST00000305921.8|protein_coding|6/6||ENST00000305921.8:c.1384C>T|ENSP
00000306682.2:p.Gln462Ter|1606|1384|462|Q/*|Caa/Taa|CM131995&COSV59955543||1||SNV|HGNC|HGNC:9277|YES|NM_003620.4||1|P1|CCDS11625.1|ENSP00000306682|O15297.184|A
0A0S2Z4M2.32|UPI0000130FE8|O15297-1||Ensembl||C|C||1||||||||||||||||||||||||||0&1|1&1||||||||MAGLYSLGVSVFSDQGGRKYMEDVTQIVVEPEPTAEEKPSPRRSLSQPLPPRPSPAALPGGEVSGK
GPAVAAREARDPLPDAGASPAPSRCCRRRSSVAFFAVCDGHGGREAAQFAREHLWGFIKKQKGFTSSEPAKVCAAIRKGFLACHLAMWKKLAEWPKTMTGLPSTSGTTASVVIIRGMKMYVAHVGDSGVVLGIQDDPKDDFVRAVEVTQDHKPELPKER
ERIEGLGGSVMNKSGVNRVVWKRPRLTHNGPVRRSTVIDQIPFLAVARALGDLWSYDFFSGEFVVSPEPDTSVHTLDPQKHKYIILGSDGLWNMIPPQDAISMCQDQEEKKYLMGEHGQSCAKMLVNRALGRWRQRMLRADNTSAIVICISPEVDNQGN
FTNEDELYLNLTDSPSYNSQETCVMTPSPCSTPPVKSLEEDPWPRVNSKDHIPALVRSNAFSENFLEVSAEIARENVQGVVIPSKDPEPLEENCAKALTLRIHDSLNNSLPIGLVPTNSTNTVMDQKNLKMSTPGQMKAQEIERTPPTNFKRTLEESNS
GPLMKKHRRNGLSRSSGAQPASLPTTSQRKNSVKLTMRRRLRGQKKIGNPLLHQHRKTVCVC|||||||||||||||||||||||||||||| GT:AD:AF:DP:F1R2:F2R1:SB 0/1:158,5:0.031:163:70,2:79,2:8
7,71,3,2
Call (your guys also sequenced this sample for the TCGA since the center is -08 TCGA-19-2620-10A-01D-1495-08):
NORMAL=$(samtools view -H $normal_bam | /usr/bin/perl -nE 'say $1 if /^\@RG.+\tSM:([ -~]+)/' | head -n 1)
TUMOR=$(samtools view -H $tumor_bam | /usr/bin/perl -nE 'say $1 if /^\@RG.+\tSM:([ -~]+)/' | head -n 1)
/gatk/gatk Mutect2 --java-options "-Xmx8g" -O $1 -R $2 -I $3 -tumor "$TUMOR" -I $4 -normal "$NORMAL" -L $5 --f1r2-tar-gz f1r2.tar.gz #Running Mutect2.
/gatk/gatk LearnReadOrientationModel -I f1r2.tar.gz -O artifact.priors.tar.gz
/gatk/gatk FilterMutectCalls --java-options "-Xmx8g" -R $2 -V $1 -O $6 -ob-priors artifact.priors.tar.gz #Running FilterMutectCalls on the output vcf.
(created from Zendesk ticket #278802)
gz#278802