gatk
gatk copied to clipboard
AS_FilterStatus uses an incorrect number of fields for some records
Bug Report
Affected tool
Mutect2
Affected versions
- [x] Latest public release version 4.1.8.1
- [x] Latest master branch as of 9/20/2020
Description
Mutect2’s header defines AS_FilterStatus as follows:
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Filter status for each allele, as assessed by ApplyRecalibration. Note that the VCF filter field will reflect the most lenient/sensitive status across all alleles.">
AS_FilterStatus uses the pipe character | for per-allele concatenation and a comma , for filter concatenation. This causes records to have an incorrect number of values at sites with multiple filters or multiple alleles. Some examples:
chr1 826950 . G T . clustered_events;contamination;map_qual;strand_bias AS_FilterStatus=map_qual,strand_bias,contamination;AS_SB_TABLE=86,101|7,0;DP=199;ECNT=3;GERMQ=93;MBQ=35,34;MFRL=193,211;MMQ=33,27;MPOS=6;NALOD=1.28;NLOD=5.42;POPAF=1.39;ROQ=80;TLOD=8.79 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:36,0:0.05:36:18,0:18,0:24,12,0,0 0/1:151,7:0.055:158:76,4:71,3:62,89,7,0
chr1 3633298 . GT G,GTT . contamination;multiallelic;normal_artifact;slippage;weak_evidence AS_FilterStatus=weak_evidence,contamination|weak_evidence,contamination;AS_SB_TABLE=89,7|7,0|7,0;DP=129;ECNT=1;GERMQ=67;MBQ=20,20,20;MFRL=0,0,0;MMQ=60,60,60;MPOS=17,31;NALOD=-0.2424,0.21;NLOD=6.4,6.36;POPAF=2.49,2.04;ROQ=93;RPA=11,10,12;RU=T;STR;STRQ=1;TLOD=3.04,4.6 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:58,4,4:0.072,0.069:66:28,2,2:28,2,2:54,4,8,0 0/1/2:38,3,3:0.083,0.084:44:21,1,3:15,2,0:35,3,6,0
A quick fix would be to define Number=1 for AS_FilterStatus in the VCF header. Alternatively, using a pipe for filter concatenation and a comma for per-allele concatenation might be more compliant with the VCF specification.
@ldgauthier It seems that we might be violating the VCF spec with the way values are delimited in the AS_FilterStatus annotation. How would you feel about correcting this?
@droazen Are you thinking String with count=1 and then DIY parsing? Despite years of discussions, we never made any progress on lists of lists in the VCF spec.
It looks like that's just what Don did in #6858
I am facing same problem when I tried to merge Mutect2 vcf files (from several patients) to one. I tried to use bcftools merge, but I got a same error. I used Mutect2 of GATK4.1.9.0.
From when does Mutect2 have this error?
I am thinking of downgrade GATK tools such as 4.1.7.0 or older version. Does anyone have information? Which GATK4 Mutect2 version should I try?
Is this issue fixed yet? @TnakaNY I am using GATK v4.1.7.0 and still getting the error wrong number of fields in AS_FilterStatus? while merging Mutect2 VCF files from multiple donors using the bcftools merge.
I am running v4.2.4.0 and still have the issue. It seems that mutect2 is doing the wrong thing of separating the AS_FilterStatus for different allele using | and listing many AS_FilterStatus per alleles using , but the rule is actually the opposite apparently https://github.com/samtools/bcftools/issues/1570
I came across this problem when using bcftools merge on some Mutect2 VCFs and as suggested, a quick fix is to change AS_FilterStatus in the VCF header. However I changed Number to ., which according to the VCF spec:
If the number of possible values varies, is unknown, or is unbounded, then this value should be ‘.’.
Sharing my code below in case it is useful for others:
bcftools head my.vcf \
| perl -nle "if (/ID=AS_FilterStatus,/){ s/Number=A/Number=./ } print" > my.header.txt
bcftools reheader -h my.header.txt my.vcf \
| bgzip > my.vcf.gz
tabix my.vcf.gz
I have the same problem and I use sed to change the header line before merging.
But could you consider changing the allele separator in this field to , and keep the Number=A in the header to allow for easy splitting of the field with bcftools norm -m, as in the examples below:
Number=A - split
zcat test.vcf.gz
##fileformat=VCFv4.2
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT s1 s2
chr1 1000 . A T,C 100 PASS ANNOTATION=aboutT,aboutC GT 0/2 0/1
bcftools norm -m -any test.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_normVersion=1.10.2+htslib-1.10.2
##bcftools_normCommand=norm -m -any test.vcf.gz; Date=Wed Jun 8 20:51:27 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT s1 s2
chr1 1000 . A T 100 PASS AS_FilterStatus=aboutT GT 0/0 0/1
chr1 1000 . A C 100 PASS AS_FilterStatus=aboutC GT 0/1 0/0
Number=. - no split
zcat test1.vcf.gz
##fileformat=VCFv4.2
##INFO=<ID=AS_FilterStatus,Number=.,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT s1 s2
chr1 1000 . A T,C 100 PASS AS_FilterStatus=aboutT|aboutC GT 0/2 0/1
bcftools norm -m -any test1.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AS_FilterStatus,Number=.,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_normVersion=1.10.2+htslib-1.10.2
##bcftools_normCommand=norm -m -any test1.vcf.gz; Date=Wed Jun 8 21:14:22 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT s1 s2
chr1 1000 . A T 100 PASS AS_FilterStatus=aboutT|aboutC GT 0/0 0/1
chr1 1000 . A C 100 PASS AS_FilterStatus=aboutT|aboutC GT 0/1 0/0
Best, Kasia
Yes we made a workflow on Terra to debug this: https://dockstore.org/workflows/github.com/broadinstitute/depmap_omics/omics_mutect2:dev?tab=files this is the fix_mutect2.wdl file I believe
Hello!
I wrote a script to circumvent this error:
#!/usr/bin/python3
# makes Mutect2 output vcf compliant with VCF specification in case of AS_FilterStatus delimiters at multiallelic loci
# ("wrong number of fields in AS_FilterStatus?")
# https://github.com/broadinstitute/gatk/issues/6857
# usage: cat sample.vcf | correct_mutect.py | bgzip > sample.right.vcf.gz && tabix -p vcf sample.right.vcf.gz
import sys
if __name__ == "__main__":
for l in sys.stdin:
if l[0] != "#":
cols = l.split("\t")
# if "," in cols[4]: # not sure if the correction is needed at biallelic sites
if True:
info = cols[7].split(";")
for i in range(len(info)):
a = info[i]
if a[:16] == "AS_FilterStatus=":
b = list(map(lambda c: c.split(","), a[16:].split("|")))
info[i] = "AS_FilterStatus=" + ",".join(map(lambda c: "|".join(c), b))
cols = cols[:7] + [";".join(info)] + cols[8:]
l = "\t".join(cols)
print(l, end = "")
Hello!
Same problem with comma in Description field in VCF header.
I wrote a awk script to prevent error in further tools that can not deal with this format.
fix_vcf_header_with_comma_in_description.awk
BEGIN {
sep=";" # alternative separator
}
{
# header line with comma in Description
if ( ($0 ~ /^##INFO/ || $0 ~ /^##FORMAT/) && $0 ~ /Description=\".*,.*\"/ ) {
match($0, /^(.*)(Description=\".*\")(.*)$/, arr);
gsub(",",sep,arr[2]);
print arr[1] arr[2] arr[3]
# other lines
} else {
print $0
}
}
Usage:
awk -f fix_vcf_header_with_comma_in_description.awk example.vcf
cat example.vcf | awk -f fix_vcf_header_with_comma_in_description.awk
bcftools view example.vcf | awk -f fix_vcf_header_with_comma_in_description.awk