gatk icon indicating copy to clipboard operation
gatk copied to clipboard

AS_FilterStatus uses an incorrect number of fields for some records

Open DonFreed opened this issue 5 years ago • 12 comments

Bug Report

Affected tool

Mutect2

Affected versions

  • [x] Latest public release version 4.1.8.1
  • [x] Latest master branch as of 9/20/2020

Description

Mutect2’s header defines AS_FilterStatus as follows:

##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Filter status for each allele, as assessed by ApplyRecalibration. Note that the VCF filter field will reflect the most lenient/sensitive status across all alleles.">

AS_FilterStatus uses the pipe character | for per-allele concatenation and a comma , for filter concatenation. This causes records to have an incorrect number of values at sites with multiple filters or multiple alleles. Some examples:

chr1    826950  .       G       T       .       clustered_events;contamination;map_qual;strand_bias    AS_FilterStatus=map_qual,strand_bias,contamination;AS_SB_TABLE=86,101|7,0;DP=199;ECNT=3;GERMQ=93;MBQ=35,34;MFRL=193,211;MMQ=33,27;MPOS=6;NALOD=1.28;NLOD=5.42;POPAF=1.39;ROQ=80;TLOD=8.79       GT:AD:AF:DP:F1R2:F2R1:SB        0/0:36,0:0.05:36:18,0:18,0:24,12,0,0    0/1:151,7:0.055:158:76,4:71,3:62,89,7,0
chr1    3633298 .       GT      G,GTT   .       contamination;multiallelic;normal_artifact;slippage;weak_evidence    AS_FilterStatus=weak_evidence,contamination|weak_evidence,contamination;AS_SB_TABLE=89,7|7,0|7,0;DP=129;ECNT=1;GERMQ=67;MBQ=20,20,20;MFRL=0,0,0;MMQ=60,60,60;MPOS=17,31;NALOD=-0.2424,0.21;NLOD=6.4,6.36;POPAF=2.49,2.04;ROQ=93;RPA=11,10,12;RU=T;STR;STRQ=1;TLOD=3.04,4.6      GT:AD:AF:DP:F1R2:F2R1:SB        0/0:58,4,4:0.072,0.069:66:28,2,2:28,2,2:54,4,8,0        0/1/2:38,3,3:0.083,0.084:44:21,1,3:15,2,0:35,3,6,0

A quick fix would be to define Number=1 for AS_FilterStatus in the VCF header. Alternatively, using a pipe for filter concatenation and a comma for per-allele concatenation might be more compliant with the VCF specification.

DonFreed avatar Sep 30 '20 22:09 DonFreed

@ldgauthier It seems that we might be violating the VCF spec with the way values are delimited in the AS_FilterStatus annotation. How would you feel about correcting this?

droazen avatar Oct 05 '20 18:10 droazen

@droazen Are you thinking String with count=1 and then DIY parsing? Despite years of discussions, we never made any progress on lists of lists in the VCF spec.

ldgauthier avatar Oct 06 '20 15:10 ldgauthier

It looks like that's just what Don did in #6858

ldgauthier avatar Oct 06 '20 16:10 ldgauthier

I am facing same problem when I tried to merge Mutect2 vcf files (from several patients) to one. I tried to use bcftools merge, but I got a same error. I used Mutect2 of GATK4.1.9.0.

From when does Mutect2 have this error?

I am thinking of downgrade GATK tools such as 4.1.7.0 or older version. Does anyone have information? Which GATK4 Mutect2 version should I try?

TnakaNY avatar Dec 21 '20 23:12 TnakaNY

Is this issue fixed yet? @TnakaNY I am using GATK v4.1.7.0 and still getting the error wrong number of fields in AS_FilterStatus? while merging Mutect2 VCF files from multiple donors using the bcftools merge.

asntech avatar Aug 04 '21 10:08 asntech

I am running v4.2.4.0 and still have the issue. It seems that mutect2 is doing the wrong thing of separating the AS_FilterStatus for different allele using | and listing many AS_FilterStatus per alleles using , but the rule is actually the opposite apparently https://github.com/samtools/bcftools/issues/1570

jkobject avatar Jan 31 '22 15:01 jkobject

I came across this problem when using bcftools merge on some Mutect2 VCFs and as suggested, a quick fix is to change AS_FilterStatus in the VCF header. However I changed Number to ., which according to the VCF spec:

If the number of possible values varies, is unknown, or is unbounded, then this value should be ‘.’.

Sharing my code below in case it is useful for others:

bcftools head my.vcf \
   | perl -nle "if (/ID=AS_FilterStatus,/){ s/Number=A/Number=./ } print" > my.header.txt

bcftools reheader -h my.header.txt my.vcf \
   | bgzip > my.vcf.gz
tabix my.vcf.gz

davetang avatar May 10 '22 02:05 davetang

I have the same problem and I use sed to change the header line before merging. But could you consider changing the allele separator in this field to , and keep the Number=A in the header to allow for easy splitting of the field with bcftools norm -m, as in the examples below: Number=A - split

zcat test.vcf.gz 
##fileformat=VCFv4.2
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T,C     100     PASS    ANNOTATION=aboutT,aboutC        GT      0/2     0/1

bcftools norm -m -any test.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_normVersion=1.10.2+htslib-1.10.2
##bcftools_normCommand=norm -m -any test.vcf.gz; Date=Wed Jun  8 20:51:27 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T       100     PASS    AS_FilterStatus=aboutT  GT      0/0     0/1
chr1    1000    .       A       C       100     PASS    AS_FilterStatus=aboutC  GT      0/1     0/0

Number=. - no split

zcat test1.vcf.gz 
##fileformat=VCFv4.2
##INFO=<ID=AS_FilterStatus,Number=.,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T,C     100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/2     0/1

bcftools norm -m -any test1.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AS_FilterStatus,Number=.,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_normVersion=1.10.2+htslib-1.10.2
##bcftools_normCommand=norm -m -any test1.vcf.gz; Date=Wed Jun  8 21:14:22 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T       100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/0     0/1
chr1    1000    .       A       C       100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/1     0/0

Best, Kasia

kasia-te avatar Jun 08 '22 21:06 kasia-te

Yes we made a workflow on Terra to debug this: https://dockstore.org/workflows/github.com/broadinstitute/depmap_omics/omics_mutect2:dev?tab=files this is the fix_mutect2.wdl file I believe

jkobject avatar Jun 10 '22 20:06 jkobject

Hello!

I wrote a script to circumvent this error:

#!/usr/bin/python3

# makes Mutect2 output vcf compliant with VCF specification in case of AS_FilterStatus delimiters at multiallelic loci
# ("wrong number of fields in AS_FilterStatus?")

# https://github.com/broadinstitute/gatk/issues/6857

# usage: cat sample.vcf | correct_mutect.py | bgzip > sample.right.vcf.gz && tabix -p vcf sample.right.vcf.gz

import sys

if __name__ == "__main__":
    for l in sys.stdin:
        if l[0] != "#":
            cols = l.split("\t")
            # if "," in cols[4]: # not sure if the correction is needed at biallelic sites
            if True:
                info = cols[7].split(";")
                for i in range(len(info)):
                    a = info[i]
                    if a[:16] == "AS_FilterStatus=":
                        b = list(map(lambda c: c.split(","), a[16:].split("|")))
                        info[i] = "AS_FilterStatus=" + ",".join(map(lambda c: "|".join(c), b))
                cols = cols[:7] + [";".join(info)] + cols[8:]
                l = "\t".join(cols)
        print(l, end = "")

dining-philosopher avatar Sep 14 '22 12:09 dining-philosopher

Hello!

Same problem with comma in Description field in VCF header.

I wrote a awk script to prevent error in further tools that can not deal with this format.

fix_vcf_header_with_comma_in_description.awk

BEGIN {
    sep=";"    # alternative separator 
}
{
    # header line with comma in Description
    if ( ($0 ~ /^##INFO/ || $0 ~ /^##FORMAT/) && $0 ~ /Description=\".*,.*\"/ ) {
        match($0, /^(.*)(Description=\".*\")(.*)$/, arr);
        gsub(",",sep,arr[2]);
        print arr[1] arr[2] arr[3]
    # other lines
    } else {
        print $0
    }
}

Usage:

awk -f fix_vcf_header_with_comma_in_description.awk example.vcf
cat example.vcf | awk -f fix_vcf_header_with_comma_in_description.awk
bcftools view example.vcf | awk -f fix_vcf_header_with_comma_in_description.awk

antonylebechec avatar Feb 08 '23 11:02 antonylebechec