
Mutect2 (and HaplotypeCaller) version 4.1.3.0 miscounts the number of alleles when overlapping reads are present, as if "--count-reads" were set to true by default

freeseek opened this issue 4 years ago • 13 comments

I can reproduce this issue with GATK 4.1.3.0:

First, download the mini input.bam file from this Dropbox link: https://www.dropbox.com/sh/78rz5wrhu9zkfzh/AACW9ZPhl4WnD-wmAkKcdHT3a?dl=0

Then set up a GATK working environment:

wget https://github.com/broadinstitute/picard/releases/download/2.19.0/picard.jar

wget https://github.com/broadinstitute/gatk/releases/download/4.1.3.0/gatk-4.1.3.0.zip
unzip gatk-4.1.3.0.zip

wget -O- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | \
  gzip -d > GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

java -jar picard.jar \
  CreateSequenceDictionary \
  R=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  O=GCA_000001405.15_GRCh38_no_alt_analysis_set.dict

Now if I run Mutect2:

gatk-4.1.3.0/gatk \
  Mutect2 \
  -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -I input.bam \
  -O output.vcf.gz \
  -L chr1:233443225-233443225

This will generate a VCF file with one variant:

GT:AD:AF:DP:F1R2:F2R1:SB
0/1:6,21:0.778:27:4,8:0,11:2,4,12,9

With an allelic depth of six supporting the reference.
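
As a quick sanity check (not in the original report), you can count how many reads each fragment contributes at this position; read names showing a count of 2 have overlapping mates, which is what inflates the read-based AD. This assumes input.bam is indexed, since region queries need an index:

samtools view input.bam chr1:233443225-233443225 | \
  cut -f1 | sort | uniq -c | sort -rn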

However, there are only four fragments supporting the reference. If I remove those four fragments from the BAM file:

samtools view -h input.bam | \
  grep -v ":6112\|:10233\|:18618\|:20229" | \
  samtools view -Sb -o input2.bam && \
  samtools index input2.bam
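
As a quick check (using the index just created), the four removed read names should no longer appear in the region, so this should print 0:

samtools view input2.bam chr1:233443225-233443225 | \
  grep -c ":6112\|:10233\|:18618\|:20229"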

And I run Mutect2 again:

gatk-4.1.3.0/gatk \
  Mutect2 \
  -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -I input2.bam \
  -O output.vcf.gz \
  -L chr1:233443225-233443225

It will generate a VCF with the same variant:

GT:AD:AF:DP:F1R2:F2R1:SB
0/1:0,20:0.954:20:0,7:0,11:0,0,11,9

With an allelic depth of zero supporting the reference.

The same problem exists with the HaplotypeCaller.
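
For completeness, the analogous HaplotypeCaller invocation (the output file name here is just illustrative) is:

gatk-4.1.3.0/gatk \
  HaplotypeCaller \
  -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -I input.bam \
  -O output.hc.vcf.gz \
  -L chr1:233443225-233443225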

I believe this is not the intended behavior, and it is really interfering with my use case. Short of removing half of the reads from my BAM file, is there a solution for this?

freeseek avatar Aug 15 '19 16:08 freeseek

I have just realized that this is indeed a regression from GATK 4.0 to GATK 4.1:

wget https://github.com/broadinstitute/gatk/releases/download/4.0.12.0/gatk-4.0.12.0.zip
unzip gatk-4.0.12.0.zip

Running this:

gatk-4.0.12.0/gatk \
  Mutect2 \
  -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -I input.bam \
  -O output.vcf.gz \
  -L chr1:233443225-233443225 \
  --tumor SM

Generates a VCF file with one variant:

GT:AD:AF:DP:F1R2:F2R1:SAAF:SAPP
0/1:3,15:0.799:18:3,7:0,8:0.818,0.818,0.833:0.028,0.025,0.948

So the AD counts from the previous version make sense.

freeseek avatar Aug 15 '19 21:08 freeseek

@freeseek This is a regrettable but temporary regression, made for the sake of ultimately making Mutect2 much more principled. Let me try to explain with a timeline:

  • ~6 months ago: Mutect2 throws away one read whenever mates overlap. This reduces sensitivity unnecessarily, especially for indels, and messes up several annotations, although it does make the ADs come out right.
  • ~3 months ago: we no longer throw out reads, and instead modify base and indel quals of overlapping mates to account for the possibility of PCR error. This improves sensitivity and strand, orientation, and position annotations, but it is a genuine regression in AD.
  • [in 2nd round of code review, probably merged in a week]: Mutect2's ReadLikelihoods matrix forces mates to support the same haplotype and the entire likelihood framework is rewritten to allow pairs (or indeed, arbitrary groups of reads) as the atomic unit of data.
  • [next step, 1 - 2 months?]: rewrite the annotations engine to accept read likelihoods for some annotations and pair likelihoods for others (such as AD).

davidbenjamin avatar Aug 23 '19 01:08 davidbenjamin

@davidbenjamin thank you for your work and for the clarification. Is this going to be code shared by the HaplotypeCaller as well? A lot of the analyses I do depend on the variance of the AD counts being properly modeled, so I look forward to this improvement being implemented.

freeseek avatar Aug 23 '19 16:08 freeseek

@davidbenjamin Perhaps as a stopgap measure while we wait for the ultimate fix we could add an opt-in argument to M2/HC to toggle the old behavior (throw away one read whenever mates overlap)?

droazen avatar Sep 03 '19 19:09 droazen

@droazen I'm loath to do that because the new behavior of Mutect2 is objectively more correct as far as calls are concerned. It's just that for now it doesn't play nicely with the annotation engine. I can make the annotations happen a lot faster than 1-2 months if it's a priority for you, @freeseek.

Is this going to be code shared by the HaplotypeCaller as well?

This and other recent improvements are currently just for Mutect2 but I have tried to keep the software engineering respectable enough to transfer them trivially to HaplotypeCaller given the green light to do so.

davidbenjamin avatar Sep 04 '19 01:09 davidbenjamin

My main issue now is that I need both a fix (or workaround) for https://github.com/broadinstitute/gatk/issues/6045 and an annotation engine that counts fragments rather than reads. What I am afraid of is that even if vruano finds a fix for the other bug, I would still need to have it backported to GATK 4.0.12.0, which is the latest release currently capable of counting fragments.

I would be okay with just dropping evidence if necessary. Is there a way to drop one of two paired-end reads (regardless of whether they overlap or not) on the fly?
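
For example, a preprocessing approximation (rather than something on the fly) would be to keep only first-in-pair reads by excluding the 0x80 ("second in pair") SAM flag; this is just a sketch of the idea, and the output file name is arbitrary:

samtools view -b -F 0x80 -o input_first.bam input.bam && \
  samtools index input_first.bam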

freeseek avatar Sep 04 '19 01:09 freeseek

@davidbenjamin Pinging you on this one -- any thoughts on how we could resolve this?

droazen avatar Jan 21 '20 17:01 droazen

@droazen How hard would it be to have a plug-in framework for fragment-based likelihoods parallel to what we already have for read likelihoods? It would be nice to specify all annotations with -A. Barring that, it won't be hard to write some fragment-based likelihoods and hard-code them into M2 and HC, but it seems clumsy to do so.

davidbenjamin avatar Jan 21 '20 17:01 davidbenjamin

@davidbenjamin If it's just a matter of adjusting what's passed in to the annotation framework, it shouldn't be that hard, but you probably have a better sense than I do of what needs to be done here.

droazen avatar Jan 21 '20 18:01 droazen

It ought to be totally parallel to the current framework, but with an AlleleLikelihoods<Fragment, Allele> replacing AlleleLikelihoods<GATKRead, Allele>. I don't see any other change to the interface. It would be nice to do this without duplicating the class hierarchy of INFO and FORMAT annotations, however.

davidbenjamin avatar Jan 21 '20 18:01 davidbenjamin

Can I ask what the current status of this bug is?

cwhelan avatar Jul 22 '22 15:07 cwhelan

@davidbenjamin and/or @jamesemery, could you please comment?

droazen avatar Jul 22 '22 16:07 droazen

After digging around a little, I see that a FragmentDepthPerAlleleBySample annotator now exists, which produces an FAD FORMAT annotation for fragment depth. Running this on @freeseek's example, adding -A FragmentDepthPerAlleleBySample, gives

GT:AD:AF:DP:F1R2:F2R1:FAD:SB
0/1:6,21:0.808:27:3,6:0,8:3,16:2,4,12,9

With fragment allelic depths of 3 for the reference and 16 for the alt. This could be the desired result if one of the 4 reference fragments was uninformative -- checking the bamout I see that one fragment has one read that supports the ref with low quality but the other read supports the alt with higher quality.
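
For reference, the invocation is along these lines (a sketch reusing the inputs from the top of the thread; it requires a GATK release recent enough to ship FragmentDepthPerAlleleBySample, so the gatk path will differ from the 4.1.3.0 one above):

gatk \
  Mutect2 \
  -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -I input.bam \
  -O output.vcf.gz \
  -L chr1:233443225-233443225 \
  -A FragmentDepthPerAlleleBySample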

cwhelan avatar Jul 26 '22 13:07 cwhelan