Mutect2: drop in SNV specificity/precision between v.4.1.8.1 and 4.1.9.0
We noticed a strong drop in precision for SNVs (~10% for tumor-normal mode, ~20% in tumor-only mode) between releases 4.1.7.0 and 4.2.6.0. With more testing using the HCC1395 somatic benchmark (https://pubmed.ncbi.nlm.nih.gov/34504347/) and sequencing data provided by the Somatic Mutation Working Group (Fudan University WES tumor-normal data set, 2 x 100x coverage), the drop in performance can be traced to changes between 4.1.8.1 and 4.1.9.0. Here are the performance metrics for selected GATK releases:
The calling was done with essentially default parameters:
tools/gatk-${version}/gatk Mutect2 --normal-sample WES_FD_N --output $outvcf --intervals $wesbed --interval-padding 0 --input $inbam_t --input $inbam_n --reference $ref
tools/gatk-${version}/gatk FilterMutectCalls --output ${outvcf%.vcf}_filtered.vcf --variant $outvcf --intervals $wesbed --reference $ref --stats ${outvcf}.stats --threshold-strategy OPTIMAL_F_SCORE --f-score-beta 1.0
som.py was used for calculating performance metrics.
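For reference, a minimal som.py invocation along these lines (som.py is part of Illumina's hap.py suite; file names here are placeholders, flag spellings per the hap.py documentation):

# Sketch: compare filtered Mutect2 calls against the truth set; --keep-scratch
# retains intermediates such as fp.vcf.gz, used for the FP triage further below
som.py truth.vcf.gz ${outvcf%.vcf}_filtered.vcf \
    --reference $ref \
    --false-positives high_confidence_regions.bed \
    --keep-scratch \
    -o som_py_out/WES_FD_TN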
Curiously, we do not observe such a substantial drop in precision in WGS data, in either tumor-only or tumor-normal mode. In the following, our "v04" corresponds to GATK 4.1.7.0 and our "v05" corresponds to GATK 4.2.6.0:
Tumor-normal:
Tumor-only:
In my opinion, the small gains in recall between 4.1.8.1 and 4.1.9.0 do not justify the drop in precision. This, together with the fact that WES data is affected while WGS data is not, suggests that this might be a bug rather than a feature.
Any suggestions on how to get the WES precision back to v4.1.8.1 levels are appreciated.
Thanks in advance,
Dmitriy
From what I have heard from Dmitriy, this loss in Mutect2 precision between v.4.1.8.1 and v.4.1.9.0 seems to be perfectly reproducible using default parameters on the HCC1395 somatic benchmark gold standard from the Sequencing Quality Control Phase II Consortium. The drop in performance still seems to persist in the current Mutect2 version.
If true, then this could have had dramatic effects on the quality of Mutect2 clinical variant calling from v.4.1.9.0 until now. Isolating the root cause of this performance drop therefore seems highly important for maintaining the clinical validity of Mutect2.
It seems that quite a number of Mutect2 code details changed between v.4.1.8.1 and v.4.1.9.0, for instance this commit concerning quality filtering, or this one concerning soft-clipped bases. Perhaps the circumstance that the drop in precision affects WES but not WGS data may point to the culprit?
Also, may I ask whether the GATK team runs real-world regression tests on gold-standard variant call sets between version releases, and whether these tests passed between v.4.1.8.1 and v.4.1.9.0?
Glad to see we're not the only ones noticing this. We have been using 4.0.11.x for quite a while with success. During a recent revalidation effort, we have been working with 4.2.x versions and missing quite a few calls we consider "truth", while other variant callers do make these calls. We have since verified that v4.1.8.0 performs acceptably, so our data supports the issue occurring with the 4.1.9.0 update.
Same here, @jhl667 - good to get additional confirmation. May I ask whether you used the same gold-standard call set or another one, and whether yours was WGS or WES?
I’m linking @droazen here as well since he acted as the release manager of the affected releases. I think it would be good to get clarity soon, since probably tens of thousands of clinical samples have been processed with the affected versions in the last two years, and a reduction in precision of 10-20% (absolute) may have been relevant for clinical decisions in at least some of these cases.
Also, I know how hard it is to avoid such things in what is essentially research software (and such a great one to boot), so this is not about blaming anyone but about fixing the root cause as soon as possible. If it turns out that the issue only affects this particular reference call set and no patients (for some arcane reason), then all the better in my view.
@schelhorn We have a large clinical "truth" set we utilize during workflow validations. We also utilize spike-in samples from SeraCare and perform dilutions using a couple of the common Coriell cell lines. We noticed the Mutect2 calling inconsistency while validating a small targeted panel.
To clarify our experimental setup: for the benchmark we are using the latest release (v.1.2) somatic "ground truth" of the HCC1395 cell line from
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/release/latest/
It contains ~40k SNVs and ~2k INDELs in ~2.4Gb high-confidence regions. In high-confidence regions intersected with WES bed, we still have ~1.1k SNVs and ~100 INDELs.
Therefore, even in the WES analysis scenario, the SNV counts should be high enough to draw reliable conclusions when comparing performance between different callers and releases. For WES INDELs, the counts are indeed rather low and results should be interpreted with caution.
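For illustration, a sketch of how such counts can be derived (bedtools/bcftools as the tooling; file names are placeholders, and truth.vcf.gz is assumed to be tabix-indexed):

# Restrict the high-confidence regions to the WES capture targets
bedtools intersect -a high_confidence_regions.bed -b $wesbed > hc_wes.bed

# Count truth SNVs and INDELs that fall inside the intersected regions
bcftools view -H -R hc_wes.bed -v snps truth.vcf.gz | wc -l
bcftools view -H -R hc_wes.bed -v indels truth.vcf.gz | wc -l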
This issue is very important to me as well. When I deal with CAP data (high-coverage capture cfDNA data), GATK 4.1.8.0's performance is much better than GATK 4.2 ...
Thanks for the report! I am contacting the lead developer of Mutect2, who will reply here.
Hi @droazen and @davidbenjamin, could you please give us an estimate about when you will be able to look at this issue in more detail to identify the root cause? Thank you.
@davidbenjamin is away at the moment, but should be able to look into this issue when he returns next week
Hi @droazen, is there any news on the matter yet? It's been a month, and we are wondering when this will be looked into. Thanks!
@schelhorn @davidbenjamin has been out for most of this past month, but he's promised to look into this report as soon as he can find the time. I anticipate an initial reply from him next week. Thanks for your patience.
I'll add that much of our recent development work on the somatic variation front has gone into a new ML-based tool we're calling "Mutect3" (but which may end up with a different name once it's released). This new tool is already giving much better results than Mutect2, but is not quite ready for prime time. I expect it to come out this year, however. A side effect of the focus on "Mutect3" development work has been that some reports of issues in Mutect2 have taken longer than usual to address, unfortunately.
@ddrichel @jhl667 @schelhorn @xiucz I have a theory that, depending on the answers to the following questions, I may pursue further.
- What are you using for a panel of normals in both WES and WGS?
- What reference assembly are you using?
- Were the new false positives present in the unfiltered Mutect2 VCF before version 4.1.9.0 (and then flagged by FilterMutectCalls) or absent entirely?
Hi David,
- Here, we must distinguish between the WES minimal example (default parameters, command lines in my original bug report) and our "production pipeline".
In the minimal example, we do not use a PON.
In the production pipeline, we use the exome PON file from gs://gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz, as recommended here: https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON-. In the production pipeline, we also use gnomAD 3.1.2, filtered for AF>0 and FILTER=PASS, as the "germline resource".
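For concreteness, a sketch of how these resources enter the production call (--panel-of-normals and --germline-resource are standard Mutect2 arguments; the bcftools pre-filtering shown is one possible way to implement the AF>0/PASS filter):

# Filter gnomAD 3.1.2 down to PASS sites with AF>0 for the germline resource
bcftools view -f PASS -i 'INFO/AF>0' gnomad.genomes.v3.1.2.sites.vcf.gz -Oz -o gnomad_af_pass.vcf.gz
bcftools index -t gnomad_af_pass.vcf.gz

# Same call as the minimal example above, plus PON and germline resource
gatk Mutect2 --normal-sample WES_FD_N --output $outvcf --intervals $wesbed \
    --input $inbam_t --input $inbam_n --reference $ref \
    --panel-of-normals 1000g_pon.hg38.vcf.gz \
    --germline-resource gnomad_af_pass.vcf.gz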
Some more data from the production pipeline, just in case it could help. Since the PON is exome-only, we were not sure whether including the PON would improve the performance of our WGS workflow in the "production pipeline" with gatk version 4.1.7.0, so we ran experiments:
- calling without germline resource and without PON
- calling with germline resource only, without PON
- calling with both germline resource and PON
The results:
We also conducted experiments with a subset of the WES FD sample (FD_1, 1/3 of the full ~100x FD dataset):
- calling without germline resource and without PON
- calling with PON only, without germline resource
- calling with germline resource only, without PON
- calling with both germline resource and PON
The results (please note that the labeling conventions are now different compared to WGS experiments, apologies for the inconvenience):
- The reference is GRCh38.primary_assembly.genome.fa. The benchmark ("gold standard") call set contains only variants on chr1-chr22, which AFAIK are identical or almost identical between the different b38 versions.
- In my minimal WES example, most of the new FPs (150/158) from v4.1.9.0 are not present in the raw, unfiltered calls from v.4.1.8.1.
This was found in the following way. Compare FPs by CHROM, POS, REF, ALT using the scratch output of som.py (run with --keep-scratch):
comm -23 <(bcftools query -f "%CHROM %POS %REF %ALT\n" WES_FD_TN_4190_filter_som_py/fp.vcf.gz | sort) <(bcftools query -f "%CHROM %POS %REF %ALT\n" WES_FD_TN_4181_filter_som_py/fp.vcf.gz | sort) | sed 's/ /\t/g' > new_FPs.txt
wc -l new_FPs.txt
158 new_FPs.txt
Find new FPs present in the raw output of v.4.1.8.1, again matching by CHROM, POS, REF, ALT:
awk 'NR==FNR{a[$1"_"$2"_"$3"_"$4]=1; next;}{if(substr($0,1,1)!="#" && a[$1"_"$2"_"$4"_"$5]==1) print $0}' new_FPs.txt WES_FD_TN_4181.vcf | wc -l
8
For the 8 new FPs that were detected but filtered out in v.4.1.8.1 (all of which are SNVs), find their FILTER classification:
awk 'NR==FNR{a[$1"_"$2"_"$3"_"$4]=1; next;}{if(substr($0,1,1)!="#" && a[$1"_"$2"_"$4"_"$5]==1) print $0}' new_FPs.txt WES_FD_TN_4181_filtered.vcf | cut -f7 | sort | uniq -c
2 clustered_events
1 normal_artifact
1 strand_bias
4 weak_evidence
Hope this helps,
Dmitriy
Thank you @ddrichel! That's really helpful and seems to support my suspicion.
The most relevant changes to Mutect2 between versions 4.1.8.1 and 4.1.9.0 are #6520, #6696, and #6821, which have in common the property that they are bug fixes restoring sensitivity in edge cases. I don't think that any of these are regressions because the job of Mutect2 is to output any candidate variant. Filtering is the responsibility of FilterMutectCalls. Because our panel of normals is generated by running Mutect2, a panel of normals generated before 4.1.9.0 is blind to artifacts that occur in these edge cases. I suspect that re-generating our panels of normals using a more recent release of Mutect2 will solve the problem.
I'm going to need to see how long it will take to gather the samples to re-generate our hg38 panel of normals (it might not take long at all) and triage that against the timetable for Mutect3, which does away with the panel of normals entirely.
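For context, regenerating a panel would follow the documented three-step Mutect2 PON workflow; a sketch with hypothetical sample names:

# Step 1: call each normal sample individually; per the GATK PON documentation,
# --max-mnp-distance 0 is required so that the panel contains no MNPs
for s in normal1 normal2 normal3; do
    gatk Mutect2 -R $ref -I ${s}.bam --max-mnp-distance 0 -O ${s}.vcf.gz
done

# Step 2: consolidate the per-sample calls into a GenomicsDB workspace
gatk GenomicsDBImport -R $ref -L intervals.interval_list \
    --genomicsdb-workspace-path pon_db \
    -V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz

# Step 3: build the panel from sites recurring across the normals
gatk CreateSomaticPanelOfNormals -R $ref -V gendb://pon_db -O pon.vcf.gz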
Excellent, thank you @davidbenjamin and @ddrichel!
- Since it has been three weeks, is there any news on whether the regenerated PoN works, and where it will be available from?
- Also, since there are likely many people (some of whom have spoken up in this thread) who have been using the affected releases throughout the years, possibly affecting patients' diagnoses, does the Broad have a communication plan to notify Mutect2 users of this serious performance degradation in the specific cases you outlined?
- Last, are there plans to incorporate fully functional benchmark regression tests such as @ddrichel's into the Mutect2 (and later Mutect3) CI pipeline, or at least into the build process for each version that is released to the public?
- What are you using for a panel of normals in both WES and WGS?
- What reference assembly are you using?
- Were the new false positives present in the unfiltered Mutect2 VCF before version 4.1.9.0 (and then flagged by FilterMutectCalls) or absent entirely?
Hi, my apologies for not answering these questions sooner. We are utilizing a panel of normals that was created in house from about 40 samples. The reference assembly is still hg19. In 4.1.9.0, we noticed multiple high-confidence variants that were no longer being called using equivalent parameter sets. So in our case we are dealing with a false negative problem.
Hi @davidbenjamin,
may I ask whether there has been any progress on addressing the bug and/or generating the new PoN? Thank you.
Sorry for not letting this go, but this is a severe bug with repercussions for clinical diagnostics that still has not been communicated to other users, nor fixed, for more than three months.
@droazen, is there anything else that we could do to hasten this along? Is there any workaround other than downgrading Mutect2 to the version from more than two years ago?
I am going to be able to identify samples for a new panel of normals this week, after which generating and validating the panel will take another few days.
Excellent news, thank you @davidbenjamin!
Happy New Year, @davidbenjamin! Coming back to your comment from October, did you perhaps have a chance to look at the new panel of normals and whether it solves the issue? It has been almost exactly half a year now since we reported the bug, and given the severity of the issue for clinical applications it would be important to get a better understanding of whether the updated PON fixes it. Thanks a bunch!
@schelhorn We also develop clinical workflows, and came to the conclusion we would hard-stop at 4.1.8.0 and wait for Mutect3 to be ready for prime time. My only comment is that it would be strange to me if a PON update solved this, considering we use a custom in-house PON...
@jhl667, yep, I understand. Still, this issue is too big to be left alone. I think the Broad has to act here, since patients' lives are at stake and we haven't received any actionable response for half a year. Mutect2 has a good reputation, and the Broad profits from that, but it is also medical software and should be treated as such. Mutect2 will continue to be used for some time, and this has to be fixed.
@droazen, is there anything else you people can do? Is the wider GATK dev team aware of the issue? Is this something I should escalate to someone else so that you get support from your management to fix it? We have pharma collabs with the Broad in place, I could try doing it that way, or via the genomics/pharma community. I'm not trying to be threatening or anything, just thinking out loud how we can help you to get the resources to solve this.
@schelhorn Sorry for the long wait on this issue, which we do take seriously! Unfortunately as mentioned above we have extremely limited resources for Mutect2 maintenance at the moment due to our focus on Mutect3, which we see as the long-term solution to this and other issues. You should definitely continue to run 4.1.8.0 until this is resolved, if at all possible. Having said that, we are generating a new M2 PON now, and should know soon whether it resolves this issue as @davidbenjamin hopes. Stay tuned for an update on the outcome of this evaluation within the next week or two.
@droazen +1 for being affected by this issue in production. As this is in production (same as @schelhorn , with big pharma which have very strict security requirements), and as 4.1.8.0 contains critical security vulnerabilities that were mitigated in subsequent releases, we are in a serious pickle here.
@jhl667 how did you conclude that 4.1.8.0 performs better than newer versions? What we see is that it emits more variants, but after filtering and intersecting with other callers (e.g. Strelka), we get more variants and a "better" result (we can't really define "better" - it's merely an observation) with 4.2.4.1.
@ury the performance evaluation is based on the HCC1395 benchmark, as described in my original report (see above). More variants in newer versions is in line with our analysis, but the new variants are likely to be mostly false positives.
@ddrichel in our case it seems to be the reverse: 4.1.8.0 produces many more raw variants than 4.2.4.1, but after filtering and intersection, 4.2.4.1 yields more.
Update: @davidbenjamin has generated a new panel of normals, and is in the process of running some analysis on it.
Okay, I have a new panel for hg38 here: gs://broad-dsde-methods-davidben/mutect2-2023-panel-of-normals/mutect2-hg38-pon.vcf.gz
It has all the variants of the old panel, plus more that arose in more recent versions of Mutect2. It is also generally somewhat more conservative, with a greater bias toward precision than the previous one.
Use this panel at your own risk. I can vouch that it doesn't wreck the results of our own validations, but I do not have time to vet it thoroughly enough to put it in the best-practices Google bucket. Likewise, I cannot promise that it will improve specificity in any particular set of samples.
Within several months I hope we are all running the next version of Mutect and never need to see a panel of normals again.
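For anyone who wants to try the new panel, a minimal sketch (substitute your own inputs; an index may need to be created locally if none is shipped alongside the panel):

# Fetch the new panel from the public bucket and index it
gsutil cp gs://broad-dsde-methods-davidben/mutect2-2023-panel-of-normals/mutect2-hg38-pon.vcf.gz .
gatk IndexFeatureFile -I mutect2-hg38-pon.vcf.gz

# Re-run calling with the regenerated panel swapped in
gatk Mutect2 --normal-sample WES_FD_N --output $outvcf --intervals $wesbed \
    --input $inbam_t --input $inbam_n --reference $ref \
    --panel-of-normals mutect2-hg38-pon.vcf.gz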
@davidbenjamin @droazen unfortunately the new PON does not make up for the precision loss introduced in v4.1.9.0. In v4.4.0.0 we get just 2 fewer FP SNVs in our performance evaluation compared to the old PON. Benchmarking results in WES tumor-normal mode on the HCC1395 benchmark, with:
- v4.1.8.1 (last release with high SNV precision), v4.1.9.0 (first release affected by precision drop), v4.4.0.0 (current release)
- oldPON: 1000g_pon.hg38.vcf.gz, newPON: mutect2-hg38-pon.vcf.gz
Any chance to get this issue fixed? With Mutect3 not being available and v4.1.8.1 being affected by the log4j vulnerability, it is quite regrettable to be stuck with inferior precision.
Extended methods, code, and data to reproduce the issue are here: https://github.com/ddrichel/Mutect2_calling_performance_bug