gatk
gatk copied to clipboard
reproduction problem of HaplotypeCaller when different interval split
Hi.
The VCF output file of HaplotypeCaller is not same when different interval split.
It seems that smaller region will product more var into vcf file. split1:chr unit 5,089,533 var in vcf split2: gatk4.SplitIntervals.sh(500 unit) 5,142,322 var in vcf
fastq files HiSeqX-PCR-free-v2.5-NA12878 70x from BaseSpace.
gatk version:4.0.3.0
By the way, there are some many diff site just in QD value. such as QD=28.13 vs QD=25.36, and other attr are same.
Reason: download sample without a fixed seed for every site? or https://github.com/broadinstitute/gatk/issues/4614? or others?
Best Regards
Please post the exact command you ran @wangyugui. Also, we ask that these types of reports first go through the forum at https://gatkforums.broadinstitute.org/gatk/discussions. Thanks for reporting.
java -jar /usr/hpc-bio/gatk/gatk-package-4.0.3.0-local.jar HaplotypeCaller -L /biowrk/BaseSpace/vcf.NA12878/vcf.gatk4.bwa.HiSeqX-PCR-free-v2.5-70x/tmp_dir/0000-scattered.intervals -O /biowrk/BaseSpace/vcf.NA12878/vcf.gatk4.bwa.HiSeqX-PCR-free-v2.5-70x/tmp_dir/gatk.1.vcf --arguments_file /biowrk/BaseSpace/vcf.NA12878/vcf.gatk4.bwa.HiSeqX-PCR-free-v2.5-70x/gatk.arguments_file.txt --verbosity WARNING --reference /usr/bio-ref/GRCh38.p0.dnaref/dnaref.fa --dbsnp /usr/bio-ref/GRCh38.p0.dnaref/dbsnp.vcf --max-reads-per-alignment-start 100
# cat /biowrk/BaseSpace/vcf.NA12878/vcf.gatk4.bwa.HiSeqX-PCR-free-v2.5-70x/gatk.arguments_file.txt
-I /biowrk/BaseSpace/bam.bwa/HiSeqX-PCR-free-v2.5-NA12878.70x/md.bam
https://github.com/broadinstitute/gatk/blob/4.2.6.1/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/QualByDepth.java#L79
The calculation formula of QD is qual / depth When QD exceeds 35, QualByDepth will adjust QD to a random number of 30±3
https://github.com/broadinstitute/gatk/blob/4.2.6.1/src/main/java/org/broadinstitute/hellbender/utils/Utils.java#L52
Although GATK uses a fixed random seed, Haplotypecaller will have reproducible results under the same input file and command. However, if the input file is divided into different parts, different random numbers will be obtained.
This issue even leads to the use of
gatk VariantAnnotator \
-V haplotypecaller_result.vcf \
-o recalcuate_qd.vcf \
-A QualByDepth
the result obtained will be different from the result of Haplotypecaller.
@davidbenjamin Will this issue be fixed?