gatk icon indicating copy to clipboard operation
gatk copied to clipboard

reproduction problem of HaplotypeCaller when different interval split

Open wangyugui opened this issue 6 years ago • 3 comments

Hi.

The VCF output file of HaplotypeCaller is not same when different interval split.

It seems that smaller region will product more var into vcf file. split1:chr unit 5,089,533 var in vcf split2: gatk4.SplitIntervals.sh(500 unit) 5,142,322 var in vcf

fastq files HiSeqX-PCR-free-v2.5-NA12878 70x from BaseSpace.

gatk version:4.0.3.0

By the way, there are some many diff site just in QD value. such as QD=28.13 vs QD=25.36, and other attr are same.

Reason: download sample without a fixed seed for every site? or https://github.com/broadinstitute/gatk/issues/4614? or others?

Best Regards

wangyugui avatar Mar 30 '18 08:03 wangyugui

Please post the exact command you ran @wangyugui. Also, we ask that these types of reports first go through the forum at https://gatkforums.broadinstitute.org/gatk/discussions. Thanks for reporting.

sooheelee avatar Mar 30 '18 13:03 sooheelee

java -jar /usr/hpc-bio/gatk/gatk-package-4.0.3.0-local.jar HaplotypeCaller -L /biowrk/BaseSpace/vcf.NA12878/vcf.gatk4.bwa.HiSeqX-PCR-free-v2.5-70x/tmp_dir/0000-scattered.intervals -O /biowrk/BaseSpace/vcf.NA12878/vcf.gatk4.bwa.HiSeqX-PCR-free-v2.5-70x/tmp_dir/gatk.1.vcf --arguments_file /biowrk/BaseSpace/vcf.NA12878/vcf.gatk4.bwa.HiSeqX-PCR-free-v2.5-70x/gatk.arguments_file.txt --verbosity WARNING --reference /usr/bio-ref/GRCh38.p0.dnaref/dnaref.fa --dbsnp /usr/bio-ref/GRCh38.p0.dnaref/dbsnp.vcf --max-reads-per-alignment-start 100

# cat /biowrk/BaseSpace/vcf.NA12878/vcf.gatk4.bwa.HiSeqX-PCR-free-v2.5-70x/gatk.arguments_file.txt
-I /biowrk/BaseSpace/bam.bwa/HiSeqX-PCR-free-v2.5-NA12878.70x/md.bam

wangyugui avatar Mar 30 '18 15:03 wangyugui

https://github.com/broadinstitute/gatk/blob/4.2.6.1/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/QualByDepth.java#L79

The calculation formula of QD is qual / depth When QD exceeds 35, QualByDepth will adjust QD to a random number of 30±3

https://github.com/broadinstitute/gatk/blob/4.2.6.1/src/main/java/org/broadinstitute/hellbender/utils/Utils.java#L52

Although GATK uses a fixed random seed, Haplotypecaller will have reproducible results under the same input file and command. However, if the input file is divided into different parts, different random numbers will be obtained.

This issue even leads to the use of

gatk VariantAnnotator \
    -V haplotypecaller_result.vcf \
    -o recalcuate_qd.vcf \
    -A QualByDepth

the result obtained will be different from the result of Haplotypecaller.

@davidbenjamin Will this issue be fixed?

fo40225 avatar Aug 23 '22 10:08 fo40225