
Problem in gvcf interplay between clair3 and gatk

Open johannesgeibel opened this issue 10 months ago • 1 comments

I noticed an issue when feeding Clair3 gVCF files into GATK's GenotypeGVCFs for cohort calling. It became obvious at a SNP that is likely located on one copy of a large duplication (shown by increased coverage of the region and inconsistent haplotypes) and therefore has a biased allele ratio, but that clearly carries more than 8% T, which would be the cutoff:

image

The Clair3 gVCF record for this SNP looks as follows:

10      23761033        .       G       <NON_REF>       0       .       END=23761033    GT:GQ:MIN_DP:PL ./.:0:68:245,0,1385

Clair3 assigns a GQ of zero to that SNP, which is what I would have done as well. However, the PL values still make the heterozygous genotype more likely than either homozygous state. Because the FORMAT field is written as for a site without a possible alternative allele, this information is lost for GATK. For comparison, here is a PASS variant:

10      23761040        .       T       A,<NON_REF>     19.22   PASS    F       GT:GQ:DP:AD:AF:PL  0/1:19:68:27,38,0:0.5588:28,0,45,990,990,990
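To make the PL argument concrete, here is a minimal Python sketch that reads the two records above and reports which genotype the PL values favour. The helper names (`parse_sample`, `genotype_for_pl_index`, `summarize_pl`) are hypothetical and not part of Clair3 or GATK. The PL index ordering follows the VCF convention for diploid genotypes (index = k*(k+1)/2 + j for genotype j/k). The `pl_gap` value is the difference between the two smallest PLs; as far as I understand, GATK derives its GQ from this gap (capped at 99) rather than copying the input GQ, but treat that as an assumption.

```python
REF_BLOCK = "10 23761033 . G <NON_REF> 0 . END=23761033 GT:GQ:MIN_DP:PL ./.:0:68:245,0,1385"
PASS_SITE = "10 23761040 . T A,<NON_REF> 19.22 PASS F GT:GQ:DP:AD:AF:PL 0/1:19:68:27,38,0:0.5588:28,0,45,990,990,990"


def parse_sample(format_field, sample_field):
    """Map FORMAT keys to the values of a single sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def genotype_for_pl_index(index):
    """Return allele indices (j, k) for a diploid PL index (VCF order k*(k+1)/2 + j)."""
    k = 0
    while (k + 1) * (k + 2) // 2 <= index:
        k += 1
    j = index - k * (k + 1) // 2
    return j, k


def summarize_pl(record_line):
    """Compare the reported GT/GQ of a single-sample record with what its PL implies."""
    fields = record_line.split()
    alleles = [fields[3]] + fields[4].split(",")
    sample = parse_sample(fields[8], fields[9])
    pls = [int(value) for value in sample["PL"].split(",")]
    best_index = min(range(len(pls)), key=pls.__getitem__)
    j, k = genotype_for_pl_index(best_index)
    ranked = sorted(pls)
    return {
        "reported_gt": sample["GT"],
        "reported_gq": int(sample["GQ"]),
        "pl_best_gt": f"{alleles[j]}/{alleles[k]}",
        "pl_gap": ranked[1] - ranked[0],  # assumed basis of a PL-derived GQ
    }


for line in (REF_BLOCK, PASS_SITE):
    print(summarize_pl(line))
```

For the first record, the reported genotype is `./.` with GQ 0, while the PL values alone favour a heterozygous genotype with a large gap to the next best state. That mismatch is what a downstream tool working only from the PLs would see.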

GATK then does something strange. It does not respect the GQ=0 from Clair3 but assigns 99, probably because the variant was discovered in other samples as well. Additionally, it does not have complete AD information, which results in a 0/1 call without any coverage for the alternative allele:

10      23761033        .       G       T       1093.24 .       [...]   GT:AD:AF:DP:GQ:PL 0/1:68,0:.:68:99:245,0,1385 [...]
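Calls like this can be found in the joint-called VCF by checking whether every non-reference allele in GT has read support in AD. The following is a minimal Python sketch using only the FORMAT and sample columns shown above; the function name `unsupported_alleles` is hypothetical, and the elided parts of the record are deliberately not reconstructed.

```python
import re


def unsupported_alleles(format_field, sample_field):
    """Return called ALT allele indices whose AD is zero or missing."""
    sample = dict(zip(format_field.split(":"), sample_field.split(":")))
    # Allele indices in GT, ignoring missing alleles (".").
    called = {int(a) for a in re.split(r"[/|]", sample["GT"]) if a != "."}
    # Treat a missing depth (".") as zero reads.
    depths = [0 if d == "." else int(d) for d in sample["AD"].split(",")]
    return sorted(
        a for a in called
        if a > 0 and (a >= len(depths) or depths[a] == 0)
    )


# Sample column of the GenotypeGVCFs output for 10:23761033
print(unsupported_alleles("GT:AD:AF:DP:GQ:PL", "0/1:68,0:.:68:99:245,0,1385"))

# Sample column of the Clair3 PASS variant at 10:23761040, for comparison
print(unsupported_alleles("GT:GQ:DP:AD:AF:PL", "0/1:19:68:27,38,0:0.5588:28,0,45,990,990,990"))
```

A non-empty result marks a genotype that carries an alternative allele without any reads assigned to it, which is the pattern described here.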

I guess solving the problem needs input from both tools, Clair3 and GATK. If I understood the issue correctly, Clair3 should output AD information for more sites, and GATK should respect the GQ information from Clair3 in this case.

I posted the issue on the Clair3 page as well: https://github.com/HKU-BAL/Clair3/issues/354.

Thanks, Johannes

johannesgeibel • Jan 06 '25 15:01