Minimac4
Minimac4 copied to clipboard
Recalculating Rsq Imputation Quality Values
Hi,
First of all, thanks a lot for developing Minimac4, it is a great tool.
I have an issue regarding recalculating (and replicating) Rsq (imputation quality) values. I am interested in this because I want to recalculate/update Rsq values per variant for a subset of samples in a new VCF.
As far as I know, the formula should be: https://genome.sph.umich.edu/wiki/Minimac3_Info_File#Rsq and the actual code is this I think: https://github.com/statgen/Minimac4/blob/8c1641e43b63d005e3a6d12901d29ccb5ad697b1/src/ImputationStatistics.cpp (the part that starts with "double ImputationStatistics::Rsq(int marker)").
I recently started using Minimac4 available on Michigan Imputation Server and realized that different than before, now they also output HDS (Estimated Haploid Alternate Allele Dosage) values. I made use of these values, and come up with below bcftools/awk one-liner to calculate Rsq values:
bcftools query -f "%ID\t%AF\t%R2[\t%HDS]\n" chr1.dose.vcf.gz | sed 's/,/\t/g' | awk '{diffSumSq = "" ; for(i=4; i<=NF; i++) diffSumSq += ($i - ($2 + 1e-30))^2 ; printf $1"\t"$2"\t"$3"\t""%.5f\n",(diffSumSq/(NF-3)/(($2 + 1e-30)*(1 - ($2 + 1e-30))))}'
Here I first extract these information from the VCF: ID, AF and R2 (from INFO field) and HDS values from sample fields. Then I convert comma-separated HDS values into tab-separated values, so for instance if I have 1000 samples (and 1000 HDS values), now I have 2000 tab-separated HDS values for the whole cohort.
Then with awk, I substract AF from each HDS values and take the square of this, followed by summing all these values ("diffSumSq"), then I divide this by number of HDS values (2n) times AF x (1-AF). To avoid problems of AF being 0, I add 1e-30 to AF values as shown in ImputationStatistics.cpp.
I follow a very similar approach in R too and the resulting new Rsq values are the same as in the bcftools/awk approach.
When done on the same samples, I also replicate the original Rsq values (or I reach to a number that is very close to the original Rsq), however when AF is low, I am obtaining very different Rsq values than the originals. This is probably caused by p x (1-p), that is creating a very small number.
In short, is there a way to recalculate Rsq values correctly on a subset of samples? Am I doing a mistake here?
Thanks in advance!
Aaahhh, if I understand correctly, you are saying that for rare variants you are not being able to replicate the Rsq values that you see in the VCF file. That's mainly because internally when we calculate the Rsq from the dosages, those dosages are floating point numbers so they have higher precision. When you are calculating from the VCF file, you are calculating them based on rounded off (to 3 decimal places) numbers, hence at lower frequencies, the discrepancy comes up. I haven't verified your formula, but if your Rsq matches with the minimac Rsq at ALL common variants, then it is surely correct.
We can think about adding a parameter to control the precision output in minimac4 dosage files. But generally we have found that for association test purposes, information beyond the 3rd place of decimal doesn't really help much. Only adds to the size of the file. Please let me know if this makes sense and if you have any other questions.
Regards, Sayantan Das,
23andMe
On Fri, Nov 8, 2019 at 8:05 AM Fahri Küçükali [email protected] wrote:
Hi,
First of all, thanks a lot for developing Minimac4, it is a great tool.
I have an issue regarding recalculating (and replicating) Rsq (imputation quality) values. I am interested in this because I want to recalculate/update Rsq values per variant for a subset of samples in a new VCF.
As far as I know, the formula should be: https://genome.sph.umich.edu/wiki/Minimac3_Info_File#Rsq and the actual code is this I think: https://github.com/statgen/Minimac4/blob/8c1641e43b63d005e3a6d12901d29ccb5ad697b1/src/ImputationStatistics.cpp (the part that starts with "double ImputationStatistics::Rsq(int marker)").
I recently started using Minimac4 available on Michigan Imputation Server and realized that different than before, now they also output HDS (Estimated Haploid Alternate Allele Dosage) values. I made use of these values, and come up with below bcftools/awk one-liner to calculate Rsq values:
bcftools query -f "%ID\t%AF\t%R2[\t%HDS]\n" chr1.dose.vcf.gz | sed 's/,/\t/g' | awk '{diffSumSq = "" ; for(i=4; i<=NF; i++) diffSumSq += ($i - ($2 + 1e-30))^2 ; printf $1"\t"$2"\t"$3"\t""%.5f\n",(diffSumSq/(NF-3)/(($2
- 1e-30)*(1 - ($2 + 1e-30))))}'
Here I first extract these information from the VCF: ID, AF and R2 (from INFO field) and HDS values from sample fields. Then I convert comma-separated HDS values into tab-separated values, so for instance if I have 1000 samples (and 1000 HDS values), now I have 2000 tab-separated HDS values for the whole cohort.
Then with awk, I substract AF from each HDS values and take the square of this, followed by summing all these values ("diffSumSq"), then I divide this by number of HDS values (2n) times AF x (1-AF). To avoid problems of AF being 0, I add 1e-30 to AF values as shown in ImputationStatistics.cpp.
I follow a very similar approach in R too and the resulting new Rsq values are the same as in the bcftools/awk approach.
When done on the same samples, I also replicate the original Rsq values (or I reach to a number that is very close to the original Rsq), however when AF is low, I am obtaining very different Rsq values than the originals. This is probably caused by p x (1-p), that is creating a very small number.
In short, is there a way to recalculate Rsq values correctly on a subset of samples? Am I doing a mistake here?
Thanks in advance!
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/statgen/Minimac4/issues/26?email_source=notifications&email_token=AB5YQCB5K7EIEZRSGQB5WP3QSWE45A5CNFSM4JKZUHDKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HYABRNQ, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5YQCHUIPSEIRV7ACHHYM3QSWE45ANCNFSM4JKZUHDA .
Hi Sayathan, thanks a lot for your answer, it is very clear!
Indeed, mainly for rare variants I couldn't replicate the Rsq values I see in the imputation output VCFs. And yes, I had to calculate them based on rounded off (3 decimal places) values I see in the VCF. Speaking of this, just to confirm, for Rsq calculation you use HDS and not DS, right?
At a first glance, for chr22 imputation (n=519190 variants) results of an example n=~3000 samples cohort, I have obtained the below stats between my Rsq and the original Rsq values:
For AF > 0.01 variants, the biggest difference is 0.03232 (occurs only for 1 variant, and the second and the third biggest differences are 0.00118 and 0.00071), the smallest difference is 0. Median difference is as small as 7e-05 and it looks very good overall.
For 0 < AF < 0.01 variants, the max difference is 0.04573. Median difference is 3e-05. It is still good I think.
For AF = 0 variants, I have terrible Rsq scores that are as big as 1e+23 and they are all > 1e+20.
So indeed, the rarer it is, the more problematic it is. Actually for AF=0 variants I have very inflated R2 values as you can see (even bigger than >1). How do you avoid this? Because in the original Rsq formula it is indicated that the divisor is actually p x (1-p) where p is being the alternate allele frequency; and if p is very rare (let's say 1e-5) the divisor will be extremely small, resulting in a highly inflated Rsq score. Isn't it the case?
Yes, ideally the R2 calculation needs to be done on HDS and not DS. But, if you assume HWE, then the values based on HDS and DS are similar.
So, the issue is mainly with rounding off dosages for low MAF variants and not the low AF itself. For e.g. if you p is small like 1e-5, then your numerator will be even smaller. In other words, you can mathematically show that the formula should always give you less than 1 value. One easy way to work around would be use estimate the allele frequency based on the data ? That way your R2 values might be slightly off than those from minimac4, but they should at least be less than 1 always.
Regards, Sayantan Das,
23andMe
On Fri, Nov 8, 2019 at 10:38 AM Fahri Küçükali [email protected] wrote:
Hi Sayathan, thanks a lot for your answer, it is very clear!
Indeed, mainly for rare variants I couldn't replicate the Rsq values I see in the imputation output VCFs. And yes, I had to calculate them based on rounded off (3 decimal places) values I see in the VCF. Speaking of this, just to confirm, for Rsq calculation you use HDS and not DS, right?
At a first glance, for chr22 imputation (n=519190 variants) results of an example n=~3000 samples cohort, I have obtained the below stats between my Rsq and the original Rsq values:
For AF > 0.01 variants, the biggest difference is 0.03232 (occurs only for 1 variant, and the second and the third biggest differences are 0.00118 and 0.00071), the smallest difference is 0. Median difference is as small as 7e-05 and it looks very good overall.
For 0 < AF < 0.01 variants, the max difference is 0.04573. Median difference is 3e-05. It is still good I think.
For AF = 0 variants, I have terrible Rsq scores that are as big as 1e+23 and they are all > 1e+20.
So indeed, the rarer it is, the more problematic it is. Actually for AF=0 variants I have very inflated R2 values as you can see (even bigger than
1). How do you avoid this? Because in the original Rsq formula it is indicated that the divisor is actually p x (1-p) where p is being the alternate allele frequency; and if p is very rare (let's say 1e-5) the divisor will be extremely small, resulting in a highly inflated Rsq score. Isn't it the case?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/statgen/Minimac4/issues/26?email_source=notifications&email_token=AB5YQCFE73FISAP7QRKNGBLQSWW33A5CNFSM4JKZUHDKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDS7UQY#issuecomment-551942723, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5YQCBPRXOK4ORHR6H6QXTQSWW33ANCNFSM4JKZUHDA .
Hi,
Can you please tell me how HDS is calculated (the formula) and what exactly estimated haploid allele dosage means? And what do the two comma-separated numbers reported under HDS mean (please refer to the screenshot)? Also, in the attached screenshot it can be seen that the posterior probability is maximum for the heterozygous genotype (0.457). Still, why is the imputed genotype reported as homozygous reference?
Thanks in advance.

HDS is the sum of posterior probabilities (see equation 4 in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5157836/) for templates with alternate alleles for that variant. This paper is for Minimac3. While Minimac4 has minor differences, the basic concept for calculating HDS is the same. Each value of HDS corresponds to a haplotype for an individual. You can think of each value as the probability of the corresponding haplotype having the alternate allele.
The hard-called genotypes are simply rounded HDS values. The GP values are derived from HDS and not used for calling phased GT (since you need haplotype probabilities, instead of genotype probabilities, to call phased GT). GP is optionally generated for convenience in running downstream analyses that require GP. The formulas for GP can be found at https://github.com/statgen/hds-util/blob/master/README.md#diploid.
Estimated haploid allele dosage means that the probability of the haplotype carrying the alternative allele, given the observed (pre-phased) array genotype and the reference panel.
In your screenshot, HDS=0.367,0.337 (previously we present it as 0.367|0.337 to emphasize the phasing status).
• GT: if HDS<0.5 then 0, otherwise 1. Note that in this case both of the probability is <0.5 so the GT=0|0. • DS is the sum of HDS, so DS=0.367+0.337. We preserve 3 digits only, so the underlying HDS might be 0.3674, 0.3373 leading to the sum to be 0.705 instead of 0.704. • GP is also calculated based on HDS. In your case, P(0|0)=0.419, P(0|1 or 1|0)=P(0|1)+P(1|0)=0.457, P(1|1)=0.124. Note that 0.457 represents a probability which does not consider the phasing status, whereas the inputs and outputs of minimac4 are phased. P(0|1)=(1-0.367)0.337<0.417, P(1|0)=0.367(1-0.337)<0.417, so it makes sense that GT is 0|0.
You may refer to the minimac3 paper (doi: 10.1038/ng.3656) for the formula. Briefly, the calculation is based on a hidden Markov model (HMM), where the hidden state represents the haplotype template from the reference panel and the emission state represents the observed genotype. HMM outputs the posterior probability of each reference haplotype being the template at each variant for this target haplotype, and then for each variant the HDS is an average of reference haplotypes weighted by posterior probability. Minimac4 and minimac3 differ by how we reduce the hidden state space, but does not affect the understanding of the basic logic of the imputation method.
-- Ketian Yu, M.S. PhD candidate | Department of Biostatistics University of Michigan, Ann Arbor MI She | Her | Hers On Dec 12, 2022, 11:32 PM -0800, ChakrabortyShreya @.***>, wrote:
Hi, Can you please tell me how HDS is calculated (the formula) and what exactly estimated haploid allele dosage means? And what do the two comma-separated numbers reported under HDS mean (please refer to the screenshot)? Also, in the attached screenshot it can be seen that the posterior probability is maximum for the heterozygous genotype (0.457). Still, why is the imputed genotype reported as homozygous reference? Thanks in advance. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>