Minimac4
Minimac4 copied to clipboard
New minimac4 release v4.1.0 produces vcf.gz which can't be read by DosageConvertor
Previously, we were able to use minimac dosage vcf files with DosageConvertor, to convert output files to a plink dosage format. However, when using the new minimac4 version to produce outputs in vcf.gz format, DosageConvertor can no longer read these files properly and reports the following error : Tag DS does not exist for . (See full trace below)
Here is the example of how I call minimac4.1 and DosageConvertor now (using DosageConvertor built from the latest codebase):
minimac4
chr18.1000g.Phase3.v5.With.Parameter.Estimates.msav
23andme_chr18_eagle.vcf.gz
-o 23andme_chr18_minimac4.dose.vcf.gz
--output-format vcf.gz
DosageConvertor --vcfDos 23andme_chr18_minimac4.dose.vcf.gz
--prefix 23andme_chr18
--type plink
--format 1
--tag DS
Previously, I called minimac like this:
minimac4
--refHaps chr18.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz
--haps 23andme_chr18_eagle.vcf.gz
--prefix 23andme_chr18_minimac4
And here is the full log excerpt from the DosageConvertor part:
Version: 1.0.0;
Built: Mon Oct 17 04:33:09 UTC 2022 by root
Command Line Options:
Input Dosage : --vcfDose [23andme_chr18_minimac4.dose.vcf.gz],
--info [], --tag [DS]
Output Parameters : --prefix [23andme_chr18], --format [1], --type [plink],
--nobgzip
Other Parameters : --buffer [10000], --allDiploid, --trimNames,
--trimLength [100], --SexFile [], --idDelimiter [],
--help, --params
PhoneHome : --noPhoneHome, --phoneHomeThinning [50]
URL = http://genome.sph.umich.edu/wiki/DosageConvertor
GIT = https://github.com/Santy-8128/DosageConvertor
WARNING !!! Parameter --buffer will be ignored when be used with --type "plink" !
------------------------------------------------------------------------------
INPUT VCF DOSAGE FILE
------------------------------------------------------------------------------
Detecting Dosage File Type ...
Format = VCF (Variant Call Format)
Calculating number of Samples in VCF File ...
Calculating number Markers in VCF File ...
Number of Samples read from VCF Dosage File : 39
Number of Markers read from VCF Dosage File : 1333625
Number of Imputed Variants : 1333624
Number of Genotyped and Imputed Variants : 1
Number of Genotyped Only Variants : 0
Loading Dosage Data from VCF File : 23andme_chr18_minimac4.dose.vcf.gz
Reading and Importing Data from VCF File ...
Writing to PLINK output file ...
------------------------------------------------------------------------------
ERROR !!!
------------------------------------------------------------------------------
Tag DS does NOT exist for .
Please verify the input file(s) !
Type --help for usage details
Contact author [[email protected]] if you still need help ...
Program Aborting ...
Is there a problem with the dosage file that the new minimac version creates? I also created an issue in the DosageConvertor codebase in case it is something which needs to be changed there. Thanks for your assistance.
I'm surprised that DosageConverter doesn't support HDS fields (which is now the default format field for Minimac v4.1), but you can get around this by adding -f DS
to your minimac4
command.
Great, adding this flag does let DosageConvertor read the file now!
I have another problem though, the dosage file that minimac4 produces now (with that flag -f DS) does not export the ID field, they are all set as '.' which causes problems for me later in the analysis. Is there a way to force that to be exported - it used to set the ID to chr:pos:alt:ref etc?
Thanks again
Can you elaborate on the use case for your downstream analysis. This information is redundant with what is already in the VCF records. Note: Minimac v4.1 uses the ID column from the reference panel in the imputed results.
Yes of course. Previously, the dosage file produced from minimac 1.0.3:
head -n 20 23andme_chr5_minimac4.dose.vcf
##fileformat=VCFv4.1
##filedate=2022.10.21
##source=Minimac4.v1.0.3
...
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##minimac4_Command=minimac4 --cpus 2 --refHaps chr5.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz --haps 23andme_chr5_eagle.vcf.gz --minRatio 0.01 --prefix 23andme_chr5_minimac4
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1_01A 2_01B 3_02A 4_02B 5_03A 6_03B 7_04A 8_04B 9_05A 10_05B 12_06B 13_07A 14_07B 15_08A 16_08B 17_09A 18_09B 19_10A 20_10B 21_11A 22_11B 23_12A 24_12B25_13A 26_13B 27_14A 28_14B 29_15A 30_15B 31_16A 32_16B 33_17A 34_17B 35_18A 36_18B 37_19A 38_19B 39_20A 40_20B
5 10043 5:10043:T:A T A . PASS AF=0.00242;MAF=0.00242;R2=0.01670;IMPUTED GT:DS 0|0:0.010 0|0:0.010 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.040 0|0:0.040 0|0:0.001 0|0:0.004 0|0:0.004 0|0:0.002 0|0:0.002 0|0:0.007 0|0:0.007 0|0:0.010 0|0:0.010 0|0:0.002 0|0:0.002 0|0:0.001 0|0:0.001 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.001 0|0:0.001 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.001 0|0:0.001 0|0:0 0|0:0 0|0:0.003 0|0:0.003
5 10055 5:10055:T:A T A . PASS AF=0.00112;MAF=0.00112;R2=0.00050;IMPUTED GT:DS 0|0:0.004 0|0:0.004 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.003 0|0:0.003 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.004 0|0:0.004 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.003 0|0:0.003 0|0:0.001 0|0:0.001 0|0:0.003 0|0:0.003 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.002 0|0:0.003 0|0:0.003 0|0:0.002 0|0:0.002 0|0:0.001 0|0:0.001 0|0:0 0|0:0 0|0:0.003 0|0:0.0
Now v4.1.0:
head -n 20 23andme_chr9_minimac4.dose.vcf
##fileformat=VCFv4.2
##filedate=20221020
##source=Minimac v4.1.0
...
##INFO=<ID=IMPUTED,Number=0,Type=Flag,Description="Marker was imputed">
##INFO=<ID=TYPED,Number=0,Type=Flag,Description="Marker was genotyped">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1_01A 2_01B 3_02A 4_02B 5_03A 6_03B 7_04A 8_04B 9_05A 10_05B 12_06B 13_07A 14_07B 15_08A 16_08B 17_09A 18_09B 19_10A 20_10B 21_11A 22_11B23_12A 24_12B 25_13A 26_13B 27_14A 28_14B 29_15A 30_15B 31_16A 32_16B 33_17A 34_17B 35_18A 36_18B 37_19A 38_19B 39_20A 40_20B
9 10163 . CT G . . AF=0;MAF=0;AVG_CS=1;R2=0;IMPUTED DS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 10273 . AAC G . . AF=0;MAF=0;AVG_CS=1;R2=0;IMPUTED DS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 10327 . G G . . AF=0;MAF=0;AVG_CS=1;R2=0;IMPUTED DS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Looking here you can see that minimac previously used to fill in the ID column in the vcf. This is needed for me because when I use DosageConvertor to convert the dosage file to plink format, it uses the ID column from the vcf to create the plink map file with the marker IDs, and so now all the IDs in the map file (2nd column below) are empty ('.') (DosageConvertor produces a .plink.dosage, .plink.fam and .plink.map file)
head 23andme_chr9.plink.map
9 . 0 10163
9 . 0 10273
9 . 0 10327
9 . 0 10329
9 . 0 10362
9 . 0 10469
I am unable to use the plink files due to not having unique variant identifiers. Hope that helps to clarify, and I appreciate your help, thank you.
I forked DosageCoverter and made a fix for plink conversions. Please try https://github.com/jonathonl/DosageConvertor. If this works for you, I will create a pull request in the main repo.
Thanks for responding so quickly - I can confirm the changes you made to DosageConvertor work and I am able to create valid plink files, so creating a pull request ought to be useful to others as well.
I've been looking into the results that minimac v4.1 is generating compared to the previous version though (with the same input data), and I can't understand it but the new version is producing a dosage vcf where many snp alleles are inconsistent with the reference data. The majority of alleles in the dosage file are different, such that in later stages of my analysis when I filter snps which match the reference, hardly anything matches and my results are thus very different/incorrect.
I can't say for sure without spending a lot more time on this (which I don't have at the moment) but often one or both of ref and alt are flipped/complementary in the new minimac version. So at the moment I am going to have to stay with the previous minimac version unless you have any good ideas why this is happening/what I should look for?
Thanks again
I'm very skeptical that Minimac4.1 is changing the reference allele. If this is happening, it is likely happening somewhere else in your pipeline. But if you can provide an example, I'll look into it.
Yes I agree it seems unlikely, but I've put together some files here for you to reproduce it, using the same input file and references. My pipeline runs on Centos7, so I've included 2 Dockerfiles which show how I build minimac v1.0.3 and v4.1.
It should be shared here: https://drive.google.com/file/d/1Zy01T1f4DzcaQVDjq4wan80rrXRPN4Y3/view?usp=sharing
Create a directory and extract the files, then build the docker images:
docker build -t test_103 -f Dockerfile_103 .
docker build -t test_4_1 -f Dockerfile_4_1 .
Run the docker container, bind mounting the current directory with the extracted files, and call minimac v1.0.3:
docker run --rm -it --mount type=bind,source=$(pwd),target=/home/root test_103 /bin/bash
cd /home/root
minimac4 --cpus 8 \
--refHaps chr13.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz \
--haps 23andme_chr13_eagle.vcf.gz \
--minRatio 0.01 \
--prefix 23andme_chr13_minimac1_0_3
Exit and run the other docker container, bind mounting the current directory with the extracted files, and call minimac v4.1:
docker run --rm -it --mount type=bind,source=$(pwd),target=/home/root test_4_1 /bin/bash
cd /home/root
minimac4 chr13.1000g.Phase3.v5.With.Parameter.Estimates.msav \
23andme_chr13_eagle.vcf.gz \
-o 23andme_chr13_minimac4_1.dose.vcf.gz \
--output-format vcf.gz \
--threads 8
If you now exit and compare 23andme_chr13_minimac1_0_3.dose.vcf.gz and 23andme_chr13_minimac4_1.dose.vcf.gz in the current directory, you can see the differences in ref/alt.
I generated the .msav files as instructed, using
minimac4 --update-m3vcf chr13.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz > chr13.1000g.Phase3.v5.With.Parameter.Estimates.msav
If you could take a look and tell me if there is something wrong with how I have built minimac/libraries or the msav reference that would be great.
Ok, I will check this out sometime this week.
Can you check the ref / alt in the reference panels ? Minimac should copy over the values from the panels provided. Maybe there are different?
On Tue, Oct 25, 2022, 8:10 PM Jonathon LeFaive @.***> wrote:
Ok, I will check this out sometime this week.
— Reply to this email directly, view it on GitHub https://github.com/statgen/Minimac4/issues/53#issuecomment-1291437027, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5YQCEBZ5UHKRTIVRQUPJTWFCOIVANCNFSM6AAAAAARIWYFTE . You are receiving this because you are subscribed to this thread.Message ID: @.***>
The .msav reference files are not text files, so I'm unable to check whether the .m3vcf.gz reference files are converted correctly?
Ahh Jonathon might be able to help with that.
On Tue, Oct 25, 2022, 10:12 PM sereeena @.***> wrote:
The .msav reference files are not text files, so I'm unable to check whether the .m3vcf.gz reference files are converted correctly?
— Reply to this email directly, view it on GitHub https://github.com/statgen/Minimac4/issues/53#issuecomment-1291510595, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5YQCDYABFKPWJYGWQBEVDWFC4VBANCNFSM6AAAAAARIWYFTE . You are receiving this because you commented.Message ID: @.***>
Your input target genotypes have the alleles switched when comparing against all 3 reference panels (see below). I'm guessing that v4.0.3 is allowing for this (maybe intentionally) (@Santy-8128 might be able to comment) whereas v4.1.0 is expecting the alleles to exactly match what is in the reference.
$ bcftools view ALL.chr13.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf | grep -v "^#" | head | grep 19020095 | cut -f1-5
13 19020095 rs140871821 C T
$ zcat chr13.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz | grep -v "^#" | head | grep 19020095 | cut -f1-5
13 19020095-19021250 <BLOCK:0-26> . .
13 19020095 13:19020095 C T
$ cget/bin/sav export chr13.1000g.Phase3.v5.With.Parameter.Estimates.msav | grep -v "^#" | head | grep 19020095 | cut -f1-5
13 19020095 . C <BLOCK>
13 19020095 . C T
$ zcat 23andme_chr13_eagle.vcf.gz | grep -v "^#" | head | grep 19020095 | cut -f1-5
13 19020095 rs140871821 T C
Thanks for helping with this! So it looks like there was always something wrong with our pipeline and that minimac 1.0.3 may have taken the alleles from the references while minimac 4.1 is taking it from the eagle input file, where alleles have been flipped?
I suspect it has something to do with plink, I have a bed/bim/fam where everything looks correct, then I convert to vcf with plink (which then goes on to eagle as a bcf, then eagle output goes to minimac). It is in the conversion to vcf where although I use --keep-allele-order with plink, it is still flipping them in the output vcf. So that will take time to investigate, but as this is not a minimac issue you can go ahead and close it, thanks again.