gtc2vcf
gtc2vcf copied to clipboard
Affy2vcf snp-posterior file
Hi Giulio,
The array Finngen2 Axiom custom array from ThermoFisher was used for genotyping in our project. We would like to convert the .CEL files to a .vcf file to run the MoChA pipeline. We found your answer on https://www.biostars.org/p/9554233/ about this type of array and implemented this apt-genotype-axiom tool to convert our .CEL files to .txt files as input for the affy2vcf tool with this code:
path_to_output_folder="[PATH_TO_PROJECT]/CHP_files"
cel_list_file="[PATH_TO_PROJECT]/CEL_FILES_PASSEDQC.txt"
ref_files="[PATH_TO_PROJECT]/ref_files/Axiom_FinnGen2_Analysis.r1"
apt-genotype-axiom \
--analysis-files-path $ref_files/ \
--arg-file $ref_files/Axiom_FinnGen2_96orMore_Step2.r1.apt-genotype-axiom.AxiomGT1.apt2.xml \
--do-rare-het-adjustment true \
--out-dir $path_to_output_folder \
--cel-files $cel_list_file \
--log-file $path_to_output_folder/genotyping.log \
--genotyping-node:snp-posteriors-output true
However, using this option --genotyping-node:snp-posteriors-output true
, a lot of SNPs are lost in the conversion in the AxiomGT1.calls.txt (n=217,259 SNPs with this --genotyping-node:snp-posteriors-output true
option and n=643,408 SNPs without this option). This is the parameter for this file in the .xml file: --multi-genotyping-node:multi-posteriors-output true
.
Without this --genotyping-node:snp-posteriors-output true
option, no AxiomGT1.snp-posteriors.txt file was created but only the AxiomGT1.snp-posteriors.multi.txt file. Is it possible to run the affy2vcf tool with only the AxiomGT1.snp-posteriors.multi.txt file? Or do I need the AxiomGT1.snp-posteriors.txt as well?
This is how we run the affy2vcf tool:
csv_manifest_file="[PATH_TO_PROJECT]/Axiom_FinnGen2.na36.r1.a1.annot.b37.csv"
ref="[PATH_TO_PROJECT]/human_g1k_v37.fasta"
brondir='[PATH_TO_PROJECT]/CHP_files'
targetdir='[PATH_TO_PROJECT]/1.CHP_to_VCF_conversion'
pfx='converted'
bcftools +affy2vcf \
--csv $csv_manifest_file \
--fasta-ref $ref \
--calls $brondir/AxiomGT1.calls.txt \
--confidences $brondir/AxiomGT1.confidences.txt \
--summary $brondir/AxiomGT1.summary.txt \
--snp $brondir/AxiomGT1.snp-posteriors.multi.txt | \
bcftools sort -Ou -T ./bcftools. | \
bcftools norm --no-version -Ob -c x -f $ref | \
tee $pfx.bcf | \
bcftools index --force --output $pfx.bcf.csi
If we use the 'AxiomGT1.snp-posteriors.multi.txt' file, we get this error:
Could not read VCF/BCF headers from -
Cleaning
Failed to read from standard input: unknown file type
index: "-" is in a format that cannot be usefully indexed
Our questions are:
- Do you know why these SNPs disappear in the first step (running the apt-genotype-axiom tool) using the option
--genotyping-node:snp-posteriors-output true
? - Do we need this AxiomGT1.snp-posteriors.txt file or is the AxiomGT1.snp-posteriors.multi.txt file also enough to run the affy2vcf tool from this github page? If it is possible to run it only with the AxiomGT1.snp-posteriors.multi.txt file, how do we fix this error?
Unfortunately Affymetrix does not do a good job of providing Axiom array data for developers to test their code. I am not sure what the problem with your genotype losses is due to. The option --genotyping-node:snp-posteriors-output true
should really not change anything other than providing an additional output. This might be an Affymetrix bug or maybe your command died while running. The error you show for BCFtools/affy2vcf is likely related to the first command having an issue. Maybe try rerunning without piping into BCFtools/sort to see what problem you get. I also don't understand why you have a .snp-posteriors.multi.txt
rather than a .snp-posteriors.txt
file. Maybe the format is slightly different and that is what is causing the problem. Would it be possible for me to have access to some of these file to test on my end? I would be happy to update the code to make it work as it is most likely a minor format issue if it is an issue with BCFtools/affy2vcf at all.
Thanks for your fast reply! Without piping into BCFtools/sort I got the same error. First, I will try to run the script again and give it more memory and time. I did not get an error because of time or memory limits but maybe it will fix the problem. If that is not the case, I will discuss with my collegues about sharing some (or parts) of our files.
I highly doubt that without piping into BCFtools/sort you still get the error:
Could not read VCF/BCF headers from -
As this is an error message from BCFtools/sort
Ah yes I did something wrong. This is the error from the affy2vcf tool if I use the AxiomGT1.snp-posteriors.multi.txt file as input:
Reading SNP posteriors file /groups/umcg-lifelines/tmp01/projects/ov19_0503/MoChA_pipeline_UGLI2/0.CEL_to_CHP_conversion/CHP_files_genotyping/AxiomGT1.snp-posteriors.multi.txt
Malformed header line in SNP model file /groups/umcg-lifelines/tmp01/projects/ov19_0503/MoChA_pipeline_UGLI2/0.CEL_to_CHP_conversion/CHP_files_genotyping/AxiomGT1.snp-posteriors.multi.txt:
probeset_id copynumber nAlleles cluster mean nObsMean covariance nObsVar
If I open the AxiomGT1.snp-posteriors.multi.txt file this is the header:
probeset_id copynumber nAlleles cluster mean nObsMean covariance nObsVar
id_1 2 3 AA 13.138457,11.458281,11.744538 0.200000 0.030000,0.000000,0.000000,0.030000,0.000000,0.030000 1.000000
and this is the header of the file AxiomGT1.snp-posteriors.txt:
id BB AB AA CV
id_1 -2.861181,0.037860,8476.672760,8476.672760,9.261750,0.068677,-0.020879 0.024676,0.037860,13887.279746,13887.279746,10.130948,0.068677,-0.003455 2.733218,0.037860,5945.047494,5945.047494,9.343756,0.068677,0.005921 -0.1,-0.1,-0.1,-0.05,-0.1,-0.05,0,0,0,0,0,0
I think these files contain different information?
This does not seem to make much sense. Are you sure your header isn't something like this for AxiomGT1.snp-posteriors.txt:
id BB AB AA CV
id_1 -2.861181,0.037860,8476.672760,8476.672760,9.261750,0.068677,-0.020879 0.024676,0.037860,13887.279746,13887.279746,10.130948,0.068677,-0.003455 2.733218,0.037860,5945.047494,5945.047494,9.343756,0.068677,0.005921 -0.1,-0.1,-0.1,-0.05,-0.1,-0.05,0,0,0,0,0,0
It just does not seem like the header and what comes after that have the same number of columns, which would definitely be a bug. What version of Affymetrix Power Tools are you using?
Also, if you have both the AxiomGT1.snp-posteriors.txt
and the AxiomGT1.snp-posteriors.multi.txt
files, why are you using the latter with BCFtools/affy2vcf?
Yes indeed, I checked it and something went wrong with copying the head of this file. I changed it in my previous post.
The problem is when I use the option --genotyping-node:snp-posteriors-output true
, a lot of SNPs disappear in conversion. When I leave this option out, I only have the AxiomGT1.snp-posteriors.multi.txt
file, which cannot be used as input for the affy2vcf tool (as described in the first post).
I'm running the apt-genotype-axiom tool now again with more memory and time but it will take some days before I have the result. Hopefully, altough I did not get a memory/time limit error, this will fix the problem of losing so much SNPs in the conversion. I will keep you updated.
Can you also share the correct first two lines of the AxiomGT1.snp-posteriors.multi.txt
file? As it stands edited at this moment it seems as if the files AxiomGT1.snp-posteriors.txt
and AxiomGT1.snp-posteriors.multi.txt
files have the same data on the second line. I want to understand whether it would be possible to parse the AxiomGT1.snp-posteriors.multi.txt
file
The previous comment is updated with the header from the AxiomGT1.snp-posteriors.multi.txt
file.
It is still extremely puzzling as the header has seven columns and the first line of the table has eight columns. And it is quite unclear what column the AA
entry should correspond to