mocha
Can I Use CEL Files from Axiom Precision Medicine Research Array (PMRA) Release 3 to Generate CHP Files and Obtain VCF Files Annotated with GC, BAF, and LRR Content?
Hello,
I am a beginner in the field of mosaic chromosomal alterations, and I am trying to use the MoChA pipeline with original CEL files generated on the Axiom Precision Medicine Research Array (PMRA) release 3 to produce CHP files. According to the pipeline documentation, for Affymetrix data a CSV manifest file is required, and for CEL files an XML option file and a ZIP file containing all files listed within the XML option file are needed. The documentation also mentions that only the AxiomGT1 and Birdseed analysis types are supported, and that Affymetrix arrays before the Genome-Wide Human SNP Array 6.0 are not supported.
My question is, can I input CEL files from Axiom PMRA release 3 to generate CHP files and obtain VCF files annotated with GC content, BAF, and LRR?
If this is not possible, I have three separate files containing genotype data, GC content for variants, and BAF and LRR data for variants and individuals. Do you have any suggestions or scripts for merging these files to annotate the genotype data with GC content, BAF, and LRR to generate a VCF file for use with Mocha?
I apologize if my questions seem naive; I'm still learning about this field.
Thank you for your assistance.
I am not sure, but I don't think it is possible to generate CHP files for the Axiom PMRA. However, you can generate .calls.txt/.confidences.txt/.summary.txt/.snp-posteriors.txt files and convert them to VCF with the following command:
bcftools +affy2vcf \
--csv <manifest> \
--fasta-ref <fasta> \
--calls AxiomGT1.calls.txt \
--confidences AxiomGT1.confidences.txt \
--summary AxiomGT1.summary.txt \
--snp AxiomGT1.snp-posteriors.txt \
--output AxiomGT1.vcf
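If helpful, a possible follow-up before handing the converted VCF to downstream tools is to compress, sort, and reference-check it with standard bcftools commands. This is only a sketch: the file names match the command above, and <fasta> stands for the same reference FASTA.
# Sketch: sort the converted VCF, check REF alleles against the reference
# (dropping mismatching sites with -c x), and index the result.
bcftools sort -Ob -o AxiomGT1.sort.bcf AxiomGT1.vcf
bcftools norm --no-version -Ob -o AxiomGT1.bcf -c x -f <fasta> AxiomGT1.sort.bcf
bcftools index -f AxiomGT1.bcf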
Dear Dr. Genovese, Thank you for your response during the Easter holiday. I hope you continue to enjoy the remainder of your holiday.
I understand that the initial approach is to convert Affymetrix CEL files to CHP files using apt-probeset-genotype, and then to convert CHP files to VCF format using affy2vcf.
However, in my case, I need to convert Axiom PMRA CEL files to generate .calls.txt, .confidences.txt, .summary.txt, and .snp-posteriors.txt files, which will then be converted to a standardized VCF with GC, LRR, and BAF for Mocha analysis. Do you have any suggestions or scripts that could guide me in converting the Axiom PMRA CEL files to generate these specific files?
Additionally, am I correct in understanding that you do not recommend annotating the genotype data with separate files containing GC content for variants, and files with information for BAF and LRR for variants and individuals?
It seems like all the online documentation has been taken offline by Thermo Fisher. But you can still download the Analysis Power Tools (APT) and look into the file doc/vignette-axiom-probeset-genotype.html
for further instructions. I have never found Axiom data to download and test scripts on, so I have no direct experience with this.
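For orientation only, a rough sketch of what an AxiomGT1 genotyping run with APT can look like is shown below. The library-file names are placeholders, the exact option names should be verified against the vignette above and your APT release, and additional options may be needed to emit the summary and snp-posteriors tables.
# Rough sketch of an apt-genotype-axiom invocation (placeholders in angle brackets;
# verify option names and required library files against the APT vignette).
apt-genotype-axiom \
  --analysis-files-path <path_to_PMRA_r3_library_files> \
  --arg-file <Axiom_PMRA_AxiomGT1.apt2.xml> \
  --cel-files cel_list.txt \
  --out-dir AxiomGT1_output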
Dear Dr. Genovese, thanks for your response. We have the txt files; I will try to convert them to VCF files.
Dear Dr. Genovese,
Under your guidance, I successfully converted various file types (.calls.txt, .confidences.txt, .summary.txt, and .snp-posteriors.txt) into a standardized VCF format incorporating GC, LRR, and BAF values, and have completed the phasing step.
However, I've encountered an issue during the imputation process using IMPUTE5 v1.2.0. I used the GRCh37 reference and have outlined my script below:
panel_pfx="/genetics/GRCh37/ALL.chr"
panel_sfx=".phase3_integrated.20130502.genotypes"
map="/genetics/GRCh37/genetic_map_hg19_withX.txt.gz"
thr="10"
pfx="stroke"
dir="/genetics/dir"
for chr in {1..22} X; do
  zcat $map | sed 's/^23/X/' | awk -v chr=$chr '$1==chr {print chr,".",$4,$2}' > $dir/genetic_map.chr$chr.txt
  impute5 \
    --h $panel_pfx${chr}$panel_sfx.bcf \
    --m $dir/genetic_map.chr$chr.txt \
    --g $dir/$pfx.pgt.bcf \
    --r $chr \
    --o $dir/$pfx.chr$chr.imp.bcf \
    --l $dir/$pfx.chr$chr.log \
    --threads $thr
done
Initially, I encountered an error for each chromosome stating, "You must specify a region or chromosome to impute using --buffer-region". To address this, I modified the script to include a buffer region of 500kb:
buffer_size="500kb" # define buffer size, adjust as needed based on your data characteristics
for chr in {1..22} X; do
  zcat $map | sed 's/^23/X/' | awk -v chr=$chr '$1==chr {print chr,".",$4,$2}' > $dir/genetic_map.chr$chr.txt
  impute5 \
    --h $panel_pfx${chr}$panel_sfx.bcf \
    --m $dir/genetic_map.chr$chr.txt \
    --g $dir/$pfx.pgt.bcf \
    --r $chr \
    --buffer-region $buffer_size \
    --o $dir/$pfx.chr$chr.imp.bcf \
    --l $dir/$pfx.chr$chr.log \
    --threads $thr
done
Despite this adjustment, a new error emerged: "ERROR: Input region needs to be specified as chrX:Y-Z (chromosome ID cannot be extracted)". It appears that IMPUTE5 expects the region to be specified with explicit start and end coordinates rather than as a chromosome name or a plain size.
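For illustration, one way to obtain a region string in the required chr:start-end form is to take the contig length from the reference FASTA index; the FASTA path below is hypothetical and this is only a sketch of the idea.
# Sketch: build a whole-chromosome region string (chr:start-end) from the .fai
# index, where column 2 is the contig length; $chr comes from the loop above.
fasta="/genetics/GRCh37/human_g1k_v37.fasta"   # hypothetical reference path
chr_len=$(awk -v chr=$chr '$1==chr {print $2}' $fasta.fai)
region="$chr:1-$chr_len"   # a value of this form could be passed to --r / --buffer-region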
Could you please assist me in resolving this issue?
Thank you in advance for your time and help.
Dear Dr. Genovese, when I changed the version to IMPUTE5 v1.1.5, the process worked. However, I've noticed that it still requires a significant amount of memory and CPU resources, even when running on a single thread. To address this, I am considering splitting each chromosome into smaller, non-overlapping regions. I've drafted a script to implement this approach and would appreciate your input on its feasibility and efficiency. Here is the script:
region_size=5000000 # define the size of each region (e.g., 5 Mb)
for chr in {1..22} X; do
  zcat $map | sed 's/^23/X/' | awk -v chr=$chr '$1==chr {print chr,".",$4,$2}' > $dir/genetic_map.chr$chr.txt
  start_pos=$(awk 'NR==1 {print $4}' $dir/genetic_map.chr$chr.txt)
  end_pos=$(awk 'END {print $4}' $dir/genetic_map.chr$chr.txt)
  region_files=""
  for ((pos=$start_pos; pos<=$end_pos; pos+=$region_size)); do
    region_end=$((pos + region_size - 1))
    output_file=$dir/$pfx.chr${chr}_${pos}_${region_end}.imp.bcf
    impute5_1.5 \
      --h $panel_pfx${chr}$panel_sfx.bcf \
      --m $dir/genetic_map.chr$chr.txt \
      --g $dir/$pfx.pgt.bcf \
      --r $chr:$pos-$region_end \
      --o $output_file \
      --l $dir/$pfx.chr${chr}_${pos}_${region_end}.log \
      --threads $thr
    region_files+="$output_file "
  done
  bcftools concat -o $dir/$pfx.chr$chr.imp.bcf -O b --threads $thr $region_files
done
Thank you!
IMPUTE5 v1.2.0 is much more memory efficient than IMPUTE5 v1.1.5. Nevertheless, the author discourages running it across whole chromosomes; it should always be run over chromosome chunks. This can get complicated, as it requires overlapping windows, and it is beyond the scope of the current documentation. However, the full approach of scattering into overlapping windows and concatenating back into full chromosomes is implemented in the imputation WDL pipeline, part of the MoChA WDL project.
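For intuition, a simplified sketch of what scattering a chromosome into overlapping windows can look like is shown below; the window and buffer sizes are arbitrary, $chr and $fasta are assumed to be set, and this is not the MoChA WDL logic itself, which also handles concatenating the chunks back together.
# Simplified sketch of overlapping-window chunking (not the MoChA WDL implementation).
# Each chunk covers a core window plus a flanking buffer on each side; the core
# windows tile the chromosome without gaps and are what would be kept when concatenating.
window=5000000   # core window size (arbitrary)
buffer=250000    # flank added on each side (arbitrary)
chr_len=$(awk -v chr=$chr '$1==chr {print $2}' $fasta.fai)
for ((start=1; start<=chr_len; start+=window)); do
  end=$(( start + window - 1 < chr_len ? start + window - 1 : chr_len ))
  buf_start=$(( start - buffer > 1 ? start - buffer : 1 ))
  buf_end=$(( end + buffer < chr_len ? end + buffer : chr_len ))
  echo "core=$chr:$start-$end buffer=$chr:$buf_start-$buf_end"
done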
Dear Dr. Genovese, with your detailed code and responses, I was able to successfully run the MCA calling step. My sincere gratitude for your guidance on MoChA.