gtc2vcf
Pseudo-autosomal regions (PAR).
The pseudo-autosomal regions are often annotated in Illumina's CSV manifest as chrom XY. Judging from the code below, gtc2vcf recodes them as chrom X in the output VCF:
https://github.com/freeseek/gtc2vcf/blob/224e7c60b81188342a029ec89f3777537fa7b4f6/gtc2vcf.h#L138-L143
However, realigning to the reference genome is strongly encouraged, as emphasized in the documentation. If Illumina's CSV manifest is used directly, the accuracy of the output depends on the accuracy of the manifest. Sometimes a marker is incorrectly annotated as PAR in the CSV manifest, and the SNP may actually lie in a unique region of the Y chrom.
For example, in the GSA chip, 80+ SNPs are annotated as XY but are actually located in unique regions of the Y chrom.
A few SNPs from the input CSV manifest:

    rs10465468,XY,92708060
    rs112096861,XY,92541266
    rs12401272,XY,3211973
    rs185597746,XY,92386542
    rs188145685,XY,91773744
In the output VCF records:

    rs10465468   chrX  92708060
    rs112096861  chrX  92541266
    rs12401272   chrX  3211973
    rs185597746  chrX  92386542
    rs188145685  chrX  91773744
However, all of these SNPs appear to lie outside the PAR (https://useast.ensembl.org/info/genome/genebuild/human_PARS.html) and map to unique regions of the Y chrom (e.g. https://ncbi.nlm.nih.gov/snp/rs10465468). If the realignment workflow is chosen, the SourceSeq maps uniquely to the Y chrom, which corrects the annotation. Note also that if a SNP does lie within the PAR, the realignment workflow will still annotate it as chrom X, since the PAR is hard-masked on the Y chrom.
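For anyone who wants to screen their own manifest, here is a minimal Python sketch of the check described above. It is not part of gtc2vcf. It assumes the manifest has already been reduced to `name,chrom,pos` triples like the ones shown earlier (the real Illumina CSV has more columns), and the helper names `in_par_x` and `xy_markers_outside_par` are made up for illustration. The PAR boundaries on chrX are the ones I understand the Ensembl page above to list for GRCh38 and GRCh37; please verify them against the build your manifest uses.

```python
# PAR1 and PAR2 boundaries on chrX (1-based, inclusive).
# Assumption: values as listed by Ensembl; check them for your build.
PAR_X = {
    "GRCh38": [(10_001, 2_781_479), (155_701_383, 156_030_895)],
    "GRCh37": [(60_001, 2_699_520), (154_931_044, 155_260_560)],
}


def in_par_x(pos, build="GRCh38"):
    """Return True if a chrX position falls inside PAR1 or PAR2."""
    return any(start <= pos <= end for start, end in PAR_X[build])


def xy_markers_outside_par(records, build="GRCh38"):
    """Return names of markers annotated as XY whose position is not in a PAR.

    `records` is an iterable of 'name,chrom,pos' strings (a simplified,
    hypothetical layout, not the full Illumina manifest format).
    """
    flagged = []
    for rec in records:
        name, chrom, pos = rec.strip().split(",")
        if chrom == "XY" and not in_par_x(int(pos), build):
            flagged.append(name)
    return flagged


records = [
    "rs10465468,XY,92708060",
    "rs112096861,XY,92541266",
    "rs12401272,XY,3211973",
    "rs185597746,XY,92386542",
    "rs188145685,XY,91773744",
]

# All five example markers are flagged under either build.
print(xy_markers_outside_par(records))
```

Markers flagged this way are only candidates for misannotation; the realignment workflow remains the way to find out where the SourceSeq actually maps.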
I thought I would write this up here for the benefit of any other user who runs into the same observation.