shapeit5
Silent imputation of missing variants
Hello,
As an experiment, I took the VCF of the draft human pangenome, used bcftools +setGT to unphase it, and then rephased it using different reference panels. I am using shapeit5's switch tool to assess phasing accuracy, with the rephased dataset as the estimated VCF and the original pangenome VCF as the verification VCF. The advantage of this method is that it eliminates genotyping as a source of error, since the input VCF was generated from the verification VCF.
Switch, however, is detecting sporadic genotyping errors. These errors occur at sites with a half-missing ./1 genotype.
These sites arise at variants that combine an indel and a SNP, like so:
Ref:
AATCGTCTGTC
Sample:
AA------GTC
AATTGTCTGTC
After using bcftools norm to split multiallelic sites, the pangenome VCF represents this region as:
chr1 2 ATTGTCT A 1|0
chr1 4 T C .|1
Shapeit seems to be treating the .|1 call as missing and imputing the genotype at the site as 0/0. I don't think that is the intended behavior.
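In case it helps to reproduce this, here is a minimal Python sketch for listing the half-missing genotypes in a VCF before phasing. It is only an illustration: the function name `half_missing_sites`, the sample name `SAMPLE1`, and the placeholder ID/QUAL/FILTER/INFO columns are made up for the example, and it assumes an uncompressed, tab-delimited VCF with diploid calls. It does not reflect how shapeit itself parses genotypes.

```python
import re


def half_missing_sites(vcf_text):
    """Return (chrom, pos, sample, gt) for diploid genotypes with exactly one missing allele."""
    hits = []
    samples = []
    for line in vcf_text.splitlines():
        # Skip meta-information lines and blank lines.
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        for i, sample_field in enumerate(fields[9:]):
            parts = sample_field.split(":")
            if gt_idx >= len(parts):
                continue
            gt = parts[gt_idx]
            # Split on either the phased (|) or unphased (/) separator.
            alleles = re.split(r"[|/]", gt)
            if len(alleles) == 2 and alleles.count(".") == 1:
                name = samples[i] if i < len(samples) else f"sample{i}"
                hits.append((fields[0], int(fields[1]), name, gt))
    return hits


# The two records from this post, with hypothetical placeholder columns.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "SAMPLE1"]),
    "\t".join(["chr1", "2", ".", "ATTGTCT", "A", ".", ".", ".", "GT", "1|0"]),
    "\t".join(["chr1", "4", ".", "T", "C", ".", ".", ".", "GT", ".|1"]),
])

if __name__ == "__main__":
    for chrom, pos, sample, gt in half_missing_sites(EXAMPLE_VCF):
        print(chrom, pos, sample, gt)
```

Run on the two records above, this flags only the chr1:4 SNP record, which is the kind of site where switch reports a genotype mismatch.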
What is the best way to handle sites like this? If I used the atomize option of bcftools norm, I could represent the deletion allele as "*". Would shapeit recognize this?
Thanks!