
Silent imputation of missing variants

Open JosephLalli opened this issue 1 year ago • 2 comments

Hello,

As an experiment, I took the VCF of the draft human pangenome, unphased it with bcftools +setGT, and then rephased it using different reference panels. I am using shapeit5's switch tool to assess phasing accuracy, with the rephased dataset as the estimated VCF and the original pangenome VCF as the verification VCF. The advantage of this method is that it eliminates genotyping as a source of error, since the input VCF was generated from the verification VCF.

Switch, however, is detecting sporadic genotyping errors. These errors occur at sites with a ./1 genotype.

Such sites arise at variants that combine an indel and a SNP, like so:

Ref:
AATCGTCTGTC
Sample:
AA------GTC
AATTGTCTGTC

After using bcftools norm to split multiallelic sites, the pangenome VCF represents this region as:

chr1 2 ATTGTCT A 1|0
chr1 4 T C .|1

Shapeit seems to be interpreting the .|1 call as a missing genotype and imputing the genotype at the site as 0/0. I don't think that is functioning as intended.
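To gauge how many sites are affected, half-missing genotypes such as .|1 or ./1 can be listed before phasing. The following is a minimal Python sketch, not part of shapeit5 or bcftools; the function names `is_half_missing` and `half_missing_sites` are hypothetical, and it assumes uncompressed, tab-delimited VCF text lines with a GT entry in the FORMAT column.

```python
import re


def is_half_missing(gt: str) -> bool:
    """Return True if some, but not all, alleles in a GT string are '.'."""
    alleles = re.split(r"[|/]", gt)
    n_missing = sum(allele == "." for allele in alleles)
    return 0 < n_missing < len(alleles)


def half_missing_sites(vcf_lines):
    """Yield (chrom, pos, sample_index, gt) for each half-missing genotype.

    Assumes plain-text, tab-delimited VCF lines (hypothetical helper).
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample columns
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_idx = format_keys.index("GT")
        for sample_idx, sample in enumerate(fields[9:]):
            parts = sample.split(":")
            if gt_idx >= len(parts):
                continue  # trailing FORMAT fields were dropped
            gt = parts[gt_idx]
            if is_half_missing(gt):
                yield fields[0], int(fields[1]), sample_idx, gt


# The two records from the example above, written as full VCF lines.
records = [
    "chr1\t2\t.\tATTGTCT\tA\t.\t.\t.\tGT\t1|0",
    "chr1\t4\t.\tT\tC\t.\t.\t.\tGT\t.|1",
]
hits = list(half_missing_sites(records))
```

Here `hits` contains only the chr1:4 record, since fully missing calls like ./. are deliberately not flagged.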

What is the best way to handle sites like this? If I used the atomize option of bcftools norm, I could represent the deletion allele as "*". Would shapeit recognize this?

Thanks!

JosephLalli, Jul 28 '23 19:07