Remove the `*` ALT Allele Reporting From a g.vcf Before (or After) Genotyping

Open • jon4thin opened this issue 1 year ago • 1 comment

I have a large-scale dataset of WXS .g.vcf files, and I do not have the original FASTA files. I would like to genotype these .g.vcf files into VCFs that will be used in downstream applications. The issue is that the .g.vcf files were created with GATK HaplotypeCaller without the "--disable-spanning-event-genotyping" flag set to true. As a result, * alleles are introduced: a "SNP" is reported at each site that is spanned by an upstream INDEL (an INDEL overlapping a SNP). This is a problem because I am not interested in * genotypes (my understanding is that this information is already stored in the upstream deletion), and because * is not a standard IUPAC base code, so it triggers errors in many downstream applications.

What is the correct way to get rid of these * ALT alleles?

Looking at bcftools merge, I noticed that the -m option accepts **, so I tried this first on the VCF after genotyping:

./bcftools merge -m none,\*\* --force-single Genotyped_Sample.vcf.gz -Oz -o merged_Genotyped_Sample.vcf.gz

and then on the original, pre-genotyped g.vcf:

./bcftools merge -m none,\*\* --force-single Sample.g.vcf.gz -Oz -o merged_Sample.g.vcf.gz

Neither command removed the * ALT alleles.

WHY DON'T YOU JUST REMOVE THE SNPS WITH THE * ALT ALLELE?

The only reason I am hesitant to simply remove these variants with grep, awk, or similar is that I am working with a sequenced trio, which provides variant phasing information. After I extract a single subject from the trio VCF (which contains three subjects: mother, father, and child), I see instances where * ALT variants are present, but the phasing information seems to suggest that they are on the haplotype opposite the upstream INDEL. Here is an example, starting with the variants in the trio VCF:

#CHROM  POS        REF  ALT
chr1    154590147  CCG  C
chr1    154590148  CG   C
chr1    154590149  G    *
chr1    154590149  G    C

and then, when I extract a single subject with bcftools query:

#CHROM  POS        REF  ALT  GT
chr1    154590148  CG   C    0|1
chr1    154590149  G    *    1|0
chr1    154590149  G    C    0|1

Maybe I can just delete these SNPs, because the * annotation was not built with phasing in mind and this is just an artifact?
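The haplotype mismatch described above can be checked programmatically. The following is a minimal Python sketch, not a bcftools feature: the function names `parse_phased_gt` and `star_haplotype_report` are hypothetical, and it assumes a single sample, a single chromosome, fully phased genotypes, records sorted by position, and simple left-anchored deletions (REF and ALT share their leading bases).

```python
def parse_phased_gt(gt):
    """Split a phased GT such as '0|1' into per-haplotype allele indices."""
    return [None if a == "." else int(a) for a in gt.split("|")]


def star_haplotype_report(records):
    """For each called '*' allele, report whether a previously seen deletion
    on the SAME haplotype spans that position.

    records: list of (pos, ref, alts, gt) tuples, sorted by position.
    Returns a list of (pos, haplotype_index, spanned_on_same_haplotype).
    """
    deletions = []  # (first_deleted_pos, last_deleted_pos, haplotype_index)
    report = []
    for pos, ref, alts, gt in records:
        for hap, idx in enumerate(parse_phased_gt(gt)):
            if idx is None or idx == 0:
                continue  # missing or reference call on this haplotype
            alt = alts[idx - 1]
            if alt == "*":
                spanning = [h for start, end, h in deletions if start <= pos <= end]
                report.append((pos, hap, hap in spanning))
            elif len(alt) < len(ref):
                # Assumption: left-anchored deletion, so the deleted bases
                # run from pos + len(alt) to pos + len(ref) - 1.
                deletions.append((pos + len(alt), pos + len(ref) - 1, hap))
    return report


# The single-subject records from the example above.
records = [
    (154590148, "CG", ["C"], "0|1"),
    (154590149, "G", ["*"], "1|0"),
    (154590149, "G", ["C"], "0|1"),
]
print(star_haplotype_report(records))
```

For the example records, the report marks the * call at 154590149 as sitting on the first haplotype while the only spanning deletion in the extract sits on the second, which matches the inconsistency described in the post.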

jon4thin • Aug 14 '24 16:08

If the spanning star allele is the only ALT, then simply removing the record would work. However, that is not sufficient for cases like this, where the star allele shares a record with another ALT allele:

#CHROM  POS        REF  ALT  GT
chr1    154590148  CG   C    0|1
chr1    154590149  G    A,*  1|2
chr1    154590149  G    C    0|1

pd3 • Sep 09 '24 09:09
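The distinction drawn in the reply can be illustrated with a short Python sketch. It is not taken from the thread and is not how bcftools handles these records: `classify_star_record` and `drop_star_allele` are hypothetical helpers, and recoding a * call as missing (".") is an assumption made purely for illustration. Other per-allele fields (for example AD, PL, AC) would also need trimming and are not handled here.

```python
def classify_star_record(alts):
    """Classify a record by how the '*' allele appears in its ALT list."""
    if "*" not in alts:
        return "no_star"
    if all(a == "*" for a in alts):
        return "star_only"  # dropping the whole record is enough
    return "star_with_other_alts"  # dropping the record would lose real ALTs


def drop_star_allele(alts, gt):
    """Remove '*' from the ALT list and renumber the genotype.

    Assumption: a call that pointed at '*' is recoded as missing ('.').
    Returns (new_alts, new_gt).
    """
    keep = [a for a in alts if a != "*"]
    # Map each old allele index to its new index; REF stays 0.
    index_map = {0: "0"}
    for old_idx, alt in enumerate(alts, start=1):
        index_map[old_idx] = "." if alt == "*" else str(keep.index(alt) + 1)
    sep = "|" if "|" in gt else "/"
    new_gt = sep.join(
        "." if a == "." else index_map[int(a)] for a in gt.split(sep)
    )
    return keep, new_gt


# The multiallelic record from the reply: ALT = A,* with GT = 1|2.
print(classify_star_record(["A", "*"]))
print(drop_star_allele(["A", "*"], "1|2"))
```

Whether a missing call, a reference call, or some other recoding is appropriate for the former * allele depends on the downstream tools, so the mapping in `index_map` is the part to adapt.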