htsjdk icon indicating copy to clipboard operation
htsjdk copied to clipboard

VariantContext has inconsistent behavior on getting allele type with spanning deletions

Open rickymagner opened this issue 2 years ago • 0 comments

Description of the issue:

For "multiallelic" sites containing a spanning deletion alleles (represented by *), the getType method in VariantContext behaves somewhat unexpectedly. There have been numerous old discussions surrounding spanning deletion handling in htsjdk, like here and here, but this particular issue is a little more focused. Regardless of what might be the "best" way to handle these special sites, the current situation is quite confusing.

If the * allele is next to a SNP or a deletion, the site is considered an INDEL. But if it's next to an insertion, it's not considered an INDEL. See below for concrete examples.

Your environment:

  • version of htsjdk: 3.0.1
  • version of java: Temurin-17.0.7+7
  • which OS: MacOS Ventura 13.6

Steps to reproduce

Set up the following files for a complete example.

example.fasta:

>chr1
AAAAAAAAAA

Run samtools faidx example.fasta and samtools dict example.fasta > example.dict. Then make the following vcf as example.vcf:

##fileformat=VCFv4.3
##contig=<ID=chr1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample
chr1	3	variant1	A	AAA,*	30	.	.	GT	1/2
chr1	4	variant2	A	C,*	30	.	.	GT	1/2
chr1	5	variant3	AAA	A,*	30	.	.	GT	1/2

Run bcftools view example.vcf -o example.vcf.gz to compress it and bcftools index -t example.vcf.gz. Now it's ready to run in GATK to see what happens. Run gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-snp.vcf.gz -select 'vc.isSNP()' and gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-indel.vcf.gz -select 'vc.isIndel()'

As expected, gatk-snp.vcf.gz looks like (with bcftools view -H):

chr1	4	variant2	A	C,*	30	.	.	GT	1/2

but gatk-indel.vcf.gz looks like:

chr1	5	variant3	AAA	A,*	30	.	.	GT	1/2

In other words, the site with <INS>,* is excluded while <DEL>,* is included when looking for INDEL sites.

Expected behaviour

I don't know what's the best way once and for all to classify VariantContexts with spanning deletions, but to me this seems like a "bug" either way. Whatever is decided in classifying the <INS>,* type should equally apply to <DEL>,*. For what it's worth, my instinct is to expect these sites to both be classified as "Indel" in this way since the spanning deletion is a deletion anyway. This is consistent with the behavior that SelectVariants will also pull out a multiallelic DEL+INS site as "Indel."

Actual behaviour

See above.

rickymagner avatar Oct 02 '23 16:10 rickymagner