
norm remove duplicates doesn't handle SVLEN, removes non-duplicate symbolic variants

Open davmlaw opened this issue 1 year ago • 0 comments

The following VCF contains three deletions of 1 kb, 2 kb and 3 kb, all at the same position:

##fileformat=VCFv4.1
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##contig=<ID=NC_000012.11,length=141213431>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
NC_000012.11	88520131	23651	C	<DEL>	.	.	SVLEN=-1000;SVTYPE=DEL
NC_000012.11	88520131	24042	C	<DEL>	.	.	SVLEN=-2000;SVTYPE=DEL
NC_000012.11	88520131	24043	C	<DEL>	.	.	SVLEN=-3000;SVTYPE=DEL

Running the command below, even with "exact", removes records that share the same chrom/pos/ref/alt, even though their SVLEN values differ (so they are separate variants):

bcftools norm --remove-duplicates --rm-dup=exact symbolic_uniq.vcf

If this is difficult to fix, it would be good to at least raise a warning, as the current behavior is silent data loss. Thanks
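
In the meantime, a pre-check along these lines could flag the affected records before running norm. This is only a rough sketch in plain Python, not how bcftools works internally: the helper names (parse_info, find_svlen_collisions) are made up for illustration, it reads uncompressed VCF text lines, and it treats a record as symbolic only if ALT is wrapped in angle brackets.

```python
from collections import defaultdict


def parse_info(info):
    """Turn 'SVLEN=-1000;SVTYPE=DEL' into {'SVLEN': '-1000', 'SVTYPE': 'DEL'}."""
    fields = {}
    if info == ".":
        return fields
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def find_svlen_collisions(vcf_lines):
    """Hypothetical helper: group symbolic records by (CHROM, POS, REF, ALT)
    and return only the groups whose SVLEN values differ."""
    groups = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        chrom, pos, vid, ref, alt = cols[:5]
        info = cols[7] if len(cols) > 7 else "."
        # Assumption: symbolic alleles look like <DEL>, <DUP>, ...
        if not (alt.startswith("<") and alt.endswith(">")):
            continue
        groups[(chrom, pos, ref, alt)].append((vid, parse_info(info).get("SVLEN")))
    # Keep only keys where more than one distinct SVLEN was seen
    return {
        key: records
        for key, records in groups.items()
        if len({svlen for _, svlen in records}) > 1
    }


# The three records from this issue, tab-separated
example = [
    "\t".join(["NC_000012.11", "88520131", "23651", "C", "<DEL>", ".", ".", "SVLEN=-1000;SVTYPE=DEL"]),
    "\t".join(["NC_000012.11", "88520131", "24042", "C", "<DEL>", ".", ".", "SVLEN=-2000;SVTYPE=DEL"]),
    "\t".join(["NC_000012.11", "88520131", "24043", "C", "<DEL>", ".", ".", "SVLEN=-3000;SVTYPE=DEL"]),
]

collisions = find_svlen_collisions(example)
for key, records in collisions.items():
    print("same chrom/pos/ref/alt but different SVLEN:", key, records)
```

On the three records above, this reports one group with three different SVLEN values, which are the records that would be collapsed. Adding SVLEN to the comparison key for symbolic alleles seems like the natural fix, but that is a guess on my part.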

bcftools --version
bcftools 1.20
Using htslib 1.20

davmlaw, May 09 '24 03:05