should bcftools norm --multiallelics - respect lexicographical order?

Open: freeseek opened this issue 3 years ago • 1 comment

This is not an issue as such, but a behavior that could perhaps be improved.

I have noticed that when splitting multiallelic variants in different VCF files, you don't necessarily end up with variants in the same order. Here is an example:

(echo "##fileformat=VCFv4.1"
echo "##contig=<ID=chr19>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr19\t50359054\t.\tC\tT,A\t.\t.\t.") > in.vcf

Now if I split with bcftools norm --multiallelics -:

$ bcftools norm --multiallelics - --no-version in.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr19>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr19	50359054	.	C	T	.	.	.
chr19	50359054	.	C	A	.	.	.
Lines   total/split/realigned/skipped:	1/1/0/0

But if I further sort the file:

$ bcftools norm --multiallelics - --no-version in.vcf | bcftools sort
Writing to /tmp/bcftools-sort.HmQM3B
Lines   total/split/realigned/skipped:	1/1/0/0
Merging 1 temporary files
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr19>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr19	50359054	.	C	A	.	.	.
chr19	50359054	.	C	T	.	.	.
Cleaning
Done

The order of the variants has changed. I suppose it would be quite a bit of work to rewrite bcftools norm so that it sorts the variants after splitting them, which would remove the need for a separate sort, but I am reporting this nevertheless just in case.

I noticed this because some HTSlib-based tools (such as IMPUTE5 v1.1.4) do not work if the variants are ordered differently across VCFs.

freeseek, May 09 '21 23:05

Adding this could be quite straightforward by using vcfbuf_t with a new SORT_BY_ALLELE mode. This would make all output from norm sorted this way.
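
To illustrate the buffering idea, here is a minimal Python sketch. It is not bcftools code and does not use vcfbuf_t. The function name `sort_ties_by_allele` is hypothetical. It assumes tab-separated VCF text that is already sorted by position, and it assumes that ordering ties by plain string comparison of (REF, ALT) is the desired rule, which matches the sorted output shown above for this example. Records sharing CHROM and POS are held in a small buffer and flushed in allele order when the site changes.

```python
def sort_ties_by_allele(lines):
    """Yield VCF lines, reordering records that share CHROM and POS by (REF, ALT).

    Hypothetical helper: assumes tab-separated text, header lines first,
    and records already ordered by position.
    """
    def allele_key(line):
        fields = line.rstrip("\n").split("\t")
        return fields[3], fields[4]  # REF, ALT compared as plain strings

    buffer = []          # records seen so far at the current site
    current_site = None  # (CHROM, POS) of the buffered records
    for line in lines:
        if line.startswith("#"):
            yield line   # header lines pass through untouched
            continue
        fields = line.split("\t", 2)
        site = (fields[0], fields[1])
        if site != current_site:
            # Site changed: flush the previous site in allele order.
            yield from sorted(buffer, key=allele_key)
            buffer = []
            current_site = site
        buffer.append(line)
    # Flush whatever is left at the last site.
    yield from sorted(buffer, key=allele_key)


if __name__ == "__main__":
    # The split records from the example above, in the order norm emitted them.
    split_output = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr19\t50359054\t.\tC\tT\t.\t.\t.\n",
        "chr19\t50359054\t.\tC\tA\t.\t.\t.\n",
    ]
    for out_line in sort_ties_by_allele(split_output):
        print(out_line, end="")
```

Because only the records at a single position are buffered, memory use stays small and the output can still be streamed, unlike a full sort of the file.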

pd3, May 19 '21 13:05