bcftools
Should bcftools norm --multiallelics - respect lexicographical order?
This is not a bug, but a behavior that could perhaps be improved.
I have noticed that when splitting multiallelic variants in different VCF files, you don't necessarily end up with variants in the same order. Here is an example:
(echo "##fileformat=VCFv4.1"
echo "##contig=<ID=chr19>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr19\t50359054\t.\tC\tT,A\t.\t.\t.") > in.vcf
Now if I split with bcftools norm --multiallelics -:
$ bcftools norm --multiallelics - --no-version in.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr19>
#CHROM POS ID REF ALT QUAL FILTER INFO
chr19 50359054 . C T . . .
chr19 50359054 . C A . . .
Lines total/split/realigned/skipped: 1/1/0/0
But if I further sort the file:
$ bcftools norm --multiallelics - --no-version in.vcf | bcftools sort
Writing to /tmp/bcftools-sort.HmQM3B
Lines total/split/realigned/skipped: 1/1/0/0
Merging 1 temporary files
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr19>
#CHROM POS ID REF ALT QUAL FILTER INFO
chr19 50359054 . C A . . .
chr19 50359054 . C T . . .
Cleaning
Done
The order of the variants has changed. I suppose it would be quite a bit of work to rewrite bcftools norm to sort the variants after splitting them, so that a separate sort would not be required, but I am reporting it nevertheless just in case.
I noticed this because some HTSlib-based tools (such as IMPUTE5 v1.1.4) do not work if the variants are ordered differently across VCFs.
Adding this could be quite straightforward by using vcfbuf_t with a new SORT_BY_ALLELE mode. This would make all output from norm sorted this way.
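Independently of any change to norm, the reordering can be approximated outside bcftools when the input is uncompressed VCF text. The sketch below is only an illustration: sort_by_allele is a hypothetical helper, not a bcftools feature. It assumes records arrive already coordinate-sorted and only reorders rows that share CHROM and POS, by REF and then ALT in byte order (LC_ALL=C). That byte order is an assumption and may not match the ordering bcftools sort uses in every case.

```shell
# Hypothetical helper (not part of bcftools): reorder records that share
# CHROM and POS by REF, then ALT, leaving contig and position order as given.
# Reads uncompressed VCF text on stdin and writes it to stdout.
sort_by_allele() {
  tmp=$(mktemp) || return 1
  cat > "$tmp"
  # Header lines pass through unchanged, ahead of the records.
  grep '^#' "$tmp"
  # Prefix each record with a group number that increases whenever CHROM/POS
  # changes, sort by group, REF, ALT, then drop the prefix again.
  grep -v '^#' "$tmp" |
    awk -F'\t' -v OFS='\t' '{
      if ($1 FS $2 != prev) { grp++; prev = $1 FS $2 }
      print grp, $0
    }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1n -k5,5 -k6,6 |
    cut -f2-
  rm -f "$tmp"
}

# Intended use (requires bcftools on PATH):
#   bcftools norm --multiallelics - --no-version in.vcf | sort_by_allele
```

Because the group number is the primary sort key, records never move across positions or contigs. The helper therefore cannot repair input that is not coordinate-sorted to begin with.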