bcftools
bcftools copied to clipboard
Feature request: the inverse of `norm --atomize`
I found normalization will give incorrect result in this case:
1
|
a1 ACGCTCCTCC TCCAGA
a2 -------*** ------
a3 ---*------ ------
a4 ---*---*** ------
sh-5.0# bcftools view bug.vcf
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a1 a2 a3 a4
a1 7 . CTCC C . . . GT 0 1 0 1
a1 3 . GC G . . . GT 0 0 1 1
In this case, TCC
is the repeat element right after a deletion GC -> G
.
The normalization algorithm will left-align the deletion TCC
in position 8-10 to position 5-7 and merge(?) the indel in position 4 at the same time.
VCF
sh-5.0# bcftools norm bug.vcf -f bug.ref.fa
...
##bcftools_normVersion=1.13+htslib-1.13
##bcftools_normCommand=norm -f bug.ref.fa bug.vcf; Date=Fri May 27 08:29:27 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a1 a2 a3 a4
a1 3 . GCTC G . . . GT 0 1 0 1
a1 3 . GC G . . . GT 0 0 1 1
Lines total/split/realigned/skipped: 2/0/1/0
However, I don't why, the result become GCTC -> G
, which is not a TCC
deletion nor CTCC
deletion either.
I think this is the multi-allelic case GCTCC -> {G, GC, GTCC}
which I would possibly expected like so:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a1 a2 a3 a4
a1 3 . GCTCC G . . . GT 0 0 0 1
a1 3 . GCTCC GC . . . GT 0 1 0 0
a1 3 . GCTCC GTCC . . . GT 0 0 1 0
Reproduce
Fasta
root@5d7310b4d3a0:/app# cat bug.ref.fa
>a1
ACGCTCCTCCTCCAGA
VCF
sh-5.0# bcftools view bug.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20220527
##contig=<ID=a1,length=16>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.13+htslib-1.13
##bcftools_viewCommand=view bug.vcf; Date=Fri May 27 08:28:17 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a1 a2 a3 a4
a1 7 . CTCC C . . . GT 0 1 0 1
a1 3 . GC G . . . GT 0 0 1 1
I ran the bcftools
in docker with the latest image
quay.io/biocontainers/bcftools:1.13--h3a49de5_0
I don't see a problem, the normalization is doing the right thing:
1234567
ACGCTCCTCCTCCAGA ref
ACGCTCC---TCCAGA ori
ACG---CTCCTCCAGA norm
In the original form the variant is 7:CTCC>C
which results in the sequence ACGCTCC---TCCAGA
(ACGCTCCTCCAGA
), and in the normalized form the variant is 3:GCTC>G
which results in the identical sequence ACG---CTCCTCCAGA
(ACGCTCCTCCAGA
). Such ambiguity is not uncommon in short repeats. One should be comparing only the resulting sequence, not the variants.
However, there is one real problem, each row is treated independently and can result in conflicting haplotypes. For example, the sample a4 cannot have two different deletions at the same position given it's haploid, so in that sense the program could be smarter and implement the reverse of the --atomize
operation. However, that has not been done yet. For this. reason I'll leave the issue open and mark it as enhancement.