bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Feature request: the inverse of `norm --atomize`

Open linnil1 opened this issue 2 years ago • 1 comments

I found normalization will give incorrect result in this case:

                    1
                    |
 a1                 ACGCTCCTCC TCCAGA
 a2                 -------*** ------
 a3                 ---*------ ------
 a4                 ---*---*** ------
sh-5.0# bcftools view bug.vcf
...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a1      a2      a3      a4
a1      7       .       CTCC    C       .       .       .       GT      0       1       0       1
a1      3       .       GC      G       .       .       .       GT      0       0       1       1

In this case, TCC is the repeat element right after a deletion GC -> G. The normalization algorithm will left-align the deletion TCC in position 8-10 to position 5-7 and merge(?) the indel in position 4 at the same time.

VCF

sh-5.0# bcftools norm bug.vcf -f bug.ref.fa
...
##bcftools_normVersion=1.13+htslib-1.13
##bcftools_normCommand=norm -f bug.ref.fa bug.vcf; Date=Fri May 27 08:29:27 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a1      a2      a3      a4
a1      3       .       GCTC    G       .       .       .       GT      0       1       0       1
a1      3       .       GC      G       .       .       .       GT      0       0       1       1
Lines   total/split/realigned/skipped:  2/0/1/0

However, I don't why, the result become GCTC -> G, which is not a TCC deletion nor CTCC deletion either.

I think this is the multi-allelic case GCTCC -> {G, GC, GTCC} which I would possibly expected like so:

#CHROM  POS     ID      REF     ALT        QUAL    FILTER  INFO    FORMAT  a1      a2      a3      a4
a1      3       .       GCTCC    G         .       .       .       GT      0       0       0       1
a1      3       .       GCTCC    GC        .       .       .       GT      0       1       0       0
a1      3       .       GCTCC    GTCC      .       .       .       GT      0       0       1       0

Reproduce

Fasta

root@5d7310b4d3a0:/app# cat bug.ref.fa
>a1
ACGCTCCTCCTCCAGA

VCF

sh-5.0# bcftools view bug.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20220527
##contig=<ID=a1,length=16>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.13+htslib-1.13
##bcftools_viewCommand=view bug.vcf; Date=Fri May 27 08:28:17 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a1      a2      a3      a4
a1      7       .       CTCC    C       .       .       .       GT      0       1       0       1
a1      3       .       GC      G       .       .       .       GT      0       0       1       1

I ran the bcftools in docker with the latest image quay.io/biocontainers/bcftools:1.13--h3a49de5_0

linnil1 avatar May 27 '22 09:05 linnil1

I don't see a problem, the normalization is doing the right thing:

1234567
ACGCTCCTCCTCCAGA   ref
ACGCTCC---TCCAGA   ori 
ACG---CTCCTCCAGA   norm

In the original form the variant is 7:CTCC>C which results in the sequence ACGCTCC---TCCAGA (ACGCTCCTCCAGA), and in the normalized form the variant is 3:GCTC>G which results in the identical sequence ACG---CTCCTCCAGA (ACGCTCCTCCAGA). Such ambiguity is not uncommon in short repeats. One should be comparing only the resulting sequence, not the variants.

However, there is one real problem, each row is treated independently and can result in conflicting haplotypes. For example, the sample a4 cannot have two different deletions at the same position given it's haploid, so in that sense the program could be smarter and implement the reverse of the --atomize operation. However, that has not been done yet. For this. reason I'll leave the issue open and mark it as enhancement.

pd3 avatar Jun 06 '22 14:06 pd3