bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

bcftools fails to genotype MNP/indels from targeted-sequencing reads

Open haochenz96 opened this issue 2 years ago • 4 comments

Hi,

Given a list of known alleles, I was trying to genotype REF/ALT read counts in some samples' BAM files. Relevant to the issue below is that these are libraries generated by targeted amplicon-sequencing. I used the following command:

bcftools mpileup \
            -Ou \
            -R {params.PANEL_AMPLICON} \
            -f {params.REF_GENOME} \
            --annotate FORMAT/AD,FORMAT/DP,INFO/AD \
            --max-depth 100000 \
            --max-idepth 100000 \
            --no-BAQ \
            --tandem-qual 10000 \
            {input.SC_BAM} | \
        bcftools call \
            --keep-alts \
            -C alleles \
            -T {input.CANDIDATE_ALLELE} \
            --multiallelic-caller \
            -Oz \
            -o {output.SC_MPILEUP_VCF} && \
        tabix {output.SC_MPILEUP_VCF}
        """

But for ALL of the MNP/Indels in the given allele list, the output VCF contains 0 ALT read in FORMAT/AD despite they show up in the INFO/AD field, as shown for the chr12:25398284 locus below:

12      25398284        .       CC      AT      146     .       DP=59;AD=36,19;VDB=1.29641e-11;SGB=-0.69168;RPB=1;MQB=1;BQB=0.825401;MQ0F=0;AC=0;AN=2;DP4=0,36,0,19;MQ=60       GT:PL:DP:AD     0/0:136,244,255:55:36,0
12      46230514        .       T       C       282     .       DP=599;AD=595,0;MQ0F=0;AC=0;AN=2;DP4=595,0,0,0;MQ=60    GT:PL:DP:AD     0/0:0,255,255:595:595,0
12      46243885        .       C       G       189     .       DP=418;AD=174,235;VDB=0;SGB=-0.693147;RPB=0.720526;MQB=1;BQB=0.999993;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=174,0,235,0;MQ=60      GT:PL:DP:AD     0/1:222,0,197:409:174,235

Note that this applies to every indel/MNP so is very likely caused by some call setting not compatible with the sequencing technology. I feel like this could be a easy fix by disabling some base/mapping quality filters, since my goal is to simply get the REF/ALT counts for each known allele. I would appreciate any help!

haochenz96 avatar Apr 24 '22 19:04 haochenz96

It is possible that indel AD counts will be improved in this pull request https://github.com/samtools/bcftools/pull/1679

pd3 avatar May 02 '22 12:05 pd3

Thank you! I assume I would clone from here? https://github.com/jkbonfield/bcftools/tree/indel-tweak-jkb1

haochenz96 avatar May 03 '22 01:05 haochenz96

That's correct. The pull request will be integrated into develop, but may take a while.

pd3 avatar May 05 '22 14:05 pd3

Hey! Thanks again for the info. While waiting for the pull request, I came up with a temporary (ugly) solution. See here my code:

        # @STEP1 mpileup and call
        bcftools mpileup \
            -Ou \
            -R {params.PANEL_BED} \
            -f {params.REF_GENOME} \
            --annotate FORMAT/AD,FORMAT/DP,INFO/AD \
            --max-depth 100000 \
            --max-idepth 100000 \
            --no-BAQ \
            {input.SC_BAM} | \
        bcftools call \
            --keep-alts \
            -C alleles \
            -T {input.CANDIDATE_ALLELE} \
            --multiallelic-caller \
            -Oz \
            -o {output.SC_MPILEUP_VCF} && \
        tabix {output.SC_MPILEUP_VCF}

        # @STEP2 get raw counts
        # extract INFO/AD into a tab-delimited annotation file
        SAMPLE_NAME=$(bcftools query -l {output.SC_MPILEUP_VCF})

        bcftools query -f '%CHROM\t%POS\t%AD{{0}},%AD{{1}}\n' {output.SC_MPILEUP_VCF} | bgzip -c > $OUT_DIR/annot.txt.gz && \
        tabix -s1 -b2 -e2 $OUT_DIR/annot.txt.gz

        echo -e '##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Total allelic depths (high-quality bases)">' >> $OUT_DIR/hdr.txt
        bcftools annotate \
            -s $SAMPLE_NAME \
            -a $OUT_DIR/annot.txt.gz \
            -h $OUT_DIR/hdr.txt \
            q \
            -Oz \
            -o {output.SC_MPILEUP_VCF_raw} \
            {output.SC_MPILEUP_VCF}

Essentially I let the bcftools call proceed as it does, and then I extract the raw REF & ALT read counts (first two fields of AD) for every mutated site of interest from the VCF output. I assume this is fine since every genomic site I input is biallelic. Do you think this will work?

haochenz96 avatar Jun 01 '22 01:06 haochenz96

Have you been able to apply your workaround successfully, @haochenz96? I see the same issue in my data. However, your workaround does not work for me. I am looking at simple 1-nt insertions or deletions. In all these cases, AC=0 and the INFO/AD field does not show any depth for an alternate allele. My data is hybrid capture, but that should not make a difference. Maybe your CC/AT genotype is a special case?

Example:

bcftools mpileup -B -d 10000 -O u -f genome.fa -R regions.bed --annotate FORMAT/AD,FORMAT/DP,INFO/AD sample.bam | \
     bcftools call -A -O v -C alleles -m -T als.tsv.gz > bcftools.vcf

1	46670206	.	TC	T	281.579	.	DP=839;AD=588,0;MQ0F=0;AC=0;AN=2;DP4=312,276,0,0;MQ=60	GT:PL:DP:AD	0/0:0,255,255:588:588,0

gabeng avatar Oct 31 '22 14:10 gabeng

Hi Ben- check out this issue #1711 regarding multiallelic site. Yours looks like a multiallelic site to me as the DP in FORMAT does not match that in INFO.

Otherwise, feel free to upload a test case.

haochenz96 avatar Oct 31 '22 15:10 haochenz96

@gabeng Indels may be missed because of high depth, try increasing -L

  -L, --max-idepth INT    Maximum per-file depth for INDEL calling [250]

pd3 avatar Nov 02 '22 08:11 pd3

Unfortunately, the -L option does not change the result. I am attaching a small test case with three Indels that should be called. OPS-3885.zip

gabeng avatar Nov 02 '22 19:11 gabeng

Really? I am using the latest github version on the develop branch and this is what I am observing with your data:

$ bcftools mpileup -f hs37d5.fa test.bam -r 1:46670206 -L 1000 -a AD,INFO/AD |
   bcftools call -mv |
   bcftools query -f '%CHROM:%POS %REF %ALT %AD [ %GT] [ %AD]\n'

1:46670206 TCC TC  161,166  0/1  161,166

pd3 avatar Nov 04 '22 14:11 pd3

You are absolutely right, when simply calling any variants in the data, the Indels (or most of them) are correctly called. However, this issue is about the (forced) genotyping at predefined coordinates using the options -C and -T.

gabeng avatar Nov 04 '22 15:11 gabeng

I see it now, thanks for the test case. This should be fixed by https://github.com/samtools/bcftools/commit/815fd54e261ee400010082129ff860ff7023c5d0.

By the way, please open a separate issue next time and reference a related issue rather than append to an existing one.

pd3 avatar Nov 07 '22 09:11 pd3

Sorry, I honestly thought the original issues would be the same as mine because the command line options were similar. I did not notice the two AD values of the Indel in the original post. Thank you for the quick fix. I tried it out and have two questions:

I would have expected that the Indels are called exactly as in the input als.tsv, however, they are not:

# output of current develop branch
1	46670206	.	TCC	TC
1	172328767	.	T	TA
1	204502514	.	TTCT	TTCTGAAACAGGGTCT

# output of bcftools-1.16
1	46670206	.	TC	T
1	172328767	.	T	TA
1	204502514	.	T	TTCTGAAACAGGG

Should the output of bcftools mpileup/call generally be normalized downstream?

Another thing I noticed (and this may be unrelated): when I use smaller target regions with the develop version that are only 25bp +/- around the Indels in the -R option, 2 of 3 Indels are not called. The BED file and the VCF file are attached: original-targets.zip

Is 25bp padding around Indels not sufficient?

gabeng avatar Nov 07 '22 14:11 gabeng

Normalization is a separate step, yes, just stream through bcftools norm. Regarding the difference, maybe the 1.16 version just inserted a missing site via the -i option that it could not match, not sure.

I was not able to reproduce the problem with the missing sites 1:172328767 and 1:204502514. If it's really not working for you with the new develop version, please open a new issue and include the full command.

pd3 avatar Nov 07 '22 14:11 pd3