bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Bcftools consensus could do with options to control indels.

Open jkbonfield opened this issue 1 year ago • 2 comments

I'm trying what would appear to be the exact thing this tool was invented for - taking GRCh38 and applying a GIAB truth set VCF file to get a sample specific version of the reference.

$ samtools faidx $HREF38 chr1| bcftools consensus HG002_GRCh38_1_22_v4.2_benchmark.chr1.vcf.gz > _.fa
Applied 319348 variants
$ samtools faidx _.fa
$ samtools faidx _.fa chr1:4371241-4371299
>chr1:4371241-4371299
ACAGCAAGGAAGACTTCACTTAAGGCTATTGCAATGGGGGAGAGAGACTGAGCTCAACC

The first issue is this is really arcane! Why can't I specify a region like every other bcftools command? Producing a subset reference and passing that in on stdin baffling. So -r would be useful.

If I look at this region for GRCh38 it's totally different. I assume this is because the consensus subcommand is incorporating indels too so the sequence offsets start to drift. That's a useful default as it does then represent the true reference for the new sample, but it makes some comparisons hard.

I cannot see any way to get a consensus that's in lock step with reference coordinates (for all the caveats that means regarding indels). There are include/exclude options, but it won't let me do -e TYPE=INDEL as INDEL isn't defined in the VCF header.

Finally, while it has an option to mask out specific regions, the truth sets come with lists of regions to include, not exclude. It would be useful if there was an option to specify an include file instead of exclude. Thanks

jkbonfield avatar Sep 13 '22 09:09 jkbonfield

I found that I can use --mark-ins lc --mark-del "*" and tr -d '[a-z]' to remove the insertions and add back the deletions. That works, but maybe this could be easier or documented as a working example. It's a bit of faffing to stitch sequence together, do a tr, and then fold it back again so every line is the same length and faidx works still.

(Actually long term I need to cope with both ins and dels anyway, but for starters I'm doing something simple)

I'm still stuck on the masking thing. Negating / inversion of a bed file is possibly something in bedtools, but I haven't spotted it yet. Likely quicker to write a 1 liner. I see bcftools sometimes has ^reg_file to negate a region list (filter out instead of filter in). It would be useful if the bed file support had the same ability, and was documented per command rather than assuming people look at the Common Options section. I did try looking for that and it's hard as there are dozens of references to the same bit of text making searching in the man page very time consuming.

jkbonfield avatar Sep 13 '22 11:09 jkbonfield

A related issue to this is for ambiguity codes. The IUPAC codes have always been incomplete. We can use R to mean A or G, but there is no ambiguity code for A or gap. Ie if we have two alleles, one with an insertion and one without, then how do we describe this in the consensus? Is there any way to report this? I cannot see one. It's always felt like a fundamental error in IUPAC that such things were not considered.

For samtools consensus I used lowercase letters for that situation, but we don't have a mask option for this here. Also if we use that then we can't also use --mask-ins which I use to track coordinates for a mapping of consensus back to reference. I have the same problem there with samtools consensus, so maybe the right option is a sibling file recording the reference-to- consensus coordinate mapping.

Rodger Staden's ambiguity codes were close to covering this back in 1979, but he was describing uncertainty in calling rather than heterozygosity in samples so they don't fit well. However I think adding new codes or an additional markup (eg optionally setting the top-bit) could work, or providing an alternative for tracking ref-to-cons.

jkbonfield avatar Sep 16 '22 11:09 jkbonfield