bcftools
bcftools copied to clipboard
consensus should only use the most frequent alternate allele
I've a vcf entry like:
chr 8083 . GG GAA,GA 40008.9 . AB=0.137227,0.819141;ABP=1627.35,1260.12;AC=1,1;AF=0.5,0.5;AN=2;AO=195,1164;CIGAR=1M1I1X,1M1X;DP=1421;DPB=1521.5;DPRA=0,0;EPP=4.89224,29.8739;EPPR=5.49198;GTI=0;LEN=3,1;MEANALT=4,4;MQM=60,60;MQMR=60;NS=1;NUMALT=2;ODDS=188.157;PAIRED=1,0.993127;PAIREDR=1;PAO=0,0;PQA=0,0;PQR=0;PRO=0;QA=7396,44298;QR=537;RO=14;RPL=113,704;RPP=13.7118,114.076;RPPR=3.63072;RPR=82,460;RUN=1,1;SAF=94,596;SAP=3.55595,4.47287;SAR=101,568;SRF=9;SRP=5.49198;SRR=5;TYPE=complex,snp
Unfortunately, bcftools consensus
chooses the first of allele, hence the less frequent one, in this case. But it should use the most frequent one. The parameter -H, --haplotype
allows some selection but does not allow to select the most frequent allele. It can only select first/last/longer/shorter/etc. allele but sometimes the most frequent one is the first and sometimes the last.
Filtering out low frequent alleles drops the complete line instead of removing one of the alternate alleles.
There are two problems. First, your VCF uses custom annotations and bcftools
does not know which of the alleles is most frequent in the pileup. Second, the program was designed to apply a preprocessed VCF as is, not to make filtering decisions; this would be best done upstream of consensus
.
Which annotations would bcftools
use to calculate the frequency? I've used freebayes to call the variants.
bcftools filter
and bcftools view
omit complete lines with more than one alternative allele even if just one allele should be filtered out. Hence, my current workflow is: vcfbreakmulti <in.vcf> | bcftools view -i '<filter>' /dev/stdin -Oz -o <out.vcf.gz>; tabix <out.vcf.gz>
to preprocess the data. Afterwards, I'm using bcftools consensus
which can't read uncompressed vcf files. Hence, i cannot pipe it directly to bcftools
.
I'd love to see if bcftools
could handle multi-allelic variations correctly.
The term frequency is used in many meanings and none is considered by bcftools consensus
. The tool was written for diploid genomes and in this context it does behave correctly. Anything else it can do is a bonus. To push back a little, why don't you write a simple script to remove the less frequent variant as part of a preprocessing step?
To push back a little, why don't you write a simple script to remove the less frequent variant as part of a preprocessing step?
That's what I do with vcfbreakmulti
as described above. Unfortunately, it creates one more intermediate file because it cannot be piped to bcftools consensus
. Would a PR with an additional choice to --haplotype
be welcome?
I could imagine that we extend the plugin tag2tag
to support a new option --RO-AO-to-AD
, similarly to the existing option --QR-QA-to-QS
. This would add the standard FORMAT/AD annotation reserved by the VCF specification.
Then the bcftools consensus
could be extended to support a new switch which would choose the maximum value of a Number=R tag requested by the user. For example, one possible way to run it would be
bcftools consensus --haplotype 'max(INFO/AD)'
bcftools consensus --haplotype 'max(FORMAT/AD)'
or something like that.