bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

consensus should only use the most frequent alternate allele

Open tolot27 opened this issue 3 years ago • 5 comments

I've a vcf entry like:

chr	8083	.	GG	GAA,GA	40008.9	.	AB=0.137227,0.819141;ABP=1627.35,1260.12;AC=1,1;AF=0.5,0.5;AN=2;AO=195,1164;CIGAR=1M1I1X,1M1X;DP=1421;DPB=1521.5;DPRA=0,0;EPP=4.89224,29.8739;EPPR=5.49198;GTI=0;LEN=3,1;MEANALT=4,4;MQM=60,60;MQMR=60;NS=1;NUMALT=2;ODDS=188.157;PAIRED=1,0.993127;PAIREDR=1;PAO=0,0;PQA=0,0;PQR=0;PRO=0;QA=7396,44298;QR=537;RO=14;RPL=113,704;RPP=13.7118,114.076;RPPR=3.63072;RPR=82,460;RUN=1,1;SAF=94,596;SAP=3.55595,4.47287;SAR=101,568;SRF=9;SRP=5.49198;SRR=5;TYPE=complex,snp

Unfortunately, bcftools consensus chooses the first of allele, hence the less frequent one, in this case. But it should use the most frequent one. The parameter -H, --haplotype allows some selection but does not allow to select the most frequent allele. It can only select first/last/longer/shorter/etc. allele but sometimes the most frequent one is the first and sometimes the last. Filtering out low frequent alleles drops the complete line instead of removing one of the alternate alleles.

tolot27 avatar May 12 '21 13:05 tolot27

There are two problems. First, your VCF uses custom annotations and bcftools does not know which of the alleles is most frequent in the pileup. Second, the program was designed to apply a preprocessed VCF as is, not to make filtering decisions; this would be best done upstream of consensus.

pd3 avatar May 19 '21 12:05 pd3

Which annotations would bcftools use to calculate the frequency? I've used freebayes to call the variants.

bcftools filter and bcftools view omit complete lines with more than one alternative allele even if just one allele should be filtered out. Hence, my current workflow is: vcfbreakmulti <in.vcf> | bcftools view -i '<filter>' /dev/stdin -Oz -o <out.vcf.gz>; tabix <out.vcf.gz> to preprocess the data. Afterwards, I'm using bcftools consensus which can't read uncompressed vcf files. Hence, i cannot pipe it directly to bcftools.

I'd love to see if bcftools could handle multi-allelic variations correctly.

tolot27 avatar May 19 '21 13:05 tolot27

The term frequency is used in many meanings and none is considered by bcftools consensus. The tool was written for diploid genomes and in this context it does behave correctly. Anything else it can do is a bonus. To push back a little, why don't you write a simple script to remove the less frequent variant as part of a preprocessing step?

pd3 avatar May 19 '21 13:05 pd3

To push back a little, why don't you write a simple script to remove the less frequent variant as part of a preprocessing step?

That's what I do with vcfbreakmulti as described above. Unfortunately, it creates one more intermediate file because it cannot be piped to bcftools consensus. Would a PR with an additional choice to --haplotype be welcome?

tolot27 avatar May 19 '21 13:05 tolot27

I could imagine that we extend the plugin tag2tag to support a new option --RO-AO-to-AD, similarly to the existing option --QR-QA-to-QS. This would add the standard FORMAT/AD annotation reserved by the VCF specification.

Then the bcftools consensus could be extended to support a new switch which would choose the maximum value of a Number=R tag requested by the user. For example, one possible way to run it would be

bcftools consensus --haplotype 'max(INFO/AD)'
bcftools consensus --haplotype 'max(FORMAT/AD)'

or something like that.

pd3 avatar May 24 '21 14:05 pd3