bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

consensus --iupac-codes outputting ambiguity on GT=1/1 sites

Open ribeirots opened this issue 2 years ago • 6 comments

Hi all,

This is similar to an issue posted before to which I didn't find a solution. I am using bcftools consensus with --iupac-codes and the output has ambiguity codes for sites with GT=1/1 that have an alternate allele.

This is an example site: 3 20665665 . G A 895.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.595;DP=25;Dels=0.00;FS=0.000;HaplotypeScore=3.5764;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-1.733;QD=33.09;ReadPosRankSum=-1.179 GT:AD:DP:GQ:PL 1/1:1,24:25:60:924,60,0

Instead of output A my fasta file has output R for this position.

This is the command I used: bcftools consensus -I -m maskedSites.vcf -f Ref.fasta A.vcf.gz > A.fasta

The site shown above is not masked in the maskedSites.vcf file.

This issue was mentioned before in a comment to a closed issue:

Originally posted by @Holusovak in https://github.com/samtools/bcftools/issues/1400#issuecomment-1123330254

ribeirots avatar Jul 19 '22 23:07 ribeirots

The -I option by default does not use genotypes, only looks at REF, ALT columns, this should be requested with -H I instead.

This is not well documented.

pd3 avatar Jul 20 '22 12:07 pd3

Thank for reply. Do you have any tip how create consensus sequence included SNPs and INDEL information?

Holusovak avatar Jul 20 '22 15:07 Holusovak

Can you explain what you mean? The program does include both SNPs and indels, not sure what you mean.

pd3 avatar Jul 20 '22 15:07 pd3

By replacing -I with -H I on the command above I get the following output:

Can't apply 25-th haplotype at 1:40524

This is the site in the vcf file:

1 40524 . T C 34.74 . AC=2;AF=1.00;AN=2;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=30.73;MQ0=0;QD=17.37 GT:AD:DP:GQ:PL 1/1:0,2:2:6:62,6,0

Thank you for the quick reply.

ribeirots avatar Jul 20 '22 17:07 ribeirots

If the indel and SNP are close to each other or they are cover, the folowing SNP is not in vcf file using this command: bcftools mpileup -Ou -f REF.fasta Input.bam | bcftools call -Ou -mv | bcftools norm -fREF.fasta -Oz -o Output.Vcf.gz For my example on vcf flie there are only line for Indel: chr_4 4523733 . CCT C 222 . INDEL;IDV=21;IMF=0.677419;DP=31;VDB=0.741682;SGB=-0.692067;MQSB=1;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=4,7,9,11;MQ=60 GT:PL 0/1:255,0,255 But from BAM file I know that on the position 4523736 is A/T combination. But thanks to indel, samtools ignore position 4523736. So is the solution working only with the SNP and lost INDEL information?

Holusovak avatar Jul 21 '22 05:07 Holusovak

@ribeirots Can you show the full command and ideally provide a small test case to reproduce the problem? Seeing the message, 25th(!) haplotype does not make sense and is either a result of incorrect usage or some kind of memory corruption, definitely odd.

@Holusovak Can you please open a separate issue? Your question is about variant calling and is unrelated to this thread. A small test case would be helpful (a slice of the bam).

pd3 avatar Aug 15 '22 10:08 pd3