bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

VCF to FASTA

Open ksw9 opened this issue 7 years ago • 8 comments

Hi, I'm wondering if there is a best practice for converting a multi-sample VCF file to a multi-sample FASTA using bcftools. I'm interested in generating a FASTA of variant sites only. I've been using bcftools query to loop over every samples and extract called genotypes:

for samp in $(bcftools query -l ${vcf} ); do printf '>'${samp}'\n' bcftools query -s ${samp} -f '[%TGT]' ${vcf} printf '\n' done > test.fa

This works, except that any sites for which the reference is called has a '.', rather than the ref. allele. Is there a way to extract the ref allele genotype at these sites? Thank you for your help!

ksw9 avatar Oct 13 '17 21:10 ksw9

There is the bcftools +setGT plugin which allows to set all missing alleles to reference. For example this replaces missing genotypes (./.) to phased ref genotypes (0|0)

bcftools +setGT in.vcf -- --target-gt . --new-gt 0p

pd3 avatar Oct 17 '17 07:10 pd3

Fantastic-thank you! I use this in combination with bcftools merge, to generate a fasta from VCF (for a haploid organism) as follows: bcftools +setGT --verbose --threads 4 ${vcf} -- --target-gt . --new-gt 0 | bcftools query -s ${samp} -f '[%TGT]'

I get the output message "Filled 45097 alleles" which is standard error, I believe. Thank you!

ksw9 avatar Oct 17 '17 18:10 ksw9

Hi,

And how can I set all missing genotype to an another symbol like 'N' or '?'

ptranvan avatar Mar 13 '18 11:03 ptranvan

You could also use the bcftools consensus command with the --missing option

pd3 avatar Nov 17 '18 16:11 pd3

Hello, I know this thread is a few years old now, but I want to do a similar thing with my SNPs. Currently, I have a fasta file that has a header and sequence for each sample I was working with. It's something like this:

Sample1 T/TT/TG/GT/TC/CG/GT/TT/TT/TG/G ... Sample2 T/TT/TG/GT/TC/CG/GT/TT/TT/TG/G ... Sample3 T/TT/TG/GT/TC/CG/GT/TT/TT/TG/G ...

Ultimately, I want a fasta file where I don't have both haplotypes. I can use the sed command to arbitrarily choose the first haplotype in all cases or the second haplotype in all cases (i.e. sed -i.bak -E 's/(\w)(/)(\w)/\3/g' $FASTA), but I don't know if that's a legitimate move to make.

I see that bcftools concensus has the "--haplotype" option where you have the option to choose the 1st or 2nd allele from the FORMAT/GT field to use. So it might be legitimate, but I'm not finding clear info about it. Any help would be greatly appreciated. Thanks!

Alternatively, I'm going to explore SNPhylo to create the fasta files I need.

sjfleck avatar Oct 29 '21 16:10 sjfleck

@sjfleck I don't understand what is the question.

pd3 avatar Nov 01 '21 07:11 pd3

Sorry for the poor explanation on my end.

I was looking for a way to create a multi-sequence fasta file from VCF data. One sequence for each sample in my VCF file. I followed what ksw9 did, but I have a fasta file that shows the haplotypes like I shared above. I wasn't sure if there was an easy way to replace the haplotypes with IUPAC degenerate nucleotide codes for each sequence. I seems like a relatively simple code to write using global search and replace commands, but I don't usually trust myself to do it flawlessly.

If this is a feature in bcftools, it would be nice to know how it's done. That said, I was able to use SNPhylo to generate this kind of fasta file for my samples already. I've used bcftools for years and if there was an alternate way to do this using it, I would also like to explore that option as well. Thank you

sjfleck avatar Nov 10 '21 17:11 sjfleck

The format shown above (e.g. T/TT/TG/) is not fasta. BCFtools does not work with that, it only works with VCFs and can apply the variation located within onto a fasta sequence.

pd3 avatar Nov 11 '21 17:11 pd3