bcftools
bcftools copied to clipboard
VCF to FASTA
Hi, I'm wondering if there is a best practice for converting a multi-sample VCF file to a multi-sample FASTA using bcftools. I'm interested in generating a FASTA of variant sites only. I've been using bcftools query to loop over every samples and extract called genotypes:
for samp in $(bcftools query -l ${vcf} ); do printf '>'${samp}'\n' bcftools query -s ${samp} -f '[%TGT]' ${vcf} printf '\n' done > test.fa
This works, except that any sites for which the reference is called has a '.', rather than the ref. allele. Is there a way to extract the ref allele genotype at these sites? Thank you for your help!
There is the bcftools +setGT
plugin which allows to set all missing alleles to reference. For example this replaces missing genotypes (./.
) to phased ref genotypes (0|0
)
bcftools +setGT in.vcf -- --target-gt . --new-gt 0p
Fantastic-thank you! I use this in combination with bcftools merge, to generate a fasta from VCF (for a haploid organism) as follows: bcftools +setGT --verbose --threads 4 ${vcf} -- --target-gt . --new-gt 0 | bcftools query -s ${samp} -f '[%TGT]'
I get the output message "Filled 45097 alleles" which is standard error, I believe. Thank you!
Hi,
And how can I set all missing genotype to an another symbol like 'N' or '?'
You could also use the bcftools consensus
command with the --missing
option
Hello, I know this thread is a few years old now, but I want to do a similar thing with my SNPs. Currently, I have a fasta file that has a header and sequence for each sample I was working with. It's something like this:
Sample1 T/TT/TG/GT/TC/CG/GT/TT/TT/TG/G ... Sample2 T/TT/TG/GT/TC/CG/GT/TT/TT/TG/G ... Sample3 T/TT/TG/GT/TC/CG/GT/TT/TT/TG/G ...
Ultimately, I want a fasta file where I don't have both haplotypes. I can use the sed command to arbitrarily choose the first haplotype in all cases or the second haplotype in all cases (i.e. sed -i.bak -E 's/(\w)(/)(\w)/\3/g' $FASTA), but I don't know if that's a legitimate move to make.
I see that bcftools concensus has the "--haplotype" option where you have the option to choose the 1st or 2nd allele from the FORMAT/GT field to use. So it might be legitimate, but I'm not finding clear info about it. Any help would be greatly appreciated. Thanks!
Alternatively, I'm going to explore SNPhylo to create the fasta files I need.
@sjfleck I don't understand what is the question.
Sorry for the poor explanation on my end.
I was looking for a way to create a multi-sequence fasta file from VCF data. One sequence for each sample in my VCF file. I followed what ksw9 did, but I have a fasta file that shows the haplotypes like I shared above. I wasn't sure if there was an easy way to replace the haplotypes with IUPAC degenerate nucleotide codes for each sequence. I seems like a relatively simple code to write using global search and replace commands, but I don't usually trust myself to do it flawlessly.
If this is a feature in bcftools, it would be nice to know how it's done. That said, I was able to use SNPhylo to generate this kind of fasta file for my samples already. I've used bcftools for years and if there was an alternate way to do this using it, I would also like to explore that option as well. Thank you
The format shown above (e.g. T/TT/TG/
) is not fasta. BCFtools does not work with that, it only works with VCFs and can apply the variation located within onto a fasta sequence.