parsnp icon indicating copy to clipboard operation
parsnp copied to clipboard

Invalid VCF output

Open jodyphelan opened this issue 10 months ago • 1 comments

Thanks for a great tool.

At the moment the VCF file produced is not compatible with bcftools. This could suggest it doesn't strictly adhere to the vcf specification. It also seems that it is not outputing the correct reference allele in the VCF. Here is an example you can use to recreate the issue:

# download data
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/959/515/GCF_900959515.1_PRJEB31972-1/GCF_900959515.1_PRJEB31972-1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/756/095/GCF_016756095.1_ASM1675609v1/GCF_016756095.1_ASM1675609v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/587/595/GCF_016587595.1_ASM1658759v1/GCF_016587595.1_ASM1658759v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/219/285/GCF_002219285.1_ASM221928v1/GCF_002219285.1_ASM221928v1_genomic.fna.gz

# extract the chromosome sequence
samtools faidx GCF_002219285.1_ASM221928v1_genomic.fna NZ_CP015278.1 > ref.fa

# run parsnp
parsnp -r ref.fa -d GCF_016587595.1_ASM1658759v1_genomic.fna GCF_016756095.1_ASM1675609v1_genomic.fna GCF_900959515.1_PRJEB31972-1_genomic.fna --vcf -o parsnp

# test to see if bcftools will load
bcftools view parsnp/parsnp.vcf
# should give "Failed to read from parsnp/parsnp.vcf: unknown file type"

# get reference allele of first variant from vcf file
grep -A 1 CHROM parsnp/parsnp.vcf | tail -1 | cut -f2,4
# should give "77	A"

# check fasta for allele at position 77
samtools faidx ref.fa NZ_CP015278.1:77-77
# should give:
# ">NZ_CP015278.1:77-77
# C"

I checked the first few variants and it looks like the ref allele in the VCF is always the allele at pos-1 in the refrence fasta file.

Version tested: 2.1.1

jodyphelan avatar Apr 01 '25 14:04 jodyphelan

Actually it may be that it is reporting pos+1. I tried v1.7.4 and outputs a different position:

# v2.2.1
ref.fa.ref	77	TGACCCGACC.ACGGACGGCG	A	G	40	PASS	NA	GT	0	1	1	1
# v1.7.4 
NZ_CP015278.1	76	TGACCCGACC.ACGGACGGCG	A	G	40	PASS	NA	GT	0	1	1	1

The reference allele seems to be correct with version 1.74

jodyphelan avatar Apr 01 '25 14:04 jodyphelan

Hi @jodyphelan I've fixed the position issue you described above. I also added the VCF format spec to the first line of the VCF:

##fileformat=VCFv4.2

which fixed the critical bcftools error. Note that bcftools will still throw some warnings, as the VCF is still lacking some descriptions:

[W::vcf_parse] Contig 'contig2' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse_info] INFO 'NA' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'GT' at contig2:55 is not defined in the header, assuming Type=String

These are due to harvest-tools and not parsnp, but we plan to get around this by using a different method to generate VCFs in a future release.

bkille avatar Apr 29 '25 21:04 bkille