bedtools2 icon indicating copy to clipboard operation
bedtools2 copied to clipboard

invalid records when processing vcf file

Open ghost opened this issue 4 years ago • 6 comments

Hello, I have vcf files generated from bcftools convert --gvcf2vcf and this raises an issue

bedtools coverage -a A_vaga.NDPD.bed -b ARC_ancestor.expanded.g.vcf.gz -counts                      
Error: Invalid record in file ARC_ancestor.expanded.g.vcf.gz. Record is 
Chrom_3	1	.	A	<*>	0	.	.	GT:GQ:MIN_DP:PL	0/0:1:0:0,0,0

Here is the header of the generated vcf

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##contig=<ID=Chrom_3,length=20354777>
##contig=<ID=Chrom_1,length=18146847>
##contig=<ID=Chrom_5,length=16930519>
##contig=<ID=Chrom_2,length=16274841>
##contig=<ID=Chrom_4,length=15224634>
##contig=<ID=Chrom_6,length=13893210>
##bcftools_convertVersion=1.10.2-27-g9d66868+htslib-1.10.2-32-g84f0a86
##bcftools_convertCommand=convert -f ../A_vaga.NDPD.fasta --gvcf2vcf ARC_ancestor.g.vcf.gz; Date=Tue Mar 17 12:02:46 2020
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	ARC_ancestor

I suppose Bedtools doesn't like the empty FILTER and INFO columns. As far as I can tell the vcf file seems to be a valid vcf.

ghost avatar Mar 19 '20 15:03 ghost

What version of bedtools is this?

arq5x avatar Apr 21 '20 22:04 arq5x

I'm seemingly getting the same issue with the latest git master of bedtools and a vcf file generated with bcftools mpileup 1.9. I've attached the file here. The error I get is:

Error: Invalid record in file WTCHG_230414_256227_ERCC.vcf.gz. Record is 
ERCC-00002      1       .       T       <*>     0       .       DP=144;AD=106,0;I16=106,0,0,0,5217,305911,0,0,6360,381600,0,0,1173,15191,0,0;QS=1,0;MQ0F=0      PL      0,255,255

WTCHG_230414_256227_ERCC.vcf.gz

DaGaMs avatar Jun 03 '20 12:06 DaGaMs

Ah! It's the weird way in which bcftools encodes the symbolic allele. The <*> causes the error message. Maybe something to fix though, since most raw bcftools output includes these "alleles".

DaGaMs avatar Jun 03 '20 12:06 DaGaMs

It seems bedtools expecting either SVLEN or END, but the VCF doesn't have neither of them. I suspect the VCF is used to represent per-base records, thus we could potentially make bedtools understand this convention.

My question is is <*> means the record is per-base ?

38 avatar Jun 03 '20 17:06 38

From a discussion at https://www.biostars.org/p/279971, I see that there is a reference to the newer VCF spec, which states:

.5 Representing unspecified alleles and REF-only blocks (gVCF)

In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format† . The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele. Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise). A symbolic alternate allele <*> is used to represent this unspecified alternate allele

Basically, newer versions of bcftools and samtools report the potential for an alternate allele at every position with the <*> placeholder. After bcftools call, they disappear. But a lot of people (like me) just want to use mpileup to get a quick list of all variant positions, and so we have to deal with the '<*>' manually. It'd be nice to not have to do some sed magic every time, so if this could be added as a valid allele symbol (which it is) in bedtools, then that would be awesome.

DaGaMs avatar Jun 03 '20 20:06 DaGaMs

So, I read this as interpret <> as length of 1 nucleotide if there is no END tag in the INFO field. Otherwise, if END is present in the INFO field and <> is present, then the length of the interval for the VCF record is (END-POS)+1. Agree?

arq5x avatar Jun 04 '20 16:06 arq5x