cyvcf2 icon indicating copy to clipboard operation
cyvcf2 copied to clipboard

Bug: cyvcf2 does not raise an exception for an improperly formatted VCF

Open nh13 opened this issue 5 years ago • 10 comments
trafficstars

If the PS format field has type Integer but the values are strings, then cyvcf2 will skip the variant but not throw an exception. I would have expected, by default, that the underlying error to propagate up as an exception. I'd be ok with skipping the variant if I had set some option to be "lenient". If I change the type to String all works fine, but the silent error did cause some strange downstream results.

>>> from cyvcf2 import VCF as VcfReader
>>> reader = VcfReader("fail.vcf")
>>> list(reader)
[E::vcf_parse_format] Invalid character 'P' in 'PS' FORMAT field at 1:4001310
[]

You can see the VCF for the test above here:

`fail.vcf`
##fileformat=VCFv4.2
##CL=vcffilter -i filtered-phase-transfer.vcf.gz -o - --javascript "ensureFormatHeader(\"##FORMAT=<ID=PS,Number=1,Type=String,Description=\\\"Phase set for GT\\\">\"); function record() {if(INTEGRATION.GT==\"1/1\") { INTEGRATION.IPS=\".\"; INTEGRATION.PS=\"HOMVAR\"; INTEGRATION.GT=\"1|1\";} else {if((INTEGRATION.GT==\"0/1\" || INTEGRATION.GT==\"1/2\" || INTEGRATION.GT==\"2/1\" || INTEGRATION.GT==\"1/0\") ) {if(INTEGRATION.IPS.length>1) {INTEGRATION.PS=INTEGRATION.IPS; INTEGRATION.GT=INTEGRATION.IGT;} else {INTEGRATION.PS=\".\";};} else { if((INTEGRATION.IPS.length<2)) { INTEGRATION.IPS=\".\";} INTEGRATION.PS=\"PATMAT\";};};}"
##FILTER=<ID=GQlessthan70,Description="Sum of GQ for datasets with this genotype less than 70">
##FILTER=<ID=alleleimbalance,Description="Filtered calls where the net allele balance for unfiltered datasets is <0.2 or >0.8">
##FILTER=<ID=allfilteredanddisagree,Description="All callsets have this call filtered or outside the callable regions and they have discordant genotypes or variant calls">
##FILTER=<ID=allfilteredbutagree,Description="All callsets have this call filtered or outside the callable regions but they have the same genotype">
##FILTER=<ID=cgonly,Description="Filtered calls where only Complete Genomics had this call and it was completely missing from any other callset">
##FILTER=<ID=discordanthet,Description="Filtered calls where a passing call is het and a high GQ but filtered call is hom var, since often the het is wrong">
##FILTER=<ID=discordantunfiltered,Description="Callsets with unfiltered calls have discordant genotypes or variant calls">
##FILTER=<ID=overlappingcall,Description="Filtered sites that are within 50bp of another passing call but none of the callsets that support the 2 calls match">
##FILTER=<ID=questionableindel,Description="Filtered calls where some callsets have a filtered indel larger than 10bp and another dataset has an implied homozygous reference call">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Net allele depths across all unfiltered datasets with called genotype">
##FORMAT=<ID=ADALL,Number=R,Type=Integer,Description="Net allele depths across all datasets">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total read depth summed across all datasets, excluding MQ0 reads">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Net Genotype quality across all datasets, calculated from GQ scores of callsets supporting the consensus GT, using only one callset from each dataset">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Consensus Genotype across all datasets with called genotype">
##FORMAT=<ID=IGT,Number=1,Type=String,Description="Original input genotype">
##FORMAT=<ID=IPS,Number=1,Type=String,Description="Phase set for IGT">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##INFO=<ID=DPSum,Number=1,Type=Integer,Description="Total read depth summed across all datasets, excluding MQ0 reads">
##INFO=<ID=arbitrated,Number=1,Type=String,Description="TRUE if callsets had discordant calls so that arbitration was needed.">
##INFO=<ID=callable,Number=.,Type=String,Description="List of callsets that had this call in a region with low coverage of high MQ reads.">
##INFO=<ID=callsetnames,Number=.,Type=String,Description="Names of callsets that called this genotype, whether filtered or not">
##INFO=<ID=callsets,Number=1,Type=Integer,Description="Number of different callsets that called this genotype, whether filtered or not">
##INFO=<ID=callsetwithotheruniqgenopassing,Number=.,Type=String,Description="Callset that uniquely calls a PASSing genotype different from GT when 2+ PASSing callsets support the genotype in GT.">
##INFO=<ID=callsetwiththisuniqgenopassing,Number=.,Type=String,Description="Callset that uniquely calls the PASSing genotype in GT when 2+ PASSing callsets support a different genotype.">
##INFO=<ID=datasetnames,Number=.,Type=String,Description="Names of datasets for which at least one callset called this genotype, whether filtered or not">
##INFO=<ID=datasets,Number=1,Type=Integer,Description="Number of different datasets for which at least one callset called this genotype, whether filtered or not">
##INFO=<ID=datasetsmissingcall,Number=.,Type=String,Description="Names of datasets that are missing a call or have an incorrect call at this location, and the high-confidence call is a variant">
##INFO=<ID=difficultregion,Number=.,Type=String,Description="List of difficult region bed files containing this call.">
##INFO=<ID=filt,Number=.,Type=String,Description="List of callsets that had this call filtered.">
##INFO=<ID=platformbias,Number=.,Type=String,Description="Names of platforms that have reads containing a variant at this location, but the high-confidence call is homozygous reference, indicating that there is a potential bias.">
##INFO=<ID=platformnames,Number=.,Type=String,Description="Names of platforms for which at least one callset called this genotype, whether filtered or not">
##INFO=<ID=platforms,Number=1,Type=Integer,Description="Number of different platforms for which at least one callset called this genotype, whether filtered or not">
##INFO=<ID=varType,Number=1,Type=String,Description="Type of variant">
##RUN-ID=16dacf15-fdc9-4199-84bd-723ea8bcddef
##contig=<ID=1,length=248956422>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG001
1	4001310	.	G	T	50	PASS	callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;callsets=5;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;datasets=4;datasetsmissingcall=IonExome,SolidPE50x50bp;platformnames=Illumina,CG,10X,Solid;platforms=4	GT:DP:ADALL:AD:GQ:IGT:IPS:PS	0|1:571:133,124:160,157:1011:0|1:4001310_G_T:PATMAT

You can see the original GRCh38 NA12878 VCF below. You'll see that the type for the PS format field is set as String, which is not valid according to the VCF spec (sigh). I'll have to report that some where. But even so, I'd expected the parsing error to be propagated.

Getting the original VCF
bcftools view https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz chr1:4001310-4001310 | grep PS
##CL=vcffilter -i filtered-phase-transfer.vcf.gz -o - --javascript "ensureFormatHeader(\"##FORMAT=<ID=PS,Number=1,Type=String,Description=\\\"Phase set for GT\\\">\"); function record() {if(INTEGRATION.GT==\"1/1\") { INTEGRATION.IPS=\".\"; INTEGRATION.PS=\"HOMVAR\"; INTEGRATION.GT=\"1|1\";} else {if((INTEGRATION.GT==\"0/1\" || INTEGRATION.GT==\"1/2\" || INTEGRATION.GT==\"2/1\" || INTEGRATION.GT==\"1/0\") ) {if(INTEGRATION.IPS.length>1) {INTEGRATION.PS=INTEGRATION.IPS; INTEGRATION.GT=INTEGRATION.IGT;} else {INTEGRATION.PS=\".\";};} else { if((INTEGRATION.IPS.length<2)) { INTEGRATION.IPS=\".\";} INTEGRATION.PS=\"PATMAT\";};};}"
##INFO=<ID=DPSum,Number=1,Type=Integer,Description="Total read depth summed across all datasets, excluding MQ0 reads">
##FORMAT=<ID=IPS,Number=1,Type=String,Description="Phase set for IGT">
##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT">
chr1	4001310	.	G	T	50	PASS	platforms=4;platformnames=Illumina,CG,10X,Solid;datasets=4;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;callsets=5;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;datasetsmissingcall=IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable	GT:DP:ADALL:AD:GQ:IGT:IPS:PS	0|1:571:133,124:160,157:1011:0|1:4001310_G_T:PATMAT

nh13 avatar Aug 21 '20 21:08 nh13

hmm. yeah. this needs to be addressed here: https://github.com/brentp/cyvcf2/blob/2dd88453ff4ec619311f860a2df4651f8b2ccf04/cyvcf2/cyvcf2.pyx#L572-L577

htslib returns -1 from bcf_read in this case, and prints a message. cyvcf2 currently just stops iteration.

brentp avatar Aug 21 '20 22:08 brentp

@brentp how about a param to cyvcf2 to either 'stop', 'keep going', or 'throw an exception'? Similar to picard's strict, lenient, or silent.

nh13 avatar Aug 21 '20 22:08 nh13

So it looks like the way to handle this in cyvcf2 is to check the errcode which gets set when there is a problem.

It is set to one of these possible values AFAICT: https://github.com/samtools/htslib/blob/4162046b28a7d9d8a104ce28086d9467cc05c212/htslib/vcf.h#L193-L199

I think we can ignore invalid contig and raise exception on others.

brentp avatar Aug 21 '20 22:08 brentp

i think it's simpler for the user if it fails on a < 0 error except those related to undefined contigs.

brentp avatar Aug 21 '20 22:08 brentp

closed in #162

brentp avatar Aug 25 '20 21:08 brentp

Should this be closed here?

dbolser avatar Sep 26 '23 08:09 dbolser

@nh13 can you add a very simple fail.vcf to the tests and check that it fails correctly?

dbolser avatar Sep 26 '23 08:09 dbolser

@dbolser would you like to reopen this issue, and assign it to me?

nh13 avatar Sep 26 '23 18:09 nh13

@dbolser would you like to reopen this issue, and assign it to me?

Hi @nh13, thanks for getting back to me on this.

I'm not actually a project admin, I just like opensource software community development, so I don't feel shy bossing people about ;-)

@brentp, can you reopen this issue and assign it to @nh13 please?

@nh13, please let us know if you have any problem figuring out how to add the test into the existing framework and how to submit a PR.

Cheers both!

dbolser avatar Sep 27 '23 09:09 dbolser

Thanks, I’m unlikely to get to this anytime soon, so if others want to add the test VCF before I do, I won’t be offended.

nh13 avatar Sep 27 '23 14:09 nh13