cyvcf2 icon indicating copy to clipboard operation
cyvcf2 copied to clipboard

Crash reading valid but weird VCF

Open davmlaw opened this issue 3 years ago • 7 comments

Hi, I'm getting strange genotype calls and then a crash with my GATK VCF.

The zygosity call is strange, but the VCF passes validation.

I've reduced it down to a few lines (VCF at bottom of issue), so you can reproduce the issue.

import cyvcf2 

print(f"version: {cyvcf2.__version__}") 
reader = cyvcf2.Reader(open("./bad_vcf.vcf")) 
for v in reader: 
    print(v) 
    print(f"gt_types: {v.gt_types}") 

Output:

version: 0.30.4
chr7	42924353	rs552132089;rs771560161;rs869187830;rs370736797	CAA	.	57.23	PASS	.	GT:AD:DP:GQ:PGT:PID:PL:PS	.|.:0:1:.:0|1:42924353_CAAA_C:0:42924353

gt_types: [3]
chr13	37564190	rs1253322554	A	.	109.74	PASS	.	GT:AD:DP:GQ:PGT:PID:PL:PS	.|.:0:.:.:0|1:37564182_A_T:.:37564182

gt_types: [3]

Issue 1 - crash:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-c7bb21f9ef99> in <module>
      3 print(f"version: {cyvcf2.__version__}")
      4 reader = cyvcf2.Reader(open("./bad_vcf.vcf"))
----> 5 for v in reader:
      6     print(v)
      7     print(f"gt_types: {v.gt_types}")

/usr/local/lib/python3.8/dist-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.VCF.__next__()

Exception: error parsing variant with `htslib::bcf_read` error-code: 0

I think it's uninitialised memory as sometimes you get a different exception, ie:

[W::vcf_parse] Contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
	1	.	.	.	.	.	.

gt_types: []
[W::vcf_parse] Contig '
                       3D�' is not defined in the header. (Quick workaround: index the file with tabix.)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-c7bb21f9ef99> in <module>
      4 reader = cyvcf2.Reader(open("./bad_vcf.vcf"))
      5 for v in reader:
----> 6     print(v)
      7     print(f"gt_types: {v.gt_types}")
      8 

/usr/local/lib/python3.8/dist-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.Variant.__str__()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 5: invalid start byte

Issue 2

Dunno what GATK is doing here, but ALT is ".", and GT = ".|."

cyVCF calls the genotype as 3 (HOM ALT) - I think it should call this as 2 (UNKNOWN)

VCF

##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr13,length=114364328>
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    31_CoriellNA12878
chr7    42924353    rs552132089;rs771560161;rs869187830;rs370736797    CAA    .    57.23    PASS    .    GT:AD:DP:GQ:PGT:PID:PL:PS    .|.:0:1:.:0|1:42924353_CAAA_C:0:42924353
chr13    37564190    rs1253322554    A    .    109.74    PASS    .    GT:AD:DP:GQ:PGT:PID:PL:PS    .|.:0:.:.:0|1:37564182_A_T:.:37564182

davmlaw avatar Jan 13 '21 03:01 davmlaw

just use: reader = cyvcf2.VCF("./bad_vcf.vcf") and this problem goes away.

you might work if you open with "rb", but I never use that so haven't seen this.

brentp avatar Jan 14 '21 17:01 brentp

I think this may be something very insidious going on here.

The following works:

import cyvcf2
  
print(f"version: {cyvcf2.__version__}")

fh = open("./bad_vcf.vcf")
reader = cyvcf2.Reader(fh)
for v in reader:
    print(v)
    print(f"gt_types: {v.gt_types}")

I am not too familiar with CPython vs. Python, but could the fact we cache the file handle returned by open in a local variable cause the garbage collection to be delayed, whereas it is immediately garbage collected in your example?

nh13 avatar Jan 20 '21 09:01 nh13

Some quick googling found maybe a red herring, or maybe not: https://stackoverflow.com/a/8011863

nh13 avatar Jan 20 '21 09:01 nh13

oh, good catch, we probably need a weakref to the file handle.

brentp avatar Jan 20 '21 14:01 brentp

can also use a :

with open("./bad_vcf.vcf") as fh:
    ...

around the entire script so it doesn't get collected

brentp avatar Jan 20 '21 14:01 brentp

@brentp where do you store the weakref? VCF? HtsFile?

nh13 avatar Jan 21 '21 07:01 nh13

I think in the htsfile makes sense but don't have a strong preference.

brentp avatar Jan 21 '21 14:01 brentp