cyvcf2
cyvcf2 copied to clipboard
Crash reading valid but weird VCF
Hi, I'm getting strange genotype calls and then a crash with my GATK VCF.
The zygosity call is strange, but the VCF passes validation.
I've reduced it down to a few lines (VCF at bottom of issue), so you can reproduce the issue.
import cyvcf2
print(f"version: {cyvcf2.__version__}")
reader = cyvcf2.Reader(open("./bad_vcf.vcf"))
for v in reader:
print(v)
print(f"gt_types: {v.gt_types}")
Output:
version: 0.30.4
chr7 42924353 rs552132089;rs771560161;rs869187830;rs370736797 CAA . 57.23 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS .|.:0:1:.:0|1:42924353_CAAA_C:0:42924353
gt_types: [3]
chr13 37564190 rs1253322554 A . 109.74 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS .|.:0:.:.:0|1:37564182_A_T:.:37564182
gt_types: [3]
Issue 1 - crash:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-3-c7bb21f9ef99> in <module>
3 print(f"version: {cyvcf2.__version__}")
4 reader = cyvcf2.Reader(open("./bad_vcf.vcf"))
----> 5 for v in reader:
6 print(v)
7 print(f"gt_types: {v.gt_types}")
/usr/local/lib/python3.8/dist-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.VCF.__next__()
Exception: error parsing variant with `htslib::bcf_read` error-code: 0
I think it's uninitialised memory as sometimes you get a different exception, ie:
[W::vcf_parse] Contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
1 . . . . . .
gt_types: []
[W::vcf_parse] Contig '
3D�' is not defined in the header. (Quick workaround: index the file with tabix.)
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-4-c7bb21f9ef99> in <module>
4 reader = cyvcf2.Reader(open("./bad_vcf.vcf"))
5 for v in reader:
----> 6 print(v)
7 print(f"gt_types: {v.gt_types}")
8
/usr/local/lib/python3.8/dist-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.Variant.__str__()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 5: invalid start byte
Issue 2
Dunno what GATK is doing here, but ALT is ".", and GT = ".|."
cyVCF calls the genotype as 3 (HOM ALT) - I think it should call this as 2 (UNKNOWN)
VCF
##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr13,length=114364328>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 31_CoriellNA12878
chr7 42924353 rs552132089;rs771560161;rs869187830;rs370736797 CAA . 57.23 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS .|.:0:1:.:0|1:42924353_CAAA_C:0:42924353
chr13 37564190 rs1253322554 A . 109.74 PASS . GT:AD:DP:GQ:PGT:PID:PL:PS .|.:0:.:.:0|1:37564182_A_T:.:37564182
just use: reader = cyvcf2.VCF("./bad_vcf.vcf")
and this problem goes away.
you might work if you open with "rb"
, but I never use that so haven't seen this.
I think this may be something very insidious going on here.
The following works:
import cyvcf2
print(f"version: {cyvcf2.__version__}")
fh = open("./bad_vcf.vcf")
reader = cyvcf2.Reader(fh)
for v in reader:
print(v)
print(f"gt_types: {v.gt_types}")
I am not too familiar with CPython vs. Python, but could the fact we cache the file handle returned by open
in a local variable cause the garbage collection to be delayed, whereas it is immediately garbage collected in your example?
Some quick googling found maybe a red herring, or maybe not: https://stackoverflow.com/a/8011863
oh, good catch, we probably need a weakref to the file handle.
can also use a :
with open("./bad_vcf.vcf") as fh:
...
around the entire script so it doesn't get collected
@brentp where do you store the weakref? VCF? HtsFile?
I think in the htsfile makes sense but don't have a strong preference.