PyVCF failing to parse Strelka VCF

Open ihodes opened this issue 9 years ago • 2 comments

cf. http://cycledash.demeter.hpc.mssm.edu/tasks/249

Traceback (most recent call last):
  File "/home/cycledash/cycledash/venv/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/cycledash/cycledash/venv/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/cycledash/cycledash/workers/genotype_extractor.py", line 49, in extract
    temporary_dir=TEMPORARY_DIR)
  File "/home/cycledash/cycledash/common/relational_vcf.py", line 169, in insert_genotypes_with_copy
    filename = vcf_to_csv(vcfreader, table_cols, None, **kwargs)
  File "/home/cycledash/cycledash/common/relational_vcf.py", line 128, in vcf_to_csv
    relations = records_to_relations(vcfdata, columns, **kwargs)
  File "/home/cycledash/cycledash/common/relational_vcf.py", line 101, in records_to_relations
    for record in records:
  File "/home/cycledash/cycledash/venv/lib/python2.7/site-packages/vcf/parser.py", line 567, in next
    samples = self._parse_samples(row[9:], fmt, record)
  File "/home/cycledash/cycledash/venv/lib/python2.7/site-packages/vcf/parser.py", line 438, in _parse_samples
    self.samples, samples, samp_fmt, samp_fmt._types, samp_fmt._nums, site)
  File "cparse.pyx", line 54, in vcf.cparse.parse_samples (vcf/cparse.c:1512)
ValueError: could not convert string to float: 

ihodes avatar Mar 26 '15 22:03 ihodes

According to @arahuja, Strelka's output also fails validation with https://github.com/EBIvariation/vcf-validator

ihodes avatar Oct 12 '15 15:10 ihodes

It turns out Strelka doesn't provide data for every sample it lists in the header; it creates empty columns for the additional samples instead. PyVCF expects each sample column to be non-empty and to follow the declared format, so parsing fails. Removing the columns for NORMAL.variant2 and TUMOR.variant2 resolves the problem:

$ cut -f1-10,12 -d$'\t' strelka.vcf > fixed_strelka.vcf
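For anyone who would rather not hard-code column positions, here is a rough Python sketch of the same workaround that drops sample columns by name. The function name `drop_sample_columns` is hypothetical (it is not part of PyVCF or cycledash), and it assumes a tab-delimited VCF whose `#CHROM` header line lists the sample names after the nine fixed columns:

```python
def drop_sample_columns(lines, samples_to_drop):
    """Yield VCF lines with the named sample columns removed.

    Hypothetical helper, not part of PyVCF. Assumes tab-delimited input
    with a #CHROM header line that precedes the data lines.
    """
    drop = set(samples_to_drop)
    keep = None  # column indices to keep, set once the #CHROM header is seen
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines pass through unchanged.
            yield line
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed (CHROM..FORMAT); samples follow.
            keep = [i for i, name in enumerate(fields)
                    if i < 9 or name not in drop]
        if keep is None:
            raise ValueError("data line seen before the #CHROM header")
        # Guard against rows that are shorter than the header.
        yield "\t".join(fields[i] for i in keep if i < len(fields))
```

It could be used along these lines, with the file names taken from the command above:

```python
with open("strelka.vcf") as src, open("fixed_strelka.vcf", "w") as dst:
    for out in drop_sample_columns(src, ["NORMAL.variant2", "TUMOR.variant2"]):
        dst.write(out + "\n")
```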

armish avatar Nov 16 '15 22:11 armish