cycledash
PyVCF failing to parse Strelka VCF
cf. http://cycledash.demeter.hpc.mssm.edu/tasks/249
Traceback (most recent call last):
  File "/home/cycledash/cycledash/venv/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/cycledash/cycledash/venv/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/cycledash/cycledash/workers/genotype_extractor.py", line 49, in extract
    temporary_dir=TEMPORARY_DIR)
  File "/home/cycledash/cycledash/common/relational_vcf.py", line 169, in insert_genotypes_with_copy
    filename = vcf_to_csv(vcfreader, table_cols, None, **kwargs)
  File "/home/cycledash/cycledash/common/relational_vcf.py", line 128, in vcf_to_csv
    relations = records_to_relations(vcfdata, columns, **kwargs)
  File "/home/cycledash/cycledash/common/relational_vcf.py", line 101, in records_to_relations
    for record in records:
  File "/home/cycledash/cycledash/venv/lib/python2.7/site-packages/vcf/parser.py", line 567, in next
    samples = self._parse_samples(row[9:], fmt, record)
  File "/home/cycledash/cycledash/venv/lib/python2.7/site-packages/vcf/parser.py", line 438, in _parse_samples
    self.samples, samples, samp_fmt, samp_fmt._types, samp_fmt._nums, site)
  File "cparse.pyx", line 54, in vcf.cparse.parse_samples (vcf/cparse.c:1512)
ValueError: could not convert string to float:
According to @arahuja, Strelka also doesn't validate using https://github.com/EBIvariation/vcf-validator
Turns out Strelka doesn't provide data for all the samples it lists in the header; instead, it emits empty columns for the additional samples. PyVCF expects every sample column to be non-empty and to follow the declared FORMAT, so it fails when it tries to convert the empty string to a number (hence the ValueError above). Removing the columns corresponding to NORMAL.variant2 and TUMOR.variant2 resolves the problem:
$ cut -f1-10,12 -d$'\t' strelka.vcf > fixed_strelka.vcf
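If hard-coding the column numbers is too brittle, the same cleanup could be sketched in Python 3 by dropping any sample column that is empty in every record. This is only an illustration, not part of PyVCF or cycledash: the function name `drop_empty_sample_columns` is hypothetical, and it assumes the unwanted columns are empty in every data line and that sample columns start at the tenth tab-separated field.

```python
def drop_empty_sample_columns(lines):
    """Return VCF lines with sample columns that are empty in every record removed.

    Hypothetical helper: assumes tab-separated VCF text where the first
    nine columns are CHROM..FORMAT and the rest are per-sample columns.
    """
    lines = [line.rstrip("\n") for line in lines]
    # Rows that carry columns: the "#CHROM" header line plus data records.
    rows = [l.split("\t") for l in lines if l and not l.startswith("##")]
    records = [r for r in rows if not r[0].startswith("#")]
    if not records:
        return lines  # nothing to inspect, leave the file untouched

    width = max(len(r) for r in rows)
    # A sample column is dropped only if no record has a value in it.
    empty = {
        i for i in range(9, width)
        if all(i >= len(r) or r[i] == "" for r in records)
    }

    cleaned = []
    for line in lines:
        if not line or line.startswith("##"):
            cleaned.append(line)  # meta-information lines pass through as-is
            continue
        fields = line.split("\t")
        cleaned.append("\t".join(f for i, f in enumerate(fields) if i not in empty))
    return cleaned
```

To use it, read `strelka.vcf` into a list of lines, pass the list to the function, and write the result to a new file before handing it to PyVCF.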