[WIP] CNVkit tool definitions
Standing PR to add tool descriptions (created by argparse2cwl) and tests for CNVkit tools.
Issues I encountered in the first steps:
- .gitignore prevents adding bioinformatics files, such as test data, to the repository. How are the tools supposed to be tested then?
- Since I have no experience with bioinformatics yet, I don't know which data to use for running the tools. I used random files with the proper extensions that I downloaded from the Internet, but that approach doesn't work. For example, I got an error while running:
$ cnvkit.py batch --processes 1
--normal test-files/s5DE199B-D6AF-C6EC-678A-DEC1179D1B97.fastq
--fasta test-files/cnvkit-batch/ERCC92.fa
--targets test-files/InfiniumPsychArray-24v1-1_A1.bed
--annotate test-files/cnvkit-batch/refFlat.txt
--split --access test-files/InfiniumPsychArray-24v1-1_A1.bed
--output-dir . --scatter --diagram
Detected file format: BED
Applying annotations as target names
Splitting large targets
Traceback (most recent call last):
File "/usr/local/bin/cnvkit.py", line 11, in <module>
args.func(args)
File "/usr/local/lib/python3.4/dist-packages/cnvlib/commands.py", line 96, in _cmd_batch
args.processes, args.count_reads)
File "/usr/local/lib/python3.4/dist-packages/cnvlib/commands.py", line 138, in batch_make_reference
else {}))
File "/usr/local/lib/python3.4/dist-packages/cnvlib/commands.py", line 327, in do_targets
['chromosome', 'start', 'end', 'name'])
File "/usr/local/lib/python3.4/dist-packages/cnvlib/gary.py", line 66, in from_rows
table = pd.DataFrame.from_records(rows, columns=columns)
File "/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py", line 939, in from_records
first_row = next(data)
File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 287, in split_targets
for chrom, start, end, name in region_rows:
File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 21, in assign_names
ref_genes = read_refflat_genes(refflat_fname)
File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 80, in read_refflat_genes
name, _rx, chrom, strand, start, end, _ex = parse_refflat_line(line)
File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 133, in parse_refflat_line
assert len(exons) == int(exon_count), (
TypeError: object of type 'zip' has no len()
I think this error might be caused by irrelevant data.
Also, I couldn't find any sample copy number reference profile files (.cnn) at all. If somebody who uses CNVkit frequently could give me a hint about where to get proper data, it would make my testing work much easier.
- I didn't find where the tool-name-test.yaml file format is specified. It was intuitively clear what to write there, but I wish somebody would point me to the standard for those files.
- I haven't worked with Docker images before, so I need to spend some time learning how to write Dockerfiles.
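A side note on the traceback above: `TypeError: object of type 'zip' has no len()` is what Python 3 raises when `len()` is called on the result of `zip()`, because `zip()` returns a lazy iterator there instead of a list. So the failure may come from running under Python 3.4 and not only from the input data. The sketch below reproduces the pattern with a hypothetical helper; `parse_exons` and its sample values are made up for illustration and are not CNVkit's actual code.

```python
def zip_supports_len():
    """Return True if len() works on a zip object (it does not on Python 3)."""
    try:
        len(zip([1], [2]))
    except TypeError:
        return False
    return True


def parse_exons(exon_count, exon_starts, exon_ends):
    """Hypothetical helper: pair refFlat-style exon starts with exon ends.

    exon_starts and exon_ends are comma-separated strings with a trailing
    comma, as in the UCSC refFlat format.
    """
    starts = exon_starts.rstrip(",").split(",")
    ends = exon_ends.rstrip(",").split(",")
    # Materialize the iterator so that len() works on Python 2 and 3 alike.
    exons = list(zip(starts, ends))
    assert len(exons) == int(exon_count), "exon count mismatch"
    return [(int(start), int(end)) for start, end in exons]


if __name__ == "__main__":
    print("len(zip(...)) works:", zip_supports_len())
    print(parse_exons("2", "10,30,", "20,40,"))
```

Wrapping the `zip()` call in `list()` is the usual way to make this kind of code behave the same under both Python versions.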
@anton-khodak Did you look at https://travis-ci.org/common-workflow-language/workflows/builds/134234968 ?
I think it is fine to just check in the generated descriptions; don't worry about writing a specific test. As long as the generated output parses, that is good enough for now.
I'm totally with @mr-c, we should focus first on CWL, not on specific tools, since the amount of work can be quite substantial. If you want to see whether one of the CNVkit subtools works, it's fine to dedicate some focused effort to it, but by no means should you aim to cover the whole suite of tools.
Hope that makes sense ;)
OTOH, for a good example of how to test different tools (in my case SV callers), MetaSV has it quite well wrapped up:
https://github.com/bioinform/metasv
But this is just an example, don't spend too much time looking through it.
@brainstorm, that's great! I misinterpreted the goal of the PR: it was not to pass Travis checks but merely to validate those tools. In that case, I'll fix the job file (@mr-c indirectly pointed out that issue) and push all the other tools.
UPD. I should have looked more closely at test/cwltest.py... Travis CI only checks that the tool descriptions are valid, not whether they execute with or without errors.
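For anyone who wants to run a validity-only check locally, here is a minimal sketch. It assumes `cwltool` is installed and on the PATH and uses its `--validate` option; the helper names and the `tools` directory are hypothetical, and this does not reproduce what test/cwltest.py does.

```python
import subprocess
from pathlib import Path


def find_cwl_files(root):
    """Collect all *.cwl files below root, in a stable order."""
    return sorted(Path(root).rglob("*.cwl"))


def build_validate_command(cwl_path):
    """Build the command line that asks cwltool to validate one document."""
    return ["cwltool", "--validate", str(cwl_path)]


def validate_all(root):
    """Return the descriptions that cwltool rejects (assumes cwltool is installed)."""
    failures = []
    for path in find_cwl_files(root):
        result = subprocess.run(
            build_validate_command(path),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        # A non-zero exit status is treated as a validation failure.
        if result.returncode != 0:
            failures.append(path)
    return failures


if __name__ == "__main__":
    # "tools" is a placeholder directory name.
    for failed in validate_all("tools"):
        print("invalid:", failed)
```

This only confirms that each description parses and validates; it says nothing about whether the wrapped tool runs correctly on real data.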
Hi guys, I'm happy to help with testing CNVkit and/or tweaking the test suite to play better with argparse2cwl. You can skip wrapping anything marked "deprecated" (e.g. loh, genome2access); those parts will be removed in the next release. Just let me know anything else you need.
@etal, very happy to have you help Anton with that. I was looking at the outputs generated by argparse2cwl yesterday but since I never used CNVkit before, I'm missing a few bits of domain expertise there, so help is super welcome, thanks!
I've released a new minor version of CNVkit that drops the deprecated parts and introduces a few new options. I think the current CWL wrappers in Anton's repo should still work, but batch has a new --method option that's worth exposing. Let me know if there's anything else I can do to help complete and maintain these wrappers.