
[WIP] CNVkit tool definitions

Open anton-khodak opened this issue 9 years ago • 8 comments

Standing PR to add tool descriptions (created by argparse2cwl) and tests for CNVkit tools.

Issues I encountered in the first steps:

  • .gitignore prevents adding bioinformatics files, i.e. test data, to the repository. How is it supposed to be tested, then?
  • Since I have no experience with bioinformatics yet, I don't know which data to use for running the tools. I used arbitrary files with the proper extensions that I downloaded from the Internet, but that approach doesn't work. For example, I got an error while running:
$ cnvkit.py batch --processes 1 \
    --normal test-files/s5DE199B-D6AF-C6EC-678A-DEC1179D1B97.fastq \
    --fasta test-files/cnvkit-batch/ERCC92.fa \
    --targets test-files/InfiniumPsychArray-24v1-1_A1.bed \
    --annotate test-files/cnvkit-batch/refFlat.txt \
    --split --access test-files/InfiniumPsychArray-24v1-1_A1.bed \
    --output-dir . --scatter --diagram
Detected file format: BED
Applying annotations as target names
Splitting large targets
Traceback (most recent call last):
  File "/usr/local/bin/cnvkit.py", line 11, in <module>
    args.func(args)
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/commands.py", line 96, in _cmd_batch
    args.processes, args.count_reads)
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/commands.py", line 138, in batch_make_reference
    else {}))
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/commands.py", line 327, in do_targets
    ['chromosome', 'start', 'end', 'name'])
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/gary.py", line 66, in from_rows
    table = pd.DataFrame.from_records(rows, columns=columns)
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py", line 939, in from_records
    first_row = next(data)
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 287, in split_targets
    for chrom, start, end, name in region_rows:
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 21, in assign_names
    ref_genes = read_refflat_genes(refflat_fname)
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 80, in read_refflat_genes
    name, _rx, chrom, strand, start, end, _ex = parse_refflat_line(line)
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 133, in parse_refflat_line
    assert len(exons) == int(exon_count), (
TypeError: object of type 'zip' has no len()

I suspect this error is caused by unsuitable input data.

Also, I couldn't find any sample copy number reference profile files (.cnn) at all. If somebody who uses CNVkit frequently could give me a hint about where to get proper data, it would make testing much easier.

  • I couldn't find where the tool-name-test.yaml file format is specified. It was intuitively clear what to write there, but I would appreciate a pointer to the standard for those files.
  • I haven't worked with Docker images before, so I need to spend some time learning how to write Dockerfiles.
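
A note on the TypeError at the end of the traceback above: the message "object of type 'zip' has no len()" is what Python 3 raises whenever len() is called on the result of zip(), because zip() returns a lazy iterator there (in Python 2 it returned a list). Since the traceback paths show python3.4, the failure may not depend on the input data at all. Below is a minimal sketch of the difference. The helper names are hypothetical and the field layout (comma-separated exon starts and ends with a trailing comma, as in refFlat rows) is an assumption; this is not CNVkit's actual code.

```python
def parse_exons(exon_starts, exon_ends):
    """Pair exon start/end fields from one refFlat-style row.

    Hypothetical helper: assumes comma-separated integers with an
    optional trailing comma, e.g. "100,300,".
    """
    starts = [int(s) for s in exon_starts.rstrip(",").split(",")]
    ends = [int(e) for e in exon_ends.rstrip(",").split(",")]
    # list() materializes the pairs, so len() works on Python 2 and 3.
    return list(zip(starts, ends))


def len_of_lazy_zip(starts, ends):
    """Return the error message raised by len(zip(...)), or None if it succeeds."""
    try:
        len(zip(starts, ends))
    except TypeError as exc:
        return str(exc)
    return None


exons = parse_exons("100,300,", "200,400,")
exon_count = "2"
# The same kind of check as in the traceback now passes.
assert len(exons) == int(exon_count)

# On Python 3 this prints the message seen in the traceback.
print(len_of_lazy_zip([100, 300], [200, 400]))
```

If this is indeed the cause, the fix belongs in the tool rather than in the test data, and the same command would fail on Python 3 with any refFlat file.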

anton-khodak, May 31 '16 17:05

@anton-khodak Did you look at https://travis-ci.org/common-workflow-language/workflows/builds/134234968 ?

mr-c, Jun 08 '16 13:06

I think it is fine to just check in the generated descriptions; don't worry about writing a specific test. As long as the generated output parses, that is good enough for now.

mr-c, Jun 08 '16 13:06

I'm totally with @mr-c: we should focus first on CWL, not on specific tools, since the amount of work can be quite substantial. If you want to see whether one of the CNVkit subtools works, it's fine to dedicate some focused effort to that, but by no means aim to cover the whole suite of tools.

Hope that makes sense ;)

brainstorm, Jun 10 '16 13:06

OTOH, for a good example of how to test different tools (in my case SV callers), MetaSV has it quite well wrapped up:

https://github.com/bioinform/metasv

But this is just an example, don't spend too much time looking through it.

brainstorm, Jun 10 '16 13:06

@brainstorm, that's great! I misinterpreted the goal of the PR: it was not to pass Travis checks but merely to validate those tools. In that case, I'll fix the job file (@mr-c indirectly pointed out that issue) and push all the other tools.

UPD. I should have looked more closely at test/cwltest.py... Travis CI only checks that the tools are valid, not how they execute (with or without errors).

anton-khodak, Jun 10 '16 13:06

Hi guys, I'm happy to help with testing CNVkit and/or tweaking the test suite to play better with argparse2cwl. You can skip wrapping anything marked "deprecated" (e.g. loh, genome2access); those parts will be removed in the next release. Just let me know if you need anything else.

etal, Jun 21 '16 14:06

@etal, very happy to have you help Anton with that. I was looking at the outputs generated by argparse2cwl yesterday, but since I have never used CNVkit before, I'm missing a few bits of domain expertise there, so help is super welcome, thanks!

brainstorm, Jun 21 '16 15:06

I've released a new minor version of CNVkit that drops the deprecated parts and introduces a few new options. I think the current CWL wrappers in Anton's repo should still work, but batch has a new --method option that's worth exposing. Let me know if there's anything else I can do to help complete and maintain these wrappers.

etal, Sep 19 '16 18:09