Testing of pipeline output
(Moving discussion from slack)
tl;dr: Currently, nf-core pipelines merely check that all the processes run without errors, but do not verify that outputs match some expected output, as unit testing would. This thread is to address the question: how can we change that?
Related:
- https://github.com/nextflow-io/nf-hack18/issues/7
- https://github.com/nf-core/tools/issues/209
Full discussion:
@stevekm:
is there an example of the pipeline-outputs testing? Our group is gonna start working on something similar and I didn't want to reinvent the wheel too much. I was gonna end up just using the basic Python `unittest` suite to run the pipeline and then compare output files, either against saved fixtures or just checking for certain attributes. But I wasn't sure if there were other methods in use for that kind of thing, probably similar to these: https://github.com/stevekm/nextflow-ci
@drpatelh :
Don't think we have anything in place yet to check module outputs. Currently we're only checking whether the module itself is working within NF using a minimal test dataset. Some examples here: https://github.com/nf-core/modules/pulls It'll be tricky testing for outputs because there will need to be some sort of tolerance to deal with outputs changing between tool releases.
@ewels
Any inputs or suggestions welcome @stevekm! In short we still need to set this up. Even for the full pipeline runs it would be nice to check for (a) the existence of files after the run and (b) certain attributes. This has been in the back of my mind for ages but I've never got around to figuring out the best way to do it. My preferred implementation would be to have it as a sub-tool of the nf-core tools I think, that's where we keep everything else like this. So if you're going to write something anyway, would be super nice if it could be part of that so that we can use it too :slightly_smiling_face:
@stevekm:
oh unfortunately, whatever I write for work in the short term is gonna be CWL-centric, since that's the chosen framework here for the time being... :grimacing: If I do come up with anything useful though, I will definitely share
... a few days later ...
@olgabot
Hello all! I'm not sure where to put this, but I had a question regarding testing. Currently, `nextflow run -profile test,docker .` is mostly a "smoke test" to check whether all the processes run without errors. But there are no proper "unit tests" to check that the output was actually correct. I recently ran into this problem because we merged a PR that created some empty output files, but the processes still ran fine. There's this discussion from Nov 2018: https://github.com/nextflow-io/nf-hack18/issues/7 - has there been any progress on making unit-test-type things for nf-core pipelines?
@ewels
Um, see the discussion about ten messages up.. :point_up: :grin: (No, but it would be good to put something simple in place - suggestions for appropriate frameworks welcome)
@olgabot
Haha thanks, my bad for not seeing it. My personal preference would be to have it be part of nf-core rather than Nextflow, because I'd prefer to write Python over Groovy - but that's just me. What can be problematic is that many tools have some element of stochasticity (e.g. STAR), so either a random seed would need to be set, or the test would need to tolerate some amount of difference. The simplest thing I can think of would be to have several example "results" folders hosted somewhere and assert that the diff is zero - but that's the lazy version of me talking. The OCD version of me says it should use assert statements in Python like @stevekm has here: https://github.com/stevekm/nextflow-ci/blob/master/test_flagstat.py and add a couple of tests to at least assert that certain files are non-empty, and then check their contents (non-emptiness being implicit in the content check).
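The "non-empty plus contents" check described above can be sketched in a few lines of Python; the file paths and expected strings here are purely hypothetical examples:

```python
from pathlib import Path


def assert_output_ok(results_dir: str) -> None:
    """Minimal output checks: files exist, are non-empty, and contain expected text."""
    results = Path(results_dir)

    # Hypothetical expected outputs, for illustration only
    expected = {
        "fastqc/sample_fastqc.html": "FastQC Report",
        "multiqc/multiqc_report.html": "MultiQC",
    }

    for relpath, must_contain in expected.items():
        path = results / relpath
        assert path.is_file(), f"Missing output file: {path}"
        text = path.read_text(errors="replace")
        assert len(text) > 0, f"Output file is empty: {path}"
        assert must_contain in text, f"{path} does not contain {must_contain!r}"
```

A pytest run would simply call `assert_output_ok("results")` after the pipeline finishes.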
@ewels
Let’s switch to #tools
From @ewels
@olgabot - about checking output files. Originally I was thinking that we could have a YAML file for each pipeline that lists all of the expected output filenames. The minimal test would be to just check that each exists, but we could have keys under each file for different attributes, for example `not_empty`, `filesize`, `num_lines`. Each of those could have a tolerance set too - e.g. `num_lines` could have a tolerance of 6 or whatever. Then we could have `nf-core check_test_results` or something, which takes a results folder and a config file and loops through each one. That would be very simple to write but fairly effective, I think. However, after the discussion on #general, I'm wondering if this is reinventing the wheel and whether there's already a (Python) testing framework with this kind of functionality which we could use.
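That check loop could be sketched roughly as follows. The YAML-loading step is omitted, so `expected` is passed as an already-parsed dict; the attribute keys mirror the `not_empty` / `filesize` / `num_lines` idea and are illustrative, not an agreed format:

```python
from pathlib import Path


def check_results(results_dir: str, expected: dict) -> list:
    """Loop over expected output files and validate simple attributes.

    `expected` maps relative file paths to attribute dicts, e.g. what a
    hypothetical assets/test_outputs.yaml might parse into:
    {"fastqc/sample.html": {"not_empty": True, "num_lines": 100, "num_lines_tolerance": 6}}
    Returns a list of human-readable failures (empty list == pass).
    """
    failures = []
    for relpath, attrs in expected.items():
        path = Path(results_dir) / relpath
        if not path.is_file():
            failures.append(f"missing: {relpath}")
            continue
        if attrs.get("not_empty") and path.stat().st_size == 0:
            failures.append(f"empty: {relpath}")
        if "filesize" in attrs:
            tol = attrs.get("filesize_tolerance", 0)
            if abs(path.stat().st_size - attrs["filesize"]) > tol:
                failures.append(f"filesize out of tolerance: {relpath}")
        if "num_lines" in attrs:
            tol = attrs.get("num_lines_tolerance", 0)
            n_lines = sum(1 for _ in path.open())
            if abs(n_lines - attrs["num_lines"]) > tol:
                failures.append(f"num_lines out of tolerance: {relpath}")
    return failures
```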
From @olgabot:
At least for tables/dataframes, pandas has a nice `pandas.util.testing.assert_almost_equal` that lets you set a tolerance. Same with NumPy.
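NumPy's equivalent, for example, compares arrays within a configurable relative/absolute tolerance; the numbers here are made-up stand-ins for something like gene counts from two runs of a stochastic tool:

```python
import numpy as np

# Two hypothetical result vectors: identical up to small numerical noise
expected = np.array([100.0, 250.0, 7.5])
observed = np.array([100.0001, 249.9998, 7.5001])

# Passes as long as values agree within the given relative tolerance;
# raises AssertionError with a readable diff otherwise
np.testing.assert_allclose(observed, expected, rtol=1e-4)
```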
I've had a very brief look around and can't find much existing stuff for testing output files (please post suggestions if you know any!). So I'm getting increasingly keen on writing something to do this. To expand on my slack message above, I could see this working as follows:
New `nf-core` tools check command

```
nf-core check-results <pipeline> <output-folder>
```

- We have a standard place to keep the output metadata config file in the pipeline. Maybe `assets/test_outputs.yaml`?
- Other files can exist as well, but this one should correspond to running with just `-profile test`
- `<pipeline>` could be the name of an nf-core pipeline with this file, a directory containing a pipeline that has this file, or a path to a metadata file
- `<output-folder>` should be the `publishDir` results folder for a pipeline run
Metadata syntax
Could look something like this as a rough example (brainstorming here, suggestions welcome):
Simplest possible - just filenames:

```yaml
- fastqc:
    - SRR389222_sub1_fastqc.html
    - SRR389222_sub2_fastqc.html
- zips:
    - SRR389222_sub1_fastqc.zip
    - SRR389222_sub2_fastqc.zip
```
Adding in a few more metadata attributes to check:

```yaml
- fastqc:
    - SRR389222_sub1_fastqc.html:
        mimetype: text/html
    - SRR389222_sub2_fastqc.html:
        mimetype: text/html
        content: '<tr><td>Sequence length</td><td>36</td></tr>'
- zips:
    - SRR389222_sub1_fastqc.zip:
        mimetype: application/zip
        filesize: 512694
        filesize_tolerance: 1000
    - SRR389222_sub2_fastqc.zip:
        mimetype: application/zip
        filesize: 512306
        filesize_tolerance_pct: 5
```
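One small subtlety of a layout like the example above is that each directory is a YAML list of single-key mappings, and each entry is either a bare filename or a filename-to-attributes mapping. Flattening that into `path -> attributes` pairs might look like this (a plain Python structure stands in for the parsed YAML, and the structure itself is just the brainstormed proposal, not a settled format):

```python
# Parsed form of the YAML sketch: a list of single-key dicts, where each
# directory maps to a list of filenames or {filename: attrs} dicts.
parsed = [
    {"fastqc": [
        "SRR389222_sub1_fastqc.html",
        {"SRR389222_sub2_fastqc.html": {
            "mimetype": "text/html",
            "content": "<tr><td>Sequence length</td><td>36</td></tr>",
        }},
    ]},
    {"zips": [
        {"SRR389222_sub1_fastqc.zip": {
            "mimetype": "application/zip",
            "filesize": 512694,
            "filesize_tolerance": 1000,
        }},
    ]},
]


def flatten(metadata):
    """Flatten the nested structure into {'dir/filename': attrs} pairs."""
    flat = {}
    for directory in metadata:
        (dirname, entries), = directory.items()
        for entry in entries:
            if isinstance(entry, str):   # bare filename, no attributes
                flat[f"{dirname}/{entry}"] = {}
            else:                        # {filename: attrs} mapping
                (fname, attrs), = entry.items()
                flat[f"{dirname}/{fname}"] = attrs or {}
    return flat
```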
Bonus thoughts

- Python or Groovy?
  - Easiest route (for me) is the new subcommand above
  - Could be nice added value to do this in Groovy instead, and have a `--check-results` pipeline param that enables the check at the end of the run. But that probably complicates things and bloats the codebase. (Could always add this in later with the same metadata files)
- Ignore unexpected files? (I think so)
- Need a command to build / update these metadata files automatically
I agree that something which integrates as cleanly as possible with the current Nextflow infrastructure would probably be best
It seemed to me that one of the challenges was just in writing the code to check the output of each file type. That's why in my example I wrote a wrapper around `samtools flagstat` and then stopped there, because the time investment grew rather large to wrap each tool's files in a class just to access its output more easily. In that regard, MultiQC could be useful.
For reference, the method I hacked together with Python `unittest` was to just validate the exit status of the pipeline and do custom parsing of the output files:
- https://github.com/stevekm/nextflow-ci/blob/master/test_nextflow.py
- https://github.com/stevekm/nextflow-ci/blob/master/test_flagstat.py
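The exit-status half of that approach boils down to something like the following sketch. The command here is a harmless stand-in; a real test would invoke something like `nextflow run main.nf -profile test,docker`:

```python
import subprocess
import sys
import unittest


class TestPipelineExitStatus(unittest.TestCase):
    """Smoke test: the pipeline command must exit with status 0."""

    # Stand-in command so the example is self-contained; a real test
    # would run e.g. ["nextflow", "run", "main.nf", "-profile", "test,docker"]
    COMMAND = [sys.executable, "-c", "print('pipeline ok')"]

    def test_exit_status(self):
        proc = subprocess.run(self.COMMAND, capture_output=True, text=True)
        self.assertEqual(proc.returncode, 0, proc.stderr)
```

Run it with `python -m unittest <test_file>`; custom parsing of individual output files would then be added as further test methods.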
Another reference: I have been working on basic methods to do similar unit testing of CWL workflows here: https://github.com/mskcc/pluto-cwl/blob/master/tests/test_generate_cBioPortal_file_cwl.py
This seems pretty similar to the issue of unit testing Nextflow outputs, if Python were chosen for it.
Now that the new DSL2 + module system is coming out soon, maybe it would be good to wait until their formal release? The Nextflow modules especially seem like something that would help a lot to streamline unit testing for Nextflow.
In my CWL examples there, I am taking advantage of the default `cwl-runner` behaviour of returning a JSON document that gives a definition and description of all the output items defined by the workflow/tool:
https://github.com/mskcc/pluto-cwl/blob/c750d9d23bc56680e7071a9841a50cd0ca6263b6/tests/test_generate_cBioPortal_file_cwl.py#L58-L71
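For illustration, asserting on that runner JSON from Python is straightforward. The output id, path, and file here are invented (the checksum/size pair happens to be that of an empty file); the `class`/`location`/`basename`/`checksum`/`size` fields are the standard CWL File object fields:

```python
import json

# Hypothetical cwl-runner stdout for a workflow with one declared output
runner_stdout = """
{
  "output_file": {
    "class": "File",
    "location": "file:///tmp/run1/portal_meta.txt",
    "basename": "portal_meta.txt",
    "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
    "size": 0
  }
}
"""

outputs = json.loads(runner_stdout)

# Assertions in the style of the linked CWL tests
assert outputs["output_file"]["class"] == "File"
assert outputs["output_file"]["basename"] == "portal_meta.txt"
assert outputs["output_file"]["size"] == 0
```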
@pditommaso @evanfloden maybe this is a feature we could replicate in Nextflow in order to aid testing like this? Having a method to get a description of the task/workflow outputs like this would help a lot I think. Not sure if this is accessible via Groovy either?
@ewels
The Nextflow `dump()` channel operator can perhaps be used for debugging / testing.
A few pipelines (eg. sarek) already use this for debugging / introspection. It's not exactly standardised though.
@Emiller88 is working on this, currently in the rnaseq pipeline - https://github.com/nf-core/rnaseq/pull/546
@ewels This is lovely for context!
@maxulysse has also got started on sarek. https://github.com/nf-core/sarek/pull/370
I think my plans for it as of now are to flesh this out on rnaseq in this order
- Convert current CI to pytest-workflow (https://github.com/nf-core/rnaseq/pull/546)
- Add local subworkflow e2e tests
- Add local module tests
I'd like to do those in 3 PRs to keep the scope small.
Then possibly in the future add tests for `bin` scripts and local Groovy code, but I think those are far off.
The functionality that would be nice here in tools is just `nf-core test <tag> -profile docker` or something similar.
I'd love for any feedback!
I'm following along. I'd love to see an example of how to incorporate unit testing and/or regression testing into my Nextflow DSL2 projects. I'm looking for guidance. I'm happy to follow along with your recommendations, and especially happy to follow along by looking at some example projects.
Where's the best place for me to get started?
As an FYI, I built a similar testing framework for the GenePattern Server platform, called GpUnit: https://github.com/broadinstitute/GpUnit So as not to reinvent any wheels, at the time I chose to build on top of Ant and JUnit. It's still going strong. Some decisions I am happy with, some not so much.
Good ideas:
- don't reinvent the wheel; it's a good idea to integrate with an existing test framework
- declarative, human-readable test cases, committed to GitHub; GpUnit declares tests in a (schema-less) YAML format; even though the guts of the system is JUnit, it's very simple to create the tests; the implementation of the test framework is decoupled from the declaration of the test cases
- keep it simple; it's easy to get started; by default expect one regression-type test, consisting of a known good run of your module
  - the simplest case checks for a successful (zero) exit code
  - the next simplest case checks for file names; e.g. did my module output a file named 'my-output.txt'?
  - it's also possible to diff actual output with expected output, on a file-by-file basis
  - you can also do a recursive diff of an entire directory tree
  - it's modular: the diff function can be customized to handle, for example, small differences in numerical output
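The last two points (a recursive directory diff with a tolerance-aware numeric comparison plugged in) could be sketched roughly like this; the tolerance value and line-by-line strategy are illustrative choices, not GpUnit's actual implementation:

```python
import math
from pathlib import Path


def diff_dirs(expected_dir: str, actual_dir: str, rel_tol: float = 1e-6) -> list:
    """Recursively compare two directory trees line by line.

    Lines that parse as floats are compared within rel_tol; everything
    else must match exactly. Extra trailing lines are ignored in this
    sketch. Returns a list of differences (empty == match).
    """
    diffs = []
    expected_root = Path(expected_dir)
    for exp_file in sorted(expected_root.rglob("*")):
        if not exp_file.is_file():
            continue
        rel = exp_file.relative_to(expected_root)
        act_file = Path(actual_dir) / rel
        if not act_file.is_file():
            diffs.append(f"missing: {rel}")
            continue
        exp_lines = exp_file.read_text().splitlines()
        act_lines = act_file.read_text().splitlines()
        for i, (a, b) in enumerate(zip(exp_lines, act_lines), 1):
            if a == b:
                continue
            try:  # tolerate small numeric drift
                if math.isclose(float(a), float(b), rel_tol=rel_tol):
                    continue
            except ValueError:
                pass
            diffs.append(f"{rel}:{i}: {a!r} != {b!r}")
    return diffs
```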
Not-so-good ideas:
- each test is a full round-trip to the SaaS server
  - this is pretty good for stress testing the server, including the REST API, job engine, etc.
  - this is a problem if the server is down; tests will fail
  - the same if the module under test is not installed on the server; tests will fail
Challenges:
- recursive directory diff is not sufficient to cover all test cases
- documentation. In general it's difficult to document software systems; some of the finer points of the system are buried in the comments in the source code, in slack messages, old email messages, etc.
- make it easy to get started; for example by making a template project that requires one (and only one) test-case to exist and to pass before your CI server gives you the green light
`pytest-workflow` is running in production in Sarek now: https://github.com/nf-core/sarek/tree/dev/tests
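For reference, pytest-workflow tests are themselves declarative YAML files; a minimal one in the spirit of the Sarek tests might look like this (the file paths and expected strings are illustrative, not taken from Sarek):

```yaml
- name: Run pipeline with test profile
  command: nextflow run main.nf -profile test,docker
  tags:
    - default
  files:
    - path: results/multiqc/multiqc_report.html
      contains:
        - "MultiQC"
    - path: results/fastqc/sample_fastqc.zip
```

pytest then discovers these YAML files, runs each `command`, and checks the declared files, contents, and exit code.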
I think we should add this to the pipeline template.
I think the new `cat` of the pytest-workflow failed output was the final piece of the puzzle: https://github.com/nf-core/modules/pull/1226
Now being handled with nf-test.