modules
modules copied to clipboard
Add starting deseq2 differential module
This PR adds a wrapper for DESeq2. Notes:
- Implemented as a template, initially due to need to include a wrapper script and current lack of module
bin
directory (see https://nfcore.slack.com/archives/CJRH30T6V/p1665066826554919)- I think this actually works better than use of a module
bin
, because we don't need an additional library to provide a CLI! So here we can just use the biocontainer for DESeq2 and nothing else, no need to create bespoke containers etc.
- I think this actually works better than use of a module
- Optional params can be supplied via the usual
task.ext.args
route, and are handled within the template. - I've defined contrasts with a groovy map, in an analagous way to sample meta in other workflows, comprising:
- variable - a variable from the supplied sample sheet that can be used to group samples
- reference - level of contrast variable indicating 'reference' samples.
- target - level of contrast variable indicating 'target'/ 'treatment' samples.
- blocking - comma-separated list of blocking variables to be included in the model.
- A test expression matrix was added to the test data here.
- Tiny variations between machines at the 12th decimal point (or so) in some test outputs were causing the CI to fail. So I've just added some very limited rounding to ensure the reproducibility, which can be activated specifically in the CI context.
PR checklist
Tackling https://github.com/nf-core/modules/issues/2155
- [x] Adds module for differential expression analysis with DESeq2
- [x] If you've fixed a bug or added code that should be tested, add tests!
- [x] If you've added a new tool - have you followed the module conventions in the contribution docs
- [x] If necessary, include test data in your PR.
- [x] Remove all TODO statements.
- [x] Emit the
versions.yml
file. - [x] Follow the naming conventions.
- [x] Follow the parameters requirements.
- [x] Follow the input/output options guidelines.
- [x] Add a resource
label
- [x] Use BioConda and BioContainers if possible to fulfil software requirements.
- Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
- [x]
PROFILE=docker pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
- [ ]
PROFILE=singularity pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
- [ ]
PROFILE=conda pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
- [x]
@nf-core-bot fix-linting
@nf-core-bot fix linting
Did you use the nf-core modules create
tool to create this module? The test files are absent in this PR (which is why some of the actions aren't running)
Did you use the
nf-core modules create
tool to create this module? The test files are absent in this PR (which is why some of the actions aren't running)
I'll get there- just a WIP for visibility and discussion, I'll get things tested before requesting formal review.
Maybe change this to a draft PR then :)
@nf-core-bot fix linting
@nf-core-bot fix linting
@nf-core-bot fix linting
Almost there, just tripping up on reproducibility of results with DESeq I think- puzzled how the same container could be doing something different local and remote. Doing set.seed()
in R locally does not change the result files, so I don't think that's a valid fix...
huh, now some of the CI md5sums have flipped back to a state they had previously, so this is not entirely random
Don't worry too much about rounding differences in the output files, that is a common issue. You can just check for a string inside the file instead, or failing that that the file exists at all.
I think we do need optparse
here. You can pass the args
string along by generating a template file that calls the R script, so you have templates/deseq_de.R
and templates/deseq_de.sh
,
with templates/deseq_de.sh
as:
#!/usr/bin/env sh
deseq_de.R --sample_file !{samplesheet} --counts_file !{counts} !{args}
This will pass the samplesheet and counts file names correctly based on the inputs, and then any other optional arguments from task.ext.args
.
Then in the main.nf you have:
args = task.ext.args ?: 'blah blah'
template "deseq_de.sh"
Without optparse
, you could manually sort through the args
string, but that seems suboptimal to just getting a container with deseq2 and optparse. There is a string suggesting a multi-container exists which would do the job, but I can't figure out how to actually find the resulting name, I get nothing from mulled-search --destination quay singularity conda --search deseq2 | grep "mulled"
.
It is possible other people disagree with me on this, but I don't think these parameters should be global. What happens if you want to run DESeq twice with different parameters each time?
Thanks @SPPearce - appreciate the info, the rationale is clearer to me now, and it makes sense not to have the global parameters. Will update the PR notes to match the below.
An optional and very minor reduction in precision on the output numbers (still have 8 decimal places!) has addressed the CI reproducibility issues, no no further action required there I think.
In https://github.com/nf-core/modules/pull/2167/commits/4b4adfc44fce37c80e64631be600a7fba05130e4 I take the overrides from task.ext.args
and parse . Doing a def args
doesn't seem to work for templates since a newargs
wasn't visible from within the template. Re-assigning task.ext.args worked, but seemed messy, and handling the overrides directly in R seemed neater anyway. All the optional params can now be supplied via --
options.
I hope you'll forgive my persistence in being an optparse refusenik. I can see myself making quite a few of these based on different R packages and I'd really like to avoid having to create a new container each time just to include optparse. We don't need most of the functionality anyway, since users won't be calling the script directly, and the basic parsing is pretty trivial to implement.
I have no qualms about you using optparse or not personally, I definitely agree that it is much easier to use predefined containers! Although I struggle writing anything in base-R anymore without tidyverse available to me nowadays. And it looks like you have clean enough code for performing that, does it error if you provide an unexpected argument (i.e. a typo)? Are you expecting the contrast_meta to be constructed from rows of a csv file? I'll take a proper look tomorrow and make suggestions.
does it error if you provide an unexpected argument (i.e. a typo)?
Yep! https://github.com/nf-core/modules/blob/d59f148cb848b4afabc6d6c2041651f28610bddb/modules/nf-core/deseq2/differential/templates/deseq_de.r#L67
Are you expecting the contrast_meta to be constructed from rows of a csv file?
Yep, that's what I have locally
I'll take a proper look tomorrow and make suggestions.
Thanks!
Does this shrink the log2FC values automatically? I can't see it. I'm pretty sure it happens in
DESeq2()
for calculating the p values etc, but I haven't applied it to the fold changes in the output table (like this?).
If you'd like that to happen should it be done all the time, or on request @SPPearce ?
Does this shrink the log2FC values automatically? I can't see it. I'm pretty sure it happens in
DESeq2()
for calculating the p values etc, but I haven't applied it to the fold changes in the output table (like this?).If you'd like that to happen should it be done all the time, or on request @SPPearce ?
Mmm, I don't know. I'm not an expert in RNA-seq at all, mostly I work on DNA. I just know that it is something that I've done in the few times I've done any. I would ask people with more experience than me of using DESeq2.
Does this shrink the log2FC values automatically? I can't see it. I'm pretty sure it happens in
DESeq2()
for calculating the p values etc, but I haven't applied it to the fold changes in the output table (like this?).If you'd like that to happen should it be done all the time, or on request @SPPearce ?
Mmm, I don't know. I'm not an expert in RNA-seq at all, mostly I work on DNA. I just know that it is something that I've done in the few times I've done any. I would ask people with more experience than me of using DESeq2.
Sounds like something that should be controllable behaviour, anyway, whatever the default behaviour.
Does this shrink the log2FC values automatically? I can't see it.
I'm going to revise my earlier response with this from the DESeq docs:
"An option in DESeq2 is to provide maximum a posteriori estimates of the log2 fold changes in βi after incorporating a zero-centered Normal prior (betaPrior). While previously, these moderated, or shrunken, estimates were generated by DESeq or nbinomWaldTest functions, they are now produced by the lfcShrink function"
So I need to look into how to run the shrinkage effectively, I've a nasty feeling it's going to need something not in the DESeq2 biocontainer.
Edit: default behaviour of lfcShrink()
requires a missing dependency, which doesn't seem sensible to me. I'm proposing an update on the biocontainer: https://github.com/bioconda/bioconda-recipes/pull/37391.
(note: singularity failure is just because we're waiting for a build to propagate from biocontainers)
Thanks @ggabernet! I'm just going to switch the inputs to be keyed by the meta- as per feedback in https://github.com/nf-core/modules/pull/2321, will do that quickly now.
One thing that could be done to also test the blocking variables and that the execution with multiple contrasts works fine would be to add another contrast in the example data with a blocking variable provided.
@ggabernet Good point- modifications to the test data here: https://github.com/nf-core/test-datasets/pull/670
Actually this would be important only in the case that the files are initially provided to the pipeline with a samplesheet (e.g. multiple fastq files to analyze, each of them with individual sample metadata that is passed to the meta
map).
In the case of the differentialabundance pipeline, we expect to have just 1 count table with all samples to analyze. The samplesheet containing info about the different samples will be provided to the module input. Therefore I would argue that we will not need the meta
map here.
In the case of the other module that you mentioned #2321 that operates on reference data (e.g. reference GTF
or fasta
) and not on sample (sequencing) files, the meta maps would not be needed either.
Thanks @ggabernet (and apologies I just saw this). I was confused in the other review - it seemed like the meta thing was pretty non-negotiable (though as it turned out I think there was a use for it there).
In the latest commits I've put the inputs in a single tuple with the 'contrast meta' as the 'meta' rather than a separate channel for the contrast meta, which I think is nicer, but happy to reverse that if you like.
(and the additional contrast with blocking factor is now tested too)
Thanks for the review @ggabernet !