modules Add starting deseq2 differential module

This PR adds a wrapper for DESeq2. Notes:

Implemented as a template, initially due to need to include a wrapper script and current lack of module bin directory (see https://nfcore.slack.com/archives/CJRH30T6V/p1665066826554919)
- I think this actually works better than use of a module bin, because we don't need an additional library to provide a CLI! So here we can just use the biocontainer for DESeq2 and nothing else, no need to create bespoke containers etc.
Optional params can be supplied via the usual task.ext.args route, and are handled within the template.
I've defined contrasts with a groovy map, in an analagous way to sample meta in other workflows, comprising:
- variable - a variable from the supplied sample sheet that can be used to group samples
- reference - level of contrast variable indicating 'reference' samples.
- target - level of contrast variable indicating 'target'/ 'treatment' samples.
- blocking - comma-separated list of blocking variables to be included in the model.
A test expression matrix was added to the test data here.
Tiny variations between machines at the 12th decimal point (or so) in some test outputs were causing the CI to fail. So I've just added some very limited rounding to ensure the reproducibility, which can be activated specifically in the CI context.

PR checklist

Tackling https://github.com/nf-core/modules/issues/2155

[x] Adds module for differential expression analysis with DESeq2
[x] If you've fixed a bug or added code that should be tested, add tests!
[x] If you've added a new tool - have you followed the module conventions in the contribution docs
[x] If necessary, include test data in your PR.
[x] Remove all TODO statements.
[x] Emit the versions.yml file.
[x] Follow the naming conventions.
[x] Follow the parameters requirements.
[x] Follow the input/output options guidelines.
[x] Add a resource label
[x] Use BioConda and BioContainers if possible to fulfil software requirements.
Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
- [x] PROFILE=docker pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
- [ ] PROFILE=singularity pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
- [ ] PROFILE=conda pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware

Oct 06 '22 21:10 pinin4fjords

@nf-core-bot fix-linting

Oct 07 '22 08:10 pinin4fjords

@nf-core-bot fix linting

Oct 07 '22 08:10 pinin4fjords

Did you use the nf-core modules create tool to create this module? The test files are absent in this PR (which is why some of the actions aren't running)

Oct 07 '22 13:10 nvnieuwk

Did you use the nf-core modules create tool to create this module? The test files are absent in this PR (which is why some of the actions aren't running)

I'll get there- just a WIP for visibility and discussion, I'll get things tested before requesting formal review.

Oct 07 '22 13:10 pinin4fjords

Maybe change this to a draft PR then :)

Oct 07 '22 13:10 nvnieuwk

@nf-core-bot fix linting

Oct 07 '22 18:10 pinin4fjords

@nf-core-bot fix linting

Oct 07 '22 18:10 pinin4fjords

@nf-core-bot fix linting

Oct 07 '22 22:10 pinin4fjords

Almost there, just tripping up on reproducibility of results with DESeq I think- puzzled how the same container could be doing something different local and remote. Doing set.seed() in R locally does not change the result files, so I don't think that's a valid fix...

Oct 07 '22 22:10 pinin4fjords

huh, now some of the CI md5sums have flipped back to a state they had previously, so this is not entirely random

Oct 07 '22 23:10 pinin4fjords

Don't worry too much about rounding differences in the output files, that is a common issue. You can just check for a string inside the file instead, or failing that that the file exists at all.

I think we do need optparse here. You can pass the args string along by generating a template file that calls the R script, so you have templates/deseq_de.R and templates/deseq_de.sh,

with templates/deseq_de.sh as:

#!/usr/bin/env sh
deseq_de.R --sample_file !{samplesheet} --counts_file !{counts} !{args}

This will pass the samplesheet and counts file names correctly based on the inputs, and then any other optional arguments from task.ext.args. Then in the main.nf you have:

args = task.ext.args ?: 'blah blah' 
template "deseq_de.sh"

Without optparse, you could manually sort through the args string, but that seems suboptimal to just getting a container with deseq2 and optparse. There is a string suggesting a multi-container exists which would do the job, but I can't figure out how to actually find the resulting name, I get nothing from mulled-search --destination quay singularity conda --search deseq2 | grep "mulled".

It is possible other people disagree with me on this, but I don't think these parameters should be global. What happens if you want to run DESeq twice with different parameters each time?

Oct 08 '22 21:10 SPPearce

Thanks @SPPearce - appreciate the info, the rationale is clearer to me now, and it makes sense not to have the global parameters. Will update the PR notes to match the below.

An optional and very minor reduction in precision on the output numbers (still have 8 decimal places!) has addressed the CI reproducibility issues, no no further action required there I think.

In https://github.com/nf-core/modules/pull/2167/commits/4b4adfc44fce37c80e64631be600a7fba05130e4 I take the overrides from task.ext.args and parse . Doing a def args doesn't seem to work for templates since a newargs wasn't visible from within the template. Re-assigning task.ext.args worked, but seemed messy, and handling the overrides directly in R seemed neater anyway. All the optional params can now be supplied via -- options.

I hope you'll forgive my persistence in being an optparse refusenik. I can see myself making quite a few of these based on different R packages and I'd really like to avoid having to create a new container each time just to include optparse. We don't need most of the functionality anyway, since users won't be calling the script directly, and the basic parsing is pretty trivial to implement.

Oct 09 '22 00:10 pinin4fjords

I have no qualms about you using optparse or not personally, I definitely agree that it is much easier to use predefined containers! Although I struggle writing anything in base-R anymore without tidyverse available to me nowadays. And it looks like you have clean enough code for performing that, does it error if you provide an unexpected argument (i.e. a typo)? Are you expecting the contrast_meta to be constructed from rows of a csv file? I'll take a proper look tomorrow and make suggestions.

Oct 09 '22 07:10 SPPearce

does it error if you provide an unexpected argument (i.e. a typo)?

Yep! https://github.com/nf-core/modules/blob/d59f148cb848b4afabc6d6c2041651f28610bddb/modules/nf-core/deseq2/differential/templates/deseq_de.r#L67

Are you expecting the contrast_meta to be constructed from rows of a csv file?

Yep, that's what I have locally

I'll take a proper look tomorrow and make suggestions.

Thanks!

Oct 09 '22 09:10 pinin4fjords

Does this shrink the log2FC values automatically? I can't see it. I'm pretty sure it happens in DESeq2() for calculating the p values etc, but I haven't applied it to the fold changes in the output table (like this?).

If you'd like that to happen should it be done all the time, or on request @SPPearce ?

Oct 10 '22 10:10 pinin4fjords

Does this shrink the log2FC values automatically? I can't see it. I'm pretty sure it happens in DESeq2() for calculating the p values etc, but I haven't applied it to the fold changes in the output table (like this?).

If you'd like that to happen should it be done all the time, or on request @SPPearce ?

Mmm, I don't know. I'm not an expert in RNA-seq at all, mostly I work on DNA. I just know that it is something that I've done in the few times I've done any. I would ask people with more experience than me of using DESeq2.

Oct 10 '22 12:10 SPPearce

Does this shrink the log2FC values automatically? I can't see it. I'm pretty sure it happens in DESeq2() for calculating the p values etc, but I haven't applied it to the fold changes in the output table (like this?).

If you'd like that to happen should it be done all the time, or on request @SPPearce ?

Mmm, I don't know. I'm not an expert in RNA-seq at all, mostly I work on DNA. I just know that it is something that I've done in the few times I've done any. I would ask people with more experience than me of using DESeq2.

Sounds like something that should be controllable behaviour, anyway, whatever the default behaviour.

Oct 10 '22 12:10 pinin4fjords

Does this shrink the log2FC values automatically? I can't see it.

I'm going to revise my earlier response with this from the DESeq docs:

"An option in DESeq2 is to provide maximum a posteriori estimates of the log2 fold changes in βi after incorporating a zero-centered Normal prior (betaPrior). While previously, these moderated, or shrunken, estimates were generated by DESeq or nbinomWaldTest functions, they are now produced by the lfcShrink function"

So I need to look into how to run the shrinkage effectively, I've a nasty feeling it's going to need something not in the DESeq2 biocontainer.

Edit: default behaviour of lfcShrink() requires a missing dependency, which doesn't seem sensible to me. I'm proposing an update on the biocontainer: https://github.com/bioconda/bioconda-recipes/pull/37391.

Oct 11 '22 09:10 pinin4fjords

(note: singularity failure is just because we're waiting for a build to propagate from biocontainers)

Oct 11 '22 12:10 pinin4fjords

Thanks @ggabernet! I'm just going to switch the inputs to be keyed by the meta- as per feedback in https://github.com/nf-core/modules/pull/2321, will do that quickly now.

Oct 20 '22 09:10 pinin4fjords

One thing that could be done to also test the blocking variables and that the execution with multiple contrasts works fine would be to add another contrast in the example data with a blocking variable provided.

@ggabernet Good point- modifications to the test data here: https://github.com/nf-core/test-datasets/pull/670

Oct 20 '22 10:10 pinin4fjords

Actually this would be important only in the case that the files are initially provided to the pipeline with a samplesheet (e.g. multiple fastq files to analyze, each of them with individual sample metadata that is passed to the meta map). In the case of the differentialabundance pipeline, we expect to have just 1 count table with all samples to analyze. The samplesheet containing info about the different samples will be provided to the module input. Therefore I would argue that we will not need the meta map here.

In the case of the other module that you mentioned #2321 that operates on reference data (e.g. reference GTF or fasta) and not on sample (sequencing) files, the meta maps would not be needed either.

Oct 20 '22 11:10 ggabernet

Thanks @ggabernet (and apologies I just saw this). I was confused in the other review - it seemed like the meta thing was pretty non-negotiable (though as it turned out I think there was a use for it there).

In the latest commits I've put the inputs in a single tuple with the 'contrast meta' as the 'meta' rather than a separate channel for the contrast meta, which I think is nicer, but happy to reverse that if you like.

(and the additional contrast with blocking factor is now tested too)

Oct 20 '22 12:10 pinin4fjords

Thanks for the review @ggabernet !

Oct 20 '22 13:10 pinin4fjords

modules modules copied to clipboard

Add starting deseq2 differential module

PR checklist

modules
modules copied to clipboard