Allow `dream` to process non-formula contrasts
PR checklist
Closes https://github.com/nf-core/differentialabundance/issues/578
- [ ] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If you've added a new tool - have you followed the module conventions in the contribution docs
- [ ] If necessary, include test data in your PR.
- [ ] Remove all TODO statements.
- [ ] Emit the
versions.ymlfile. - [ ] Follow the naming conventions.
- [ ] Follow the parameters requirements.
- [ ] Follow the input/output options guidelines.
- [ ] Add a resource
label - [ ] Use BioConda and BioContainers if possible to fulfil software requirements.
- Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
- For modules:
- [ ]
nf-core modules test <MODULE> --profile docker - [ ]
nf-core modules test <MODULE> --profile singularity - [ ]
nf-core modules test <MODULE> --profile conda
- [ ]
- For subworkflows:
- [ ]
nf-core subworkflows test <SUBWORKFLOW> --profile docker - [ ]
nf-core subworkflows test <SUBWORKFLOW> --profile singularity - [ ]
nf-core subworkflows test <SUBWORKFLOW> --profile conda
- [ ]
- For modules:
I think you still need to take the baseline level into account when building contrasts, as done here for DESeq2:
https://github.com/nf-core/modules/blob/8846461fd688f29921c28dd4182913969a4f36c1/modules/nf-core/deseq2/differential/templates/deseq2_differential.R#L450-L457
DESeq2 allows to test contrasts with a c(variable, target, reference) tuple. I don't know of DREAM offers the same. If it doesn't, you'll need to manually build the contrast or choose appropriate baseline levels for the factor with the contrast variable.
So regarding the eBayes, it is necessary to call it after contrasts.fit, according to this example from the limma user guide:
https://bioconductor.org/packages//2.11/bioc/vignettes/limma/inst/doc/usersguide.pdf
It wouldn't be necessary to call it before contrasts.fit, but it doesn't harm. (we still need it if we don't call contrasts.fit at all, but results() directly on the first fit).
I'll defer to @grst on approval for this
ok, then here are my last two requests:
- [ ] Can you please add a check that (variable, reference, target) contrasts are not combined with a custom formula?
- [ ] remove the special case when factors are missing from the design matrix, just fail instead.
Like that we focus on the usecases we currently need for differentialabundance and keep it simple.
@grst
ok, then here are my last two requests:
- [ ] Can you please add a check that (variable, reference, target) contrasts are not combined with a custom formula?
- [x] remove the special case when factors are missing from the design matrix, just fail instead.
Like that we focus on the usecases we currently need for differentialabundance and keep it simple.
I will remove the special case now, but I am not sure how to approach the check for the contrast tuple since the script currently uses the contrast_variable and contrast_reference to relevel the metadata:
if (!is.null(opt\$contrast_reference) && opt\$contrast_reference != "") {
metadata[[opt\$contrast_variable]] <- factor(metadata[[opt\$contrast_variable]])
metadata[[opt\$contrast_variable]] <- relevel(metadata[[opt\$contrast_variable]], ref = opt\$contrast_reference)
}
I could make the check only for blocking_variables and contrast_reference but then that wouldn't allow subsetting to contrast samples.
Would the check be necessary if we already have a validation where contrast_string must be provided if using formula?
I don't think the relevel or subset to contrast samples should work together with a custom contrast/formula. At the pipeline level, when specifying a formula and a contrast string, contrast variable, target and reference won't be set.
You can write the check such that this possibility is excluded.
Okay, I have added the validation so that the module fails when a formula is used together with the contrast tuple.
Some of the tests are failing because no explicit reference level is set and therefore no releveling is performed, so sometimes the "target" level is absorbed into the intercept by the default:
> ~genotype * treatment
> Warning message:
> Zero sample variances detected, have been offset away from zero
> (Intercept) genotypeWT treatmentTreated genotypeWT:treatmentTreated
> Sample1 1 1 0 0
> Sample2 1 1 0 0
> Sample3 1 1 0 0
> [1] "(Intercept)" "genotypeWT"
> [3] "treatmentTreated" "genotypeWT:treatmentTreated"
> Using contrast string: genotypeKO - treatmentTreated
> Error in eval(ej, envir = levelsenv) : object 'genotypeKO' not found
In this case, genotype (KO) is encoded in the intercept by default and therefore does not exist as a standalone coefficient in the design matrix and cannot be directly referenced in a contrast expression.
@grst Do you think this is behavior that users are generally expected to understand and handle themselves?
yes, users are expected to understand how this works if they use formulas and contrast strings.
One thing we should consider is how to explicitly set the baseline level of the factors (I believe currently it's the default, i.e. alphabetically). But that's a different topic that we should discuss at the pipeline level first.