Multi-tool functionality and subworkflows as hub of methods
Goals
- Make it possible for the pipeline to run multiple combinations of tools (e.g. limma + gProfiler2, DESeq2 + GSEA) at once, by:
  - Defining a toolsheet
  - Properly handling the channels for the multiple runs
- Create method-class-based subworkflows, as a place to easily add new methods of the same kind:
  - Differential subworkflow that calls DE methods
  - Enrichment subworkflow that calls functional analysis methods
- Add new methods
Context
Some effort was made in the dev-ratio branch to explore these options.
The plan now is to break the work down into small pieces, clean up the code, and open PRs against dev.
Steps needed
- [x] Component: modules -> new methods for the pipeline
- [ ] Update nf-core module propd (differential proportionality)
- [x] nf-core/modules#7195
- [x] Component: modules -> update to coordinate with the new logic
- [x] nf-core/modules#7200
- [x] Component: subworkflows
- [x] nf-core/differentialabundance#341
- [x] nf-core/differentialabundance#387
- [x] nf-core/differentialabundance#384
- [x] nf-core/differentialabundance#341
- [ ] Pipeline -> need to update code to coordinate with the new logic
- [x] nf-core/differentialabundance#383
- [ ] #433
- [x] #423
- [ ] #422
- [ ] #464
- [ ] #465
- [ ] #449
- [ ] Add method comparison plots/reports
- [ ] #424
- [ ] Update documentation
Other related features
- [ ] nf-core/differentialabundance#388
- [ ] nf-core/modules#7290
- [x] #363
- [ ] nf-core/differentialabundance#367
- [ ] #362
Regarding the "Toolsheet", how does that relate to what we proposed in https://github.com/nf-core/differentialabundance/issues/362?
> Regarding the "Toolsheet", how does that relate to what we proposed in #362?
The toolsheet decides which DE and functional analysis methods to run. An example is here. This is the default toolsheet, where each row is a combination of tools that make sense together.
The idea is that the user can select, for example, --pathway deseq2_gsea,limma_gprofiler2, and the pipeline will run both combinations at once with the default parameters for each method (parameters can still be changed through the toolsheet or command-line flags).
As for your question, the method option in the contrast file could be a way to match between each contrast and the corresponding method to run.
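The selection step described above can be sketched roughly as follows. This is an illustration in Python, not the pipeline's Nextflow code; the column names (`pathway_name`, `diff_method`, `func_method`), the inline toolsheet content, and the third row are assumptions made for the example.

```python
import csv
import io

# Hypothetical toolsheet content; the real default file and its columns may differ.
TOOLSHEET = """pathway_name,diff_method,func_method
deseq2_gsea,deseq2,gsea
limma_gprofiler2,limma,gprofiler2
limma_gsea,limma,gsea
"""


def select_pathways(toolsheet_text, pathway_arg):
    """Return the toolsheet rows named in a comma-separated --pathway value."""
    reader = csv.DictReader(io.StringIO(toolsheet_text))
    rows = {row["pathway_name"]: row for row in reader}
    requested = [name.strip() for name in pathway_arg.split(",") if name.strip()]
    # Fail early on names that are not defined in the toolsheet.
    unknown = [name for name in requested if name not in rows]
    if unknown:
        raise ValueError(f"Unknown pathway(s): {', '.join(unknown)}")
    return [rows[name] for name in requested]


selected = select_pathways(TOOLSHEET, "deseq2_gsea,limma_gprofiler2")
for row in selected:
    print(row["diff_method"], "->", row["func_method"])
```

Each selected row would then drive one differential + enrichment run, so supporting several rows at once comes down to iterating over this list.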
I'm wondering if it wouldn't be more convenient to specify everything in yaml format? Essentially each list item would replace one row in your toolsheet and everything could be specified in one place. YAML seems the more natural choice to me in cases where you have a lot of empty columns in a CSV file otherwise and/or lists of things such as deseq2_gsea,limma_gprofiler2.
I'm also afraid that all the parameters for a differentialabundance run get scattered across too many places... nextflow params, contrasts file, toolsheet file, samplesheet... I'd rather reduce the number of places where to specify parameters.
Something like:
models:
  - method: limma
    formula: ~ treatment + response
    contrasts:
      - id: treatment_a_vs_b
        type: simple
        comparison: ["treatment", "A", "B"]
    enrichment:
      - gsea
      - gprofiler2
  - method: propd
    permutations: 100
    contrasts:
      - id: treatment
        type: anova
        column: treatment
  - compositional: propr
    metric: rho
This obviously needs to be fleshed out in more detail. For this, it would be important to understand which of the workflows depend on each other. I guess the compositional workflow is completely separate from the differential workflow. The enrichment workflow could be independent when working on the expression data, but it could also work off a ranked gene list generated by the differential workflow.
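To make the dependency question more concrete, here is a rough Python sketch of how such a nested specification could be expanded into individual runs. It writes the proposal above as the structure a YAML parser would return, and it assumes that `enrichment` applies to every contrast of its model and consumes the differential results. The `compositional` entry is left out because its relation to the other workflows is still open. The function name `expand_runs` is hypothetical.

```python
# The proposal above, written as the Python structure a YAML parser would return.
# Assumption: `enrichment` belongs to the model and applies to all its contrasts.
config = {
    "models": [
        {
            "method": "limma",
            "formula": "~ treatment + response",
            "contrasts": [
                {
                    "id": "treatment_a_vs_b",
                    "type": "simple",
                    "comparison": ["treatment", "A", "B"],
                },
            ],
            "enrichment": ["gsea", "gprofiler2"],
        },
        {
            "method": "propd",
            "permutations": 100,
            "contrasts": [
                {"id": "treatment", "type": "anova", "column": "treatment"},
            ],
        },
    ]
}


def expand_runs(cfg):
    """Yield (method, contrast_id, enrichment_or_None) for each analysis to run."""
    for model in cfg["models"]:
        for contrast in model.get("contrasts", []):
            # A model without enrichment still yields one differential-only run.
            for enrichment in model.get("enrichment") or [None]:
                yield (model["method"], contrast["id"], enrichment)


runs = list(expand_runs(config))
for run in runs:
    print(run)
```

In this reading, each list item plays the role of one or more toolsheet rows, which is why the two formats could in principle carry the same information.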
> I'm wondering if it wouldn't be more convenient to specify everything in yaml format?
I don't have a strong preference between YAML and CSV. However, merging the contrasts and the toolsheet into one file could become tricky. When many methods are available, it is nice to have a 'default' toolsheet as a place to specify all the combinations of tools that really make sense together from a theoretical perspective. This file will always be there, in the pipeline's GitHub repository, whereas the contrast file is data-specific.
> it is nice to have a 'default' toolsheet as a place to specify all the possible combinations of tools that really make sense to be together from the theoretical perspective
What are the implications of this? Would you fail the pipeline if a user specifies an "invalid" combination?
> What are the implications of this? Would you fail the pipeline if a user specifies an "invalid" combination?
We don't have a plan for that yet, but one option is to raise a warning that it is an untested combination.
Indeed, for benchmark users, we considered the possibility of providing an extra toolsheet with all the rows one wants to benchmark.
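The warning option mentioned above could look roughly like this. It is a minimal Python sketch, not pipeline code: the `TESTED` set stands in for the rows of the default toolsheet, and the function name `check_combination` is hypothetical.

```python
import warnings

# Hypothetical stand-in for the combinations listed in the default toolsheet.
TESTED = {("deseq2", "gsea"), ("limma", "gprofiler2")}


def check_combination(diff_method, func_method, tested=TESTED):
    """Warn, rather than fail, when a combination is not in the default toolsheet."""
    if (diff_method, func_method) in tested:
        return True
    warnings.warn(
        f"{diff_method} + {func_method} is not a tested combination; "
        "interpret the results with care.",
        UserWarning,
        stacklevel=2,
    )
    return False


check_combination("limma", "gprofiler2")  # listed, so no warning
check_combination("limma", "gsea")        # not listed, emits a UserWarning
```

Warning instead of failing keeps benchmark-style runs possible, where a user deliberately supplies combinations outside the default set.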
> I'm also afraid that all the parameters for a differentialabundance run get scattered across too many places... nextflow params, contrasts file, toolsheet file, samplesheet... I'd rather reduce the number of places where to specify parameters.
This is also a concern for us... but for the moment we have not found a better solution. It would be nice to brainstorm at some point, and contributions are super welcome if you find a better way :)
> This file will always be there, in the pipeline github. Whereas the contrast file is data specific.
Just to clarify again, this will only be in the pipeline and the user specifies the combination of tools using standard params, e.g. --pathway deseq2_gsea,limma_gprofiler2? Or will this be an additional input file for the user?
> Just to clarify again, this will only be in the pipeline and the user specifies the combination of tools using standard params, e.g. --pathway deseq2_gsea,limma_gprofiler2? Or will this be an additional input file for the user?
We defined tools = "${projectDir}/assets/tools_samplesheet.csv" in nextflow.config.
In theory, users do not need to provide any additional toolsheet to run the pipeline, but we also don't want to stop them from doing so. Hence, one can still point tools to a custom toolsheet at their own risk. Do you think this will be a problem?
No, it's all good then. All I wanted to know is that in a standard pipeline run, the user wouldn't be required to specify yet another config file.
As you said, we should still think about how to reduce the number of places where to specify parameters, but that's a topic for a separate issue.
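The default-versus-custom behaviour agreed on above can be summarised in a few lines. This is a Python sketch of the logic only; in the pipeline it is simply the `tools` parameter in nextflow.config. The function name `resolve_toolsheet` and the example paths are hypothetical.

```python
from pathlib import Path

# Relative location of the default toolsheet, as given in nextflow.config.
DEFAULT_TOOLSHEET = Path("assets") / "tools_samplesheet.csv"


def resolve_toolsheet(project_dir, custom_path=None):
    """Use the bundled toolsheet unless the user explicitly supplies their own."""
    if custom_path is None:
        return Path(project_dir) / DEFAULT_TOOLSHEET
    # A custom toolsheet is accepted as-is, at the user's own risk.
    return Path(custom_path)


print(resolve_toolsheet("/opt/pipeline"))
print(resolve_toolsheet("/opt/pipeline", "my_tools.csv"))
```

A standard run therefore needs no extra input file; only users who want non-default combinations have to provide one.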
Here I created a meta issue with all the steps/sub-issues needed to achieve what we agreed to do. Let me know what you think and if you would add/modify anything :)
CC @mirpedrol @bjlang @JoseEspinosa @pinin4fjords @WackerO
> I'm wondering if it wouldn't be more convenient to specify everything in yaml format?
Since the tool sheet will be read with nf-schema, it can accept both CSV and YAML, so a user could use the one that is more convenient for them.
> I'm wondering if it wouldn't be more convenient to specify everything in yaml format?
Actually @mirpedrol, if it is in YAML format, does that mean it would be more flexible and better allow the definition of optional methods/params?
I would say they are equivalent if we use simple YAML (without nesting); which one is easier to type is down to user preference.
@suzannejin, could you please provide a summary of where we stand now that the last PR has been merged, and what the next steps are? @atrigila mentioned that the toolsheet is currently limited to a single row. Support for iterating over the toolsheet would be really good to have for us for the functional analyses.
Hello @grst, so far we have completed:
- [x] #383
- [x] #423
The next step would be:
- [ ] #422