Multi-tool functionality and subworkflows as hub of methods

Open suzannejin opened this issue 1 year ago • 16 comments

Goals

  • Make it possible for the pipeline to run multiple combinations of tools (e.g. limma + gProfiler2, DESeq2 + GSEA) at once, by:
    • Defining a toolsheet
    • Handling the channels properly so that the combinations run together
  • Create subworkflows organized by method class, each serving as a place to easily add new methods of the same kind:
    • Differential subworkflow that calls DE methods
    • Enrichment subworkflow that calls functional analysis methods
  • Add new methods

Context

Some effort went into exploring these options in the dev-ratio branch. The plan now is to break the work down into small pieces, clean up the code, and open PRs against dev.

Steps needed

  • [x] Component: modules -> new methods for the pipeline
    • [ ] Update nf-core module propd (differential proportionality)
    • [x] nf-core/modules#7195
  • [x] Component: modules -> update to coordinate with the new logic
    • [x] nf-core/modules#7200
  • [x] Component: subworkflows
    • [x] nf-core/differentialabundance#341
      • [x] nf-core/differentialabundance#387
    • [x] nf-core/differentialabundance#384
  • [ ] Pipeline -> need to update code to coordinate with the new logic
    • [x] nf-core/differentialabundance#383
    • [ ] #433
    • [x] #423
    • [ ] #422
    • [ ] #464
    • [ ] #465
    • [ ] #449
    • [ ] Add method comparison plots/reports
    • [ ] #424
  • [ ] Update documentation

Other related features

  • [ ] nf-core/differentialabundance#388
  • [ ] nf-core/modules#7290
  • [x] #363
  • [ ] nf-core/differentialabundance#367
  • [ ] #362

suzannejin avatar Dec 10 '24 09:12 suzannejin

Regarding the "Toolsheet", how does that relate to what we proposed in https://github.com/nf-core/differentialabundance/issues/362?

grst avatar Dec 10 '24 10:12 grst

Regarding the "Toolsheet", how does that relate to what we proposed in #362?

The toolsheet is to decide which DE and functional analysis methods to run. An example is here. This is the default toolsheet where each row is a combination of tools that would make sense to be together.

The idea is that the user can select, for example, --pathway deseq2_gsea,limma_gprofiler2, and this will run both options at the same time with default parameters for each method (with the possibility to change the parameters via the toolsheet or command-line flags).
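To make the selection step concrete, here is a minimal Python sketch of how a comma-separated --pathway value could be matched against toolsheet rows. The column names (pathway_name, diff_method, func_method) and the rows are assumptions for illustration, not the pipeline's actual toolsheet schema, and the pipeline itself would do this in Nextflow rather than Python.

```python
import csv
import io

# Hypothetical toolsheet: column names and rows are illustrative only.
TOOLSHEET_CSV = """pathway_name,diff_method,func_method
deseq2_gsea,deseq2,gsea
limma_gprofiler2,limma,gprofiler2
limma_gsea,limma,gsea
"""


def select_pathways(toolsheet_text, pathway_arg):
    """Return the toolsheet rows named in pathway_arg, in the requested order."""
    reader = csv.DictReader(io.StringIO(toolsheet_text))
    rows = {row["pathway_name"]: row for row in reader}
    requested = [name.strip() for name in pathway_arg.split(",") if name.strip()]
    unknown = [name for name in requested if name not in rows]
    if unknown:
        raise ValueError(f"Unknown pathway(s): {', '.join(unknown)}")
    return [rows[name] for name in requested]


selected = select_pathways(TOOLSHEET_CSV, "deseq2_gsea,limma_gprofiler2")
for row in selected:
    # Each selected row is one differential + functional analysis combination.
    print(row["diff_method"], "->", row["func_method"])
```

Each returned row would then drive one run of the differential and enrichment subworkflows.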

As for your question, the method option in the contrast file could be a way to match between each contrast and the corresponding method to run.

suzannejin avatar Dec 10 '24 11:12 suzannejin

I'm wondering if it wouldn't be more convenient to specify everything in yaml format? Essentially each list item would replace one row in your toolsheet and everything could be specified in one place. YAML seems the more natural choice to me in cases where you have a lot of empty columns in a CSV file otherwise and/or lists of things such as deseq2_gsea,limma_gprofiler2.

I'm also afraid that all the parameters for a differentialabundance run get scattered across too many places... nextflow params, contrasts file, toolsheet file, samplesheet... I'd rather reduce the number of places where to specify parameters.

Something like:

models:
  - method: limma
    formula: ~ treatment + response
    contrasts:
      - id: treatment_a_vs_b
        type: simple
        comparison: ["treatment", "A", "B"]
    enrichment:
      - gsea
      - gprofiler2
  - method: propd
    permutations: 100
    contrasts:
      - id: treatment
        type: anova
        column: treatment
  - compositional: propr
    metric: rho
This obviously needs to be fleshed out in more detail. For this it would be important to understand which of the workflows depends on each other. I guess the compositional workflow is completely separate from the differential workflow. The enrichment workflow could be independent when working on the expression data, but it could also work off a ranked gene list generated by the differential workflow.
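As a rough illustration of how such a file could be turned into individual runs, the Python sketch below expands each model into (method, contrast, enrichment) combinations. It assumes the YAML has already been parsed into a plain dict, leaves out the compositional entry since that workflow is described as separate, and uses a hypothetical helper name (expand_runs); none of this is existing pipeline code.

```python
# The differential models from the proposal, written as the plain dict
# a YAML parser would return (the compositional entry is omitted here).
config = {
    "models": [
        {
            "method": "limma",
            "formula": "~ treatment + response",
            "contrasts": [
                {
                    "id": "treatment_a_vs_b",
                    "type": "simple",
                    "comparison": ["treatment", "A", "B"],
                }
            ],
            "enrichment": ["gsea", "gprofiler2"],
        },
        {
            "method": "propd",
            "permutations": 100,
            "contrasts": [
                {"id": "treatment", "type": "anova", "column": "treatment"}
            ],
        },
    ]
}


def expand_runs(cfg):
    """Flatten the nested config into (method, contrast id, enrichment) tuples."""
    runs = []
    for model in cfg["models"]:
        # A model without an "enrichment" list still yields one
        # differential-only run, marked with None.
        enrichments = model.get("enrichment") or [None]
        for contrast in model.get("contrasts", []):
            for enrichment in enrichments:
                runs.append((model["method"], contrast["id"], enrichment))
    return runs


runs = expand_runs(config)
for run in runs:
    print(run)
```

Under these assumptions the limma model expands to two runs (one per enrichment method) and the propd model to a single differential-only run.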

grst avatar Dec 10 '24 12:12 grst

I'm wondering if it wouldn't be more convenient to specify everything in yaml format?

I don't have a strong preference between YAML and CSV format. However, merging the contrasts with the toolsheet into one file could become tricky. This is because, when there are many methods available, it is nice to have a 'default' toolsheet as a place to specify all the possible combinations of tools that really make sense to be together from the theoretical perspective. This file will always be there, in the pipeline github. Whereas the contrast file is data specific.

suzannejin avatar Dec 10 '24 16:12 suzannejin

it is nice to have a 'default' toolsheet as a place to specify all the possible combinations of tools that really make sense to be together from the theoretical perspective

What are the implications of this? Would you fail the pipeline if a user specifies an "invalid" combination?

grst avatar Dec 10 '24 16:12 grst

What are the implications of this? Would you fail the pipeline if a user specifies an "invalid" combination?

Don't have a plan for that yet, but one option is to raise a warning that it is a non-tested combination.
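A warning of that kind could look roughly like the Python sketch below. The set of tested combinations and the function name check_combination are hypothetical, and the real check would live in the pipeline's Nextflow code.

```python
import warnings

# Hypothetical set of combinations listed in the default toolsheet.
TESTED_COMBINATIONS = {("deseq2", "gsea"), ("limma", "gprofiler2")}


def check_combination(diff_method, func_method, tested=TESTED_COMBINATIONS):
    """Return True for a tested combination; otherwise warn and return False.

    The run is not aborted: an untested combination only triggers a warning.
    """
    if (diff_method, func_method) in tested:
        return True
    warnings.warn(
        f"Combination {diff_method} + {func_method} is not in the default "
        "toolsheet and has not been tested.",
        UserWarning,
    )
    return False


check_combination("deseq2", "gsea")  # tested: no warning
check_combination("limma", "gsea")   # untested: emits a UserWarning
```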

Indeed, for benchmark users, we considered the possibility of providing an extra toolsheet with all the rows one wants to benchmark.

suzannejin avatar Dec 10 '24 17:12 suzannejin

I'm also afraid that all the parameters for a differentialabundance run get scattered across too many places... nextflow params, contrasts file, toolsheet file, samplesheet... I'd rather reduce the number of places where to specify parameters.

This is also a concern for us... but for the moment we have not found a better solution. It would be nice to brainstorm at some point, and you are very welcome to contribute if you find a better way :)

suzannejin avatar Dec 10 '24 17:12 suzannejin

This file will always be there, in the pipeline github. Whereas the contrast file is data specific.

Just to clarify again, this will only be in the pipeline and the user specifies the combination of tools using standard params, e.g. --pathway deseq2_gsea,limma_gprofiler2? Or will this be an additional input file for the user?

grst avatar Dec 11 '24 06:12 grst

Just to clarify again, this will only be in the pipeline and the user specifies the combination of tools using standard params, e.g. --pathway deseq2_gsea,limma_gprofiler2? Or will this be an additional input file for the user?

We defined tools = "${projectDir}/assets/tools_samplesheet.csv" in nextflow.config. In theory, users should not need to provide any additional toolsheet to run the pipeline, but we also don't want to stop users from doing so. Hence, one can still point the tools path to a custom toolsheet at their own risk. Do you think this will be a problem?

suzannejin avatar Dec 11 '24 09:12 suzannejin

No, it's all good then. All I wanted to know is that in a standard pipeline run, the user wouldn't be required to specify yet another config file.

As you said, we should still think about how to reduce the number of places where to specify parameters, but that's a topic for a separate issue.

grst avatar Dec 11 '24 09:12 grst

Here I created a meta issue with all the steps/sub-issues needed to achieve what we agreed to do. Let me know what you think and if you would add/modify anything :)

CC @mirpedrol @bjlang @JoseEspinosa @pinin4fjords @WackerO

suzannejin avatar Dec 11 '24 15:12 suzannejin

I'm wondering if it wouldn't be more convenient to specify everything in yaml format?

Since the tool sheet will be read with nf-schema, it can accept both CSV and YAML, so a user could use the one that is more convenient for them.

mirpedrol avatar Dec 11 '24 15:12 mirpedrol

I'm wondering if it wouldn't be more convenient to specify everything in yaml format?

Actually @mirpedrol, if it is in YAML format, does that mean it would be more flexible and better allow the definition of optional methods/params?

suzannejin avatar Dec 12 '24 15:12 suzannejin

I would say they are equivalent if we use simple YAML (without nesting); which one is easier to type comes down to user preference.
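The equivalence can be illustrated with a small Python sketch: a CSV toolsheet and a flat YAML list of mappings yield the same records. The column names are assumptions, and parse_flat_yaml is a toy parser written for this example that only handles un-nested "- key: value" lists; it stands in for a real YAML parser and is not how nf-schema reads files.

```python
import csv
import io

# Hypothetical toolsheet in both formats; column names are illustrative.
CSV_TEXT = """pathway_name,diff_method,func_method
deseq2_gsea,deseq2,gsea
limma_gprofiler2,limma,gprofiler2
"""

YAML_TEXT = """
- pathway_name: deseq2_gsea
  diff_method: deseq2
  func_method: gsea
- pathway_name: limma_gprofiler2
  diff_method: limma
  func_method: gprofiler2
"""


def parse_flat_yaml(text):
    """Toy parser for a YAML list of flat mappings (no nesting, no quoting)."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            # A leading dash starts a new record.
            records.append({})
            line = line[2:]
        key, _, value = line.partition(":")
        records[-1][key.strip()] = value.strip()
    return records


csv_records = [dict(row) for row in csv.DictReader(io.StringIO(CSV_TEXT))]
yaml_records = parse_flat_yaml(YAML_TEXT)
print(csv_records == yaml_records)
```

As soon as nesting is introduced (for example a list of enrichment methods per row), the YAML form can no longer be expressed as a simple CSV row.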

mirpedrol avatar Dec 13 '24 07:12 mirpedrol

@suzannejin, could you please summarize where we stand now that the last PR has been merged, and what the next steps are? @atrigila mentioned that the toolsheet is currently limited to a single row. Support for iterating over the toolsheet would be really valuable for us for the functional analyses.

grst avatar Jun 05 '25 14:06 grst

Hello @grst, so far we have completed:

  • [x] #383

  • [x] #423

The next step would be:

  • [ ] #422

suzannejin avatar Jun 10 '25 07:06 suzannejin