rnaseq icon indicating copy to clipboard operation
rnaseq copied to clipboard

Workflow output definition

Open bentsherman opened this issue 11 months ago • 21 comments

This PR adds a workflow output definition based on https://github.com/nextflow-io/nextflow/pull/4784. I'm still working through the pipeline, but once I'm done, I will have completely replaced publishDir using the output DSL.

See also https://github.com/nf-core/fetchngs/pull/275 for ongoing discussion

bentsherman avatar Feb 28 '24 22:02 bentsherman

nf-core lint overall result: Passed :white_check_mark: :warning:

Posted for pipeline commit 783ff86

+| ✅ 170 tests passed       |+
#| ❔   7 tests were ignored |#
!| ❗   7 tests had warnings |!

:heavy_exclamation_mark: Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!

:grey_question: Tests ignored:

  • files_exist - File is ignored: conf/modules.config
  • nextflow_config - Config default ignored: params.ribo_database_manifest
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml
  • multiqc_config - 'assets/multiqc_config.yml' not found

:white_check_mark: Tests passed:

Run details

  • nf-core/tools version 2.13
  • Run at 2024-02-28 22:32:03

github-actions[bot] avatar Feb 28 '24 22:02 github-actions[bot]

Looking at this PR, we would be moving the logic of the publishing from the configuration to the pipeline. I'm quite happy with that, but:

  • I know some people like publishing being a property of configuration (not convinced myself).
  • Could this solution be part of the emit block to create one place for outputs?
  • Could a .publish() operator to achieve the same thing but with a simpler syntax?

adamrtalbot avatar Mar 05 '24 12:03 adamrtalbot

OK I've read this comment which has made it more clear.

We add a second structure to Nextflow, the output block, which includes the definitions about what should be published where. This falls outside of processes and workflows which makes it more flexible. The original idea was to use topics but that was imprecise so process selectors are being used instead.

Taking a look again, I think the general idea is sound but I'm not sold on the syntax. I tried to rewrite it the other way around but ended up just re-inventing publishDir but worse 😆

output {
    select: [
        '.*:QUANTIFY_STAR_SALMON:SALMON_QUANT',
        '.*:QUANTIFY_STAR_SALMON:CUSTOM_TX2GENE',
        '.*:QUANTIFY_STAR_SALMON:TXIMETA_TXIMPORT',
        '.*:QUANTIFY_STAR_SALMON:SE_.*'
    ], path: "${params.outdir}/${params.aligner}", mode: params.publish_dir_mode
    select: 'NFCORE_RNASEQ:RNASEQ:SAMTOOLS_SORT'            , pattern: '*.bam', path: "${params.outdir}/${params.aligner}"              , enabled: { params.save_align_intermeds || params.save_umi_intermeds }
    select: 'NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON', pattern: '*.log', path: "${params.outdir}/${params.aligner}/umitools/logs", enabled: { params.save_align_intermeds || params.save_umi_intermeds }
    select: 'NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON', pattern: '*.bam', path: "${params.outdir}/${params.aligner}"              , enabled: { params.save_align_intermeds || params.save_umi_intermeds }
}

adamrtalbot avatar Mar 05 '24 14:03 adamrtalbot

@adamrtalbot check out the rnaseq PR that I linked above, it is less verbose than what you did. The syntax has evolved somewhat since Paolo's original proposal. Even so, the workflow outputs for rnaseq will be complicated no matter how you slice it.

bentsherman avatar Mar 05 '24 21:03 bentsherman

Liking this. If we really need the output block (rather than doing something with emit), this is a nice readable way of doing it.

pinin4fjords avatar Apr 04 '24 09:04 pinin4fjords

This is beginning to look great. All the publishing logic is in one location, easy to review and understand where it's coming from. There are two downsides to this approach:

  1. You need to track back to the channel to find what's in there, which could be a little tricky.
  2. It's quite verbose (there's a lot of text in one place). But then I would prefer explicit and verbose to implicit and concise.

adamrtalbot avatar Apr 04 '24 10:04 adamrtalbot

This is beginning to look great. All the publishing logic is in one location, easy to review and understand where it's coming from. There are two downsides to this approach:

1. You need to track back to the channel to find what's in there, which could be a little tricky.

2. It's quite verbose (there's a lot of text in one place). But then I would prefer explicit and verbose to implicit and concise.

Agreeing with Adam, it's a bit too implicit, especially what is a path what is a topic

maxulysse avatar Apr 04 '24 10:04 maxulysse

In the Nextflow PR there are some docs which explain the feature in more detail. Unfortunately the deploy preview isn't working so you'll have to look at the diff

You need to track back to the channel to find what's in there, which could be a little tricky.

Indeed this is the downside of selecting channels instead of processes. More flexible but more layers of indirection. We should be able to alleviate this with IDE tooling, i.e. hover over a selected channel to see it's definition

If we really need the output block (rather than doing something with emit), this is a nice readable way of doing it

Thanks @pinin4fjords , I never responded to your idea about putting everything in the emit section, but basically I think that would be way too cumbersome, imagine trying to fit the rnaseq outputs into the emits 😅

The main question now is, how to bring the outputs for PREPARE_GENOME and RNASEQ up to the top-level workflow? I was thinking some kind of include statement, otherwise we would have to pass a LOT of channels up through emits and/or topics.

bentsherman avatar Apr 04 '24 12:04 bentsherman

The current prototype simply maps the output channels to the publish directory structure, but we still need to get these outputs to the top level whereas currently they are nested under NFCORE_RNASEQ:...

Before I go off and add a gajillion channels to the emit section, I'd like to see if I can simplify things with topics.

@adamrtalbot @pinin4fjords @maxulysse @ewels Since you guys understand this pipeline better than me, I'm wondering, how would you group all of these outputs if you could group them any way you want? You are no longer restricted to process selectors or directory names, but you could use those if you wanted.

For example, I see the modules config for RNASEQ is grouped with these comments:

  • STAR Salmon alignment
  • General alignment
  • bigwig coverage
  • DESeq2 QC
  • Pseudo-alignment

Would those be good top-level groupings for outputs? Then you might have topics called align-star-salmon, align, bigwig, deseq2, etc. Or would you organize it differently?

bentsherman avatar Apr 08 '24 20:04 bentsherman

I managed to move everything to the top-level workflow, so it should be executable now (though there are likely some bugs, will test tomorrow).

I ended up using topics for everything, using the various publish directories to guide the topic names. Hope this gives you a more concrete sense of how topics are useful.

The topics don't really reduce the amount of code, they just split it between the output DSL and the workflow topic: section. In a weird way, this provides some modularity, since workflows can define some ontology of topics which can in turn be used by the output DSL for publishing.

bentsherman avatar Apr 09 '24 04:04 bentsherman

As Evan mentioned on Slack, this does seem very verbose:

    QUANTIFY_STAR_SALMON.out.results                        >> 'align'
    QUANTIFY_STAR_SALMON.out.tpm_gene                       >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_gene                    >> 'align'
    QUANTIFY_STAR_SALMON.out.lengths_gene                   >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_gene_length_scaled      >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_gene_scaled             >> 'align'
    QUANTIFY_STAR_SALMON.out.tpm_transcript                 >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_transcript              >> 'align'
    QUANTIFY_STAR_SALMON.out.lengths_transcript             >> 'align'

But I understand why, since if even one of the outputs from a process needs to go to a different topic then you can't use the multi-channel object QUANTIFY_STAR_SALMON.out.

Rather than doing this from the calling workflow, could e.g. QUANTIFY_STAR_SALMON use a topic as part of its emit, to 'suggest' a classification for that channel?

emit:
    results = ch_pseudo_results, topic = 'tables'                             

Then, if we all used good standards (e.g. an ontology for topics for outputs), calling workflows could have very minimal logic for this, relying on what the components said about their outputs. The calling workflow would only need to decide what to do with the topics in its outputs.

pinin4fjords avatar Apr 09 '24 09:04 pinin4fjords

We can definitely move some of these topic mappings into the modules and subworkflows, that was going to be my next step. I also suspect that nf-core will be able to converge on a shared ontology for these things.

I'd still rather keep the topic mapping separate from the emits though, as we will need the topic: section either way and we're trying to minimize the number of ways to do the same thing

bentsherman avatar Apr 09 '24 14:04 bentsherman

I moved most of the topic mappings into their respective subworkflows. It gets tricky when a workflow is used multiple times under a different name and with different publish behavior.

For example, QUANTIFY_PSEUDO_ALIGNMENT is used twice in RNASEQ, once as itself and once as the alias QUANTIFY_STAR_SALMON. One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".

I can't set a "sensible default" in the subworkflow because I can't override the default later, I can only specify additional topics. Or I could specify a default and not use it in the output definition for rnaseq, instead re-mapping each alias to different topics as I am currently doing.

However, keeping the topic mappings in the RNASEQ workflow is also tricky because the process/workflow might not be executed, in which case the topic mapping will fail. We might need to replicate the control flow in the topic: section:

  topic:
  if{ !params.skip_alignment && params.aligner == 'star_rsem' ) {
    DESEQ2_QC_RSEM.out.rdata                >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.pca_txt              >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.pdf                  >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.dists_txt            >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.size_factors         >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.log                  >> 'align-deseq2'
  }

Totally doable, but unfortunate if we have to resort to it

@adamrtalbot noted in Slack that most Nextflow pipelines don't come close to this level of complexity, so I wouldn't be opposed to moving forward with what we have and let the rnaseq maintainers sort out the details. Though we do need to address the last point about conditional topic mappings

bentsherman avatar Apr 09 '24 21:04 bentsherman

I'm loving the principle:

    "${params.aligner}" {
        'log' {
            from 'align-star-log'
        }

        from 'align-star-intermeds'

        'unmapped' {
            from 'align-star-unaligned'
        }
    }

The multi import thing didn't occur to me. Could we use a variable sent in via the meta or somesuch to control the topic something gets sent to?

pinin4fjords avatar Apr 10 '24 08:04 pinin4fjords

For example, QUANTIFY_PSEUDO_ALIGNMENT is used twice in RNASEQ, once as itself and once as the alias QUANTIFY_STAR_SALMON. One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".

In this case I would manipulate the channel to what I wanted. If I had to use a topic I would use them at the last second. So again, as long as topics are optional I think everything can be handled reasonably well.

However, keeping the topic mappings in the RNASEQ workflow is also tricky because the process/workflow might not be executed, in which case the topic mapping will fail. We might need to replicate the control flow in the topic: section:

Presumably, if a topic is empty it just doesn't publish anything? So you could add stuff from an empty channel and you would end up with an empty topic. In your example, it would make more sense to fix the rnaseq code so it doesn't rely on lots of if statements, which would end up looking like this:

  topic:
  deseq2_qc_rdata        >> 'align-deseq2'
  deseq2_qc_pca_txt      >> 'align-deseq2'
  deseq2_qc_pdf          >> 'align-deseq2'
  deseq2_qc_dists_txt    >> 'align-deseq2'
  deseq2_qc_size_factors >> 'align-deseq2'
  deseq2_qc_log          >> 'align-deseq2'

Even better, just tidy up the channels before making the topic:

  topic:
  deseq2_qc >> 'align-deseq2'

I think my overall impression is topics are a nice sugar on top of existing channels, in which case most of the key logic should be in the channel manipulations. Topics are a way of turning a local channel into a global one and should do very little else.

One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".

That sounds like a bug 😆

adamrtalbot avatar Apr 10 '24 11:04 adamrtalbot

Notes on the latest update:

  • Topics are no longer used. Nextflow simply maintains a global map of channels to "rules" under the hood

  • The output DSL is no longer a potentially nested directory structure, it's just a flat list of rules. Each rule can specify publish options for channels that are sent to the rule

  • In principle, the rule name can be anything. In practice, it is convenient to make it the default publish path. If you're happy with that, you don't need to configure anything else and Nextflow will use it as the publish path

  • Processes and workflows can have a publish: section to define these mappings. A process can map emits to rules, a workflow can map channels to rules

  • The output DSL is used only to (1) set the output directory, (2) set default publish options like mode, (3) customize rules as needed

  • In general, rules need to be customized only when the path should be different or additional options like enabled are needed. If you can align your output directory with the module/workflow defaults, your output definition can be quite short (see fetchngs)

  • If a process maps some emits to some rules and then is invoked by a workflow, the workflow can re-map the process outputs to different rules and overwrite the process defaults, and so on with workflows and subworkflows, etc

Overall, everything is much more concise and more in line with what many people have suggested, to simply annotate the workflows with the publish paths. The output definition is no longer a comprehensive view of all outputs, but there is a degree of modularity, and you can be verbose in the output definition if you want to.

@adamrtalbot thanks for your comments, makes me feel more confident about the prototype. I think all of the remaining TODOs can be addressed by refactoring some dataflow logic, it can be handled by the rnaseq devs.

bentsherman avatar Apr 11 '24 19:04 bentsherman

Really liking the way this is going now, it's going to be very tidy.

Would it be feasible at some point to use some optional dynamism in the modules, to facilitate repeated usage?

    publish:
    ch_orig_bam         >> "star_salmon/intermeds/${meta.publish_suffix}/"

pinin4fjords avatar Apr 12 '24 08:04 pinin4fjords

Would it be feasible at some point to use some optional dynamism in the modules, to facilitate repeated usage?

Maybe in a future iteration. But related to this, we are interested in building on the concept of the samplesheet as a way to hold metadata for file collections in general, and it might be a better practice than trying to encode metadata in the filenames.

For example Paolo has proposed that we have a method in the output DSL to automatically publish an index file for a given "rule":

output {
  directory 'results'

  'star_salmon/intermeds/' {
    index 'index.csv'
  }
}

star_salmon/intermeds/index.csv

sample_id,bam
sample001,results/star_salmon/intermeds/sample001.bam
sample002,results/star_salmon/intermeds/sample002.bam
sample003,results/star_salmon/intermeds/sample003.bam

Of course you could also do this manually like in fetchngs, and I would like to add a stdlib function like mergeCsv to make it easier, but the index method would be a convenient solution for the most common and simple cases. Either way, you can just query the index file instead of inspecting the file names.

bentsherman avatar Apr 12 '24 13:04 bentsherman

The redirect to null simplifies the top-level publish def somewhat. The remaining rules could also be moved into the workflow defs since they only rename paths. It just might be more verbose since you would have to remap each channel instead of the target name.

It seems like the best delineation for what goes in the top-level publish block vs the workflow publish sections is, the workflows define what is published (including conditional logic) while the top-level publish def should define how things are being published (mode, whether to overwrite, content type, tags, etc). This is also good for modularity.

Note that some subworkflows are now using params which is an anti-pattern. For this I recommend passing those params as workflow inputs to keep things modular.

bentsherman avatar Apr 24 '24 22:04 bentsherman

Note that some subworkflows are now using params which is an anti-pattern. For this I recommend passing those params as workflow inputs to keep things modular.

We have been trying to eliminate that when we see it

pinin4fjords avatar Apr 25 '24 08:04 pinin4fjords