rnaseq
Workflow output definition
This PR adds a workflow output definition based on https://github.com/nextflow-io/nextflow/pull/4784. I'm still working through the pipeline, but once I'm done, I will have completely replaced publishDir using the output DSL.
See also https://github.com/nf-core/fetchngs/pull/275 for ongoing discussion
nf-core lint
overall result: Passed :white_check_mark: :warning:
Posted for pipeline commit 783ff86
+| ✅ 170 tests passed |+
#| ❔ 7 tests were ignored |#
!| ❗ 7 tests had warnings |!
:heavy_exclamation_mark: Test warnings:
- files_exist - File not found: assets/multiqc_config.yml
- files_exist - File not found: .github/workflows/awstest.yml
- files_exist - File not found: .github/workflows/awsfulltest.yml
- pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
- pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
- pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
- pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
:grey_question: Tests ignored:
- files_exist - File is ignored: conf/modules.config
- nextflow_config - Config default ignored: params.ribo_database_manifest
- files_unchanged - File ignored due to lint config: assets/email_template.html
- files_unchanged - File ignored due to lint config: assets/email_template.txt
- files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
- actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml
- multiqc_config - 'assets/multiqc_config.yml' not found
:white_check_mark: Tests passed:
-
files_exist - File found:
.gitattributes
-
files_exist - File found:
.gitignore
-
files_exist - File found:
.nf-core.yml
-
files_exist - File found:
.editorconfig
-
files_exist - File found:
.prettierignore
-
files_exist - File found:
.prettierrc.yml
-
files_exist - File found:
CHANGELOG.md
-
files_exist - File found:
CITATIONS.md
-
files_exist - File found:
CODE_OF_CONDUCT.md
-
files_exist - File found:
LICENSE or LICENSE.md or LICENCE or LICENCE.md
-
files_exist - File found:
nextflow_schema.json
-
files_exist - File found:
nextflow.config
-
files_exist - File found:
README.md
-
files_exist - File found:
.github/.dockstore.yml
-
files_exist - File found:
.github/CONTRIBUTING.md
-
files_exist - File found:
.github/ISSUE_TEMPLATE/bug_report.yml
-
files_exist - File found:
.github/ISSUE_TEMPLATE/config.yml
-
files_exist - File found:
.github/ISSUE_TEMPLATE/feature_request.yml
-
files_exist - File found:
.github/PULL_REQUEST_TEMPLATE.md
-
files_exist - File found:
.github/workflows/branch.yml
-
files_exist - File found:
.github/workflows/ci.yml
-
files_exist - File found:
.github/workflows/linting_comment.yml
-
files_exist - File found:
.github/workflows/linting.yml
-
files_exist - File found:
assets/email_template.html
-
files_exist - File found:
assets/email_template.txt
-
files_exist - File found:
assets/sendmail_template.txt
-
files_exist - File found:
assets/nf-core-rnaseq_logo_light.png
-
files_exist - File found:
conf/test.config
-
files_exist - File found:
conf/test_full.config
-
files_exist - File found:
docs/images/nf-core-rnaseq_logo_light.png
-
files_exist - File found:
docs/images/nf-core-rnaseq_logo_dark.png
-
files_exist - File found:
docs/output.md
-
files_exist - File found:
docs/README.md
-
files_exist - File found:
docs/README.md
-
files_exist - File found:
docs/usage.md
-
files_exist - File found:
main.nf
-
files_exist - File found:
conf/base.config
-
files_exist - File found:
conf/igenomes.config
-
files_exist - File found:
modules.json
-
files_exist - File found:
pyproject.toml
-
files_exist - File not found check:
Singularity
-
files_exist - File not found check:
parameters.settings.json
-
files_exist - File not found check:
pipeline_template.yml
-
files_exist - File not found check:
.nf-core.yaml
-
files_exist - File not found check:
bin/markdown_to_html.r
-
files_exist - File not found check:
conf/aws.config
-
files_exist - File not found check:
.github/workflows/push_dockerhub.yml
-
files_exist - File not found check:
.github/ISSUE_TEMPLATE/bug_report.md
-
files_exist - File not found check:
.github/ISSUE_TEMPLATE/feature_request.md
-
files_exist - File not found check:
docs/images/nf-core-rnaseq_logo.png
-
files_exist - File not found check:
.markdownlint.yml
-
files_exist - File not found check:
.yamllint.yml
-
files_exist - File not found check:
lib/Checks.groovy
-
files_exist - File not found check:
lib/Completion.groovy
-
files_exist - File not found check:
lib/Workflow.groovy
-
files_exist - File not found check:
lib/Utils.groovy
-
files_exist - File not found check:
lib/WorkflowMain.groovy
-
files_exist - File not found check:
lib/NfcoreTemplate.groovy
-
files_exist - File not found check:
lib/WorkflowRnaseq.groovy
-
files_exist - File not found check:
lib/nfcore_external_java_deps.jar
-
files_exist - File not found check:
.travis.yml
-
nextflow_config - Config variable found:
manifest.name
-
nextflow_config - Config variable found:
manifest.nextflowVersion
-
nextflow_config - Config variable found:
manifest.description
-
nextflow_config - Config variable found:
manifest.version
-
nextflow_config - Config variable found:
manifest.homePage
-
nextflow_config - Config variable found:
timeline.enabled
-
nextflow_config - Config variable found:
trace.enabled
-
nextflow_config - Config variable found:
report.enabled
-
nextflow_config - Config variable found:
dag.enabled
-
nextflow_config - Config variable found:
process.cpus
-
nextflow_config - Config variable found:
process.memory
-
nextflow_config - Config variable found:
process.time
-
nextflow_config - Config variable found:
params.outdir
-
nextflow_config - Config variable found:
params.input
-
nextflow_config - Config variable found:
params.validationShowHiddenParams
-
nextflow_config - Config variable found:
params.validationSchemaIgnoreParams
-
nextflow_config - Config variable found:
manifest.mainScript
-
nextflow_config - Config variable found:
timeline.file
-
nextflow_config - Config variable found:
trace.file
-
nextflow_config - Config variable found:
report.file
-
nextflow_config - Config variable found:
dag.file
-
nextflow_config - Config variable (correctly) not found:
params.nf_required_version
-
nextflow_config - Config variable (correctly) not found:
params.container
-
nextflow_config - Config variable (correctly) not found:
params.singleEnd
-
nextflow_config - Config variable (correctly) not found:
params.igenomesIgnore
-
nextflow_config - Config variable (correctly) not found:
params.name
-
nextflow_config - Config variable (correctly) not found:
params.enable_conda
- nextflow_config - Config timeline.enabled had correct value: true
- nextflow_config - Config report.enabled had correct value: true
- nextflow_config - Config trace.enabled had correct value: true
- nextflow_config - Config dag.enabled had correct value: true
- nextflow_config - Config manifest.name began with nf-core/
- nextflow_config - Config variable manifest.homePage began with https://github.com/nf-core/
- nextflow_config - Config dag.file ended with .html
- nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
- nextflow_config - Config manifest.version ends in dev: 3.15.0dev
- nextflow_config - Config params.custom_config_version is set to master
- nextflow_config - Config params.custom_config_base is set to https://raw.githubusercontent.com/nf-core/configs/master
- nextflow_config - Lines for loading custom profiles found
- nextflow_config - nextflow.config contains configuration profile test
- nextflow_config - Config default value correct: params.hisat2_build_memory= 200.GB
- nextflow_config - Config default value correct: params.gtf_extra_attributes= gene_name
- nextflow_config - Config default value correct: params.gtf_group_features= gene_id
- nextflow_config - Config default value correct: params.featurecounts_group_type= gene_biotype
- nextflow_config - Config default value correct: params.featurecounts_feature_type= exon
- nextflow_config - Config default value correct: params.igenomes_base= s3://ngi-igenomes/igenomes
- nextflow_config - Config default value correct: params.trimmer= trimgalore
- nextflow_config - Config default value correct: params.min_trimmed_reads= 10000
- nextflow_config - Config default value correct: params.umitools_extract_method= string
- nextflow_config - Config default value correct: params.umitools_grouping_method= directional
- nextflow_config - Config default value correct: params.aligner= star_salmon
- nextflow_config - Config default value correct: params.pseudo_aligner_kmer_size= 31
- nextflow_config - Config default value correct: params.min_mapped_reads= 5.0
- nextflow_config - Config default value correct: params.kallisto_quant_fraglen= 200
- nextflow_config - Config default value correct: params.kallisto_quant_fraglen_sd= 200
- nextflow_config - Config default value correct: params.deseq2_vst= true
- nextflow_config - Config default value correct: params.rseqc_modules= bam_stat,inner_distance,infer_experiment,junction_annotation,junction_saturation,read_distribution,read_duplication
- nextflow_config - Config default value correct: params.skip_bbsplit= true
- nextflow_config - Config default value correct: params.skip_preseq= true
- nextflow_config - Config default value correct: params.custom_config_version= master
- nextflow_config - Config default value correct: params.custom_config_base= https://raw.githubusercontent.com/nf-core/configs/master
- nextflow_config - Config default value correct: params.max_cpus= 16
- nextflow_config - Config default value correct: params.max_memory= 128.GB
- nextflow_config - Config default value correct: params.max_time= 240.h
- nextflow_config - Config default value correct: params.publish_dir_mode= copy
- nextflow_config - Config default value correct: params.max_multiqc_email_size= 25.MB
- nextflow_config - Config default value correct: params.validate_params= true
- files_unchanged - .gitattributes matches the template
- files_unchanged - .prettierrc.yml matches the template
- files_unchanged - CODE_OF_CONDUCT.md matches the template
- files_unchanged - LICENSE matches the template
- files_unchanged - .github/.dockstore.yml matches the template
- files_unchanged - .github/CONTRIBUTING.md matches the template
- files_unchanged - .github/ISSUE_TEMPLATE/bug_report.yml matches the template
- files_unchanged - .github/ISSUE_TEMPLATE/config.yml matches the template
- files_unchanged - .github/ISSUE_TEMPLATE/feature_request.yml matches the template
- files_unchanged - .github/PULL_REQUEST_TEMPLATE.md matches the template
- files_unchanged - .github/workflows/branch.yml matches the template
- files_unchanged - .github/workflows/linting_comment.yml matches the template
- files_unchanged - .github/workflows/linting.yml matches the template
- files_unchanged - assets/sendmail_template.txt matches the template
- files_unchanged - assets/nf-core-rnaseq_logo_light.png matches the template
- files_unchanged - docs/images/nf-core-rnaseq_logo_light.png matches the template
- files_unchanged - docs/images/nf-core-rnaseq_logo_dark.png matches the template
- files_unchanged - docs/README.md matches the template
- actions_ci - '.github/workflows/ci.yml' is triggered on expected events
- actions_ci - '.github/workflows/ci.yml' checks minimum NF version
- readme - README Nextflow minimum version badge matched config. Badge: 23.04.0, Config: 23.04.0
- readme - README Zenodo placeholder was replaced with DOI.
- pipeline_name_conventions - Name adheres to nf-core convention
- template_strings - Did not find any Jinja template strings (531 files)
- schema_lint - Schema lint passed
- schema_lint - Schema title + description lint passed
- schema_lint - Input mimetype lint passed: 'text/csv'
- schema_params - Schema matched params returned from nextflow config
- system_exit - No System.exit calls found
- actions_schema_validation - Workflow validation passed: branch.yml
- actions_schema_validation - Workflow validation passed: cloud_tests_small.yml
- actions_schema_validation - Workflow validation passed: linting.yml
- actions_schema_validation - Workflow validation passed: ci.yml
- actions_schema_validation - Workflow validation passed: clean-up.yml
- actions_schema_validation - Workflow validation passed: fix-linting.yml
- actions_schema_validation - Workflow validation passed: linting_comment.yml
- actions_schema_validation - Workflow validation passed: download_pipeline.yml
- actions_schema_validation - Workflow validation passed: cloud_tests_full.yml
- actions_schema_validation - Workflow validation passed: release-announcements.yml
- merge_markers - No merge markers found in pipeline files
- modules_json - Only installed modules found in modules.json
- modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'
Run details
- nf-core/tools version 2.13
- Run at 2024-02-28 22:32:03
Looking at this PR, we would be moving the logic of the publishing from the configuration to the pipeline. I'm quite happy with that, but:
- I know some people like publishing being a property of configuration (not convinced myself).
- Could this solution be part of the emit block to create one place for outputs?
- Could a .publish() operator achieve the same thing but with a simpler syntax?
OK I've read this comment which has made it more clear.
We add a second structure to Nextflow, the output block, which includes the definitions about what should be published where. This falls outside of processes and workflows which makes it more flexible. The original idea was to use topics but that was imprecise so process selectors are being used instead.
Taking a look again, I think the general idea is sound but I'm not sold on the syntax. I tried to rewrite it the other way around but ended up just re-inventing publishDir but worse 😆
output {
select: [
'.*:QUANTIFY_STAR_SALMON:SALMON_QUANT',
'.*:QUANTIFY_STAR_SALMON:CUSTOM_TX2GENE',
'.*:QUANTIFY_STAR_SALMON:TXIMETA_TXIMPORT',
'.*:QUANTIFY_STAR_SALMON:SE_.*'
], path: "${params.outdir}/${params.aligner}", mode: params.publish_dir_mode
select: 'NFCORE_RNASEQ:RNASEQ:SAMTOOLS_SORT' , pattern: '*.bam', path: "${params.outdir}/${params.aligner}" , enabled: { params.save_align_intermeds || params.save_umi_intermeds }
select: 'NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON', pattern: '*.log', path: "${params.outdir}/${params.aligner}/umitools/logs", enabled: { params.save_align_intermeds || params.save_umi_intermeds }
select: 'NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON', pattern: '*.bam', path: "${params.outdir}/${params.aligner}" , enabled: { params.save_align_intermeds || params.save_umi_intermeds }
}
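To make the selector idea concrete, here is a minimal Python sketch (not Nextflow code) of how regex selectors like the ones above could be matched against fully qualified process names. It assumes a selector has to match the whole name. The entries in `names` are hypothetical examples chosen for illustration, and `is_selected` is a made-up helper.

```python
import re

# Selector patterns copied from the snippet above.
SELECTORS = [
    r".*:QUANTIFY_STAR_SALMON:SALMON_QUANT",
    r".*:QUANTIFY_STAR_SALMON:CUSTOM_TX2GENE",
    r".*:QUANTIFY_STAR_SALMON:TXIMETA_TXIMPORT",
    r".*:QUANTIFY_STAR_SALMON:SE_.*",
]


def is_selected(process_name, selectors=SELECTORS):
    """Return True if any selector matches the whole process name."""
    return any(re.fullmatch(pattern, process_name) for pattern in selectors)


# Hypothetical fully qualified process names, for illustration only.
names = [
    "NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT",
    "NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SE_GENE",
    "NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:SALMON_QUANT",
]

# Only the processes running under the QUANTIFY_STAR_SALMON alias are kept.
matched = [name for name in names if is_selected(name)]
print(matched)
```

The third name runs the same module under a different workflow alias, so none of the selectors pick it up. That is the behaviour that lets one module publish to different places depending on where it is called.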
@adamrtalbot check out the rnaseq PR that I linked above, it is less verbose than what you did. The syntax has evolved somewhat since Paolo's original proposal. Even so, the workflow outputs for rnaseq will be complicated no matter how you slice it.
Liking this. If we really need the output block (rather than doing something with emit), this is a nice readable way of doing it.
This is beginning to look great. All the publishing logic is in one location, easy to review and understand where it's coming from. There are two downsides to this approach:
- You need to track back to the channel to find what's in there, which could be a little tricky.
- It's quite verbose (there's a lot of text in one place). But then I would prefer explicit and verbose to implicit and concise.
Agreeing with Adam, it's a bit too implicit, especially what is a path and what is a topic.
In the Nextflow PR there are some docs which explain the feature in more detail. Unfortunately the deploy preview isn't working so you'll have to look at the diff
You need to track back to the channel to find what's in there, which could be a little tricky.
Indeed this is the downside of selecting channels instead of processes. More flexible but more layers of indirection. We should be able to alleviate this with IDE tooling, i.e. hover over a selected channel to see its definition
If we really need the output block (rather than doing something with emit), this is a nice readable way of doing it
Thanks @pinin4fjords , I never responded to your idea about putting everything in the emit section, but basically I think that would be way too cumbersome, imagine trying to fit the rnaseq outputs into the emits 😅
The main question now is, how to bring the outputs for PREPARE_GENOME and RNASEQ up to the top-level workflow? I was thinking some kind of include statement, otherwise we would have to pass a LOT of channels up through emits and/or topics.
The current prototype simply maps the output channels to the publish directory structure, but we still need to get these outputs to the top level whereas currently they are nested under NFCORE_RNASEQ:...
Before I go off and add a gajillion channels to the emit section, I'd like to see if I can simplify things with topics.
@adamrtalbot @pinin4fjords @maxulysse @ewels Since you guys understand this pipeline better than me, I'm wondering, how would you group all of these outputs if you could group them any way you want? You are no longer restricted to process selectors or directory names, but you could use those if you wanted.
For example, I see the modules config for RNASEQ is grouped with these comments:
- STAR Salmon alignment
- General alignment
- bigwig coverage
- DESeq2 QC
- Pseudo-alignment
Would those be good top-level groupings for outputs? Then you might have topics called align-star-salmon, align, bigwig, deseq2, etc. Or would you organize it differently?
I managed to move everything to the top-level workflow, so it should be executable now (though there are likely some bugs, will test tomorrow).
I ended up using topics for everything, using the various publish directories to guide the topic names. Hope this gives you a more concrete sense of how topics are useful.
The topics don't really reduce the amount of code, they just split it between the output DSL and the workflow topic: section. In a weird way, this provides some modularity, since workflows can define some ontology of topics which can in turn be used by the output DSL for publishing.
As Evan mentioned on Slack, this does seem very verbose:
QUANTIFY_STAR_SALMON.out.results >> 'align'
QUANTIFY_STAR_SALMON.out.tpm_gene >> 'align'
QUANTIFY_STAR_SALMON.out.counts_gene >> 'align'
QUANTIFY_STAR_SALMON.out.lengths_gene >> 'align'
QUANTIFY_STAR_SALMON.out.counts_gene_length_scaled >> 'align'
QUANTIFY_STAR_SALMON.out.counts_gene_scaled >> 'align'
QUANTIFY_STAR_SALMON.out.tpm_transcript >> 'align'
QUANTIFY_STAR_SALMON.out.counts_transcript >> 'align'
QUANTIFY_STAR_SALMON.out.lengths_transcript >> 'align'
But I understand why, since if even one of the outputs from a process needs to go to a different topic then you can't use the multi-channel object QUANTIFY_STAR_SALMON.out.
Rather than doing this from the calling workflow, could e.g. QUANTIFY_STAR_SALMON use a topic as part of its emit, to 'suggest' a classification for that channel?
emit:
results = ch_pseudo_results, topic = 'tables'
Then, if we all used good standards (e.g. an ontology for topics for outputs), calling workflows could have very minimal logic for this, relying on what the components said about their outputs. The calling workflow would only need to decide what to do with the topics in its outputs.
We can definitely move some of these topic mappings into the modules and subworkflows, that was going to be my next step. I also suspect that nf-core will be able to converge on a shared ontology for these things.
I'd still rather keep the topic mapping separate from the emits though, as we will need the topic: section either way and we're trying to minimize the number of ways to do the same thing.
I moved most of the topic mappings into their respective subworkflows. It gets tricky when a workflow is used multiple times under a different name and with different publish behavior.
For example, QUANTIFY_PSEUDO_ALIGNMENT is used twice in RNASEQ, once as itself and once as the alias QUANTIFY_STAR_SALMON. One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".
I can't set a "sensible default" in the subworkflow because I can't override the default later, I can only specify additional topics. Or I could specify a default and not use it in the output definition for rnaseq, instead re-mapping each alias to different topics as I am currently doing.
However, keeping the topic mappings in the RNASEQ workflow is also tricky because the process/workflow might not be executed, in which case the topic mapping will fail. We might need to replicate the control flow in the topic: section:
topic:
if ( !params.skip_alignment && params.aligner == 'star_rsem' ) {
DESEQ2_QC_RSEM.out.rdata >> 'align-deseq2'
DESEQ2_QC_RSEM.out.pca_txt >> 'align-deseq2'
DESEQ2_QC_RSEM.out.pdf >> 'align-deseq2'
DESEQ2_QC_RSEM.out.dists_txt >> 'align-deseq2'
DESEQ2_QC_RSEM.out.size_factors >> 'align-deseq2'
DESEQ2_QC_RSEM.out.log >> 'align-deseq2'
}
Totally doable, but unfortunate if we have to resort to it
@adamrtalbot noted in Slack that most Nextflow pipelines don't come close to this level of complexity, so I wouldn't be opposed to moving forward with what we have and letting the rnaseq maintainers sort out the details. Though we do need to address the last point about conditional topic mappings
I'm loving the principle:
"${params.aligner}" {
'log' {
from 'align-star-log'
}
from 'align-star-intermeds'
'unmapped' {
from 'align-star-unaligned'
}
}
The multi import thing didn't occur to me. Could we use a variable sent in via the meta or somesuch to control the topic something gets sent to?
For example, QUANTIFY_PSEUDO_ALIGNMENT is used twice in RNASEQ, once as itself and once as the alias QUANTIFY_STAR_SALMON. One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".
In this case I would manipulate the channel to what I wanted. If I had to use a topic I would use them at the last second. So again, as long as topics are optional I think everything can be handled reasonably well.
However, keeping the topic mappings in the RNASEQ workflow is also tricky because the process/workflow might not be executed, in which case the topic mapping will fail. We might need to replicate the control flow in the topic: section:
Presumably, if a topic is empty it just doesn't publish anything? So you could add stuff from an empty channel and you would end up with an empty topic. In your example, it would make more sense to fix the rnaseq code so it doesn't rely on lots of if statements, which would end up looking like this:
topic:
deseq2_qc_rdata >> 'align-deseq2'
deseq2_qc_pca_txt >> 'align-deseq2'
deseq2_qc_pdf >> 'align-deseq2'
deseq2_qc_dists_txt >> 'align-deseq2'
deseq2_qc_size_factors >> 'align-deseq2'
deseq2_qc_log >> 'align-deseq2'
Even better, just tidy up the channels before making the topic:
topic:
deseq2_qc >> 'align-deseq2'
I think my overall impression is topics are a nice sugar on top of existing channels, in which case most of the key logic should be in the channel manipulations. Topics are a way of turning a local channel into a global one and should do very little else.
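As a rough illustration of that view, the following Python toy model (not Nextflow code) treats a topic as a global, named collection fed by local channels. It encodes the presumption from the comment above that an empty channel just yields an empty topic and nothing is published; whether Nextflow behaves this way is an assumption here. `TopicRegistry` and the file names are hypothetical.

```python
from collections import defaultdict


class TopicRegistry:
    """Toy model: a topic is a global, named collection fed by local channels."""

    def __init__(self):
        self._topics = defaultdict(list)

    def send(self, channel_items, topic):
        # Sending an empty channel is allowed and simply adds nothing.
        self._topics[topic].extend(channel_items)

    def files_to_publish(self, topic):
        # A topic nobody wrote to behaves like an empty topic.
        return list(self._topics.get(topic, []))


# Hypothetical channel contents; the file names are illustrative only.
deseq2_qc_pdf = ["deseq2.plots.pdf"]
deseq2_qc_log = []  # e.g. the step that would produce this did not run

# Tidy up the channels first, then map one combined channel to the topic.
deseq2_qc = deseq2_qc_pdf + deseq2_qc_log

registry = TopicRegistry()
registry.send(deseq2_qc, "align-deseq2")

print(registry.files_to_publish("align-deseq2"))
print(registry.files_to_publish("bigwig"))
```

In this model the conditional logic lives entirely in how the channels are built, and the topic mapping itself stays a single unconditional line.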
One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".
That sounds like a bug 😆
Notes on the latest update:
- Topics are no longer used. Nextflow simply maintains a global map of channels to "rules" under the hood.
- The output DSL is no longer a potentially nested directory structure, it's just a flat list of rules. Each rule can specify publish options for channels that are sent to the rule.
- In principle, the rule name can be anything. In practice, it is convenient to make it the default publish path. If you're happy with that, you don't need to configure anything else and Nextflow will use it as the publish path.
- Processes and workflows can have a publish: section to define these mappings. A process can map emits to rules, a workflow can map channels to rules.
- The output DSL is used only to (1) set the output directory, (2) set default publish options like mode, (3) customize rules as needed.
- In general, rules need to be customized only when the path should be different or additional options like enabled are needed. If you can align your output directory with the module/workflow defaults, your output definition can be quite short (see fetchngs).
- If a process maps some emits to some rules and then is invoked by a workflow, the workflow can re-map the process outputs to different rules and overwrite the process defaults, and so on with workflows and subworkflows, etc.
Overall, everything is much more concise and more in line with what many people have suggested, to simply annotate the workflows with the publish paths. The output definition is no longer a comprehensive view of all outputs, but there is a degree of modularity, and you can be verbose in the output definition if you want to.
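The precedence described in these notes can be sketched with a small Python model (not Nextflow code, and not how Nextflow implements it). It assumes that outer workflows win over inner ones, that the rule name doubles as the default publish path, and that the output definition may override the path or disable a rule. The function names, emit names, and rule names are all hypothetical.

```python
def resolve_rule(emit_name, process_defaults, *workflow_remaps):
    """Return the rule for an emit; later (outer) remaps win over earlier ones."""
    rule = process_defaults.get(emit_name)
    for remap in workflow_remaps:  # ordered from innermost to outermost workflow
        rule = remap.get(emit_name, rule)
    return rule


def publish_path(rule, output_dir, rule_options):
    """Use the rule name as the path unless the output definition customizes it."""
    options = rule_options.get(rule, {})
    if rule is None or not options.get("enabled", True):
        return None  # nothing is published
    return f"{output_dir}/{options.get('path', rule)}"


# Hypothetical mappings, for illustration only.
process_defaults = {"bam": "star_salmon/intermeds"}
subworkflow_remap = {}                 # the subworkflow keeps the process default
workflow_remap = {"bam": "align/bam"}  # the calling workflow overrides it

rule = resolve_rule("bam", process_defaults, subworkflow_remap, workflow_remap)

# With no customization, the rule name is the publish path.
print(publish_path(rule, "results", {}))
# The output definition can point the rule somewhere else or switch it off.
print(publish_path(rule, "results", {"align/bam": {"path": "star_salmon"}}))
print(publish_path(rule, "results", {"align/bam": {"enabled": False}}))
```

The point of the model is that most rules never appear in the output definition at all: they only need an entry when the path or an option such as enabled differs from the default.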
@adamrtalbot thanks for your comments, makes me feel more confident about the prototype. I think all of the remaining TODOs can be addressed by refactoring some dataflow logic, it can be handled by the rnaseq devs.
Really liking the way this is going now, it's going to be very tidy.
Would it be feasible at some point to use some optional dynamism in the modules, to facilitate repeated usage?
publish:
ch_orig_bam >> "star_salmon/intermeds/${meta.publish_suffix}/"
Maybe in a future iteration. But related to this, we are interested in building on the concept of the samplesheet as a way to hold metadata for file collections in general, and it might be a better practice than trying to encode metadata in the filenames.
For example Paolo has proposed that we have a method in the output DSL to automatically publish an index file for a given "rule":
output {
directory 'results'
'star_salmon/intermeds/' {
index 'index.csv'
}
}
star_salmon/intermeds/index.csv
sample_id,bam
sample001,results/star_salmon/intermeds/sample001.bam
sample002,results/star_salmon/intermeds/sample002.bam
sample003,results/star_salmon/intermeds/sample003.bam
Of course you could also do this manually like in fetchngs, and I would like to add a stdlib function like mergeCsv to make it easier, but the index method would be a convenient solution for the most common and simple cases. Either way, you can just query the index file instead of inspecting the file names.
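For a sense of what the manual route involves, here is a minimal Python sketch that writes an index in the format shown above and then queries it. It is not the proposed index method or the mergeCsv function, neither of which exists in the source beyond the description. `write_index` and `read_index` are hypothetical helpers, and the records reuse the sample rows from the example.

```python
import csv
import io


def write_index(records):
    """Render (sample_id, bam_path) records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", "bam"])
    writer.writerows(records)
    return buffer.getvalue()


def read_index(text):
    """Parse the index back into a {sample_id: bam_path} lookup."""
    return {row["sample_id"]: row["bam"] for row in csv.DictReader(io.StringIO(text))}


# Records carry the metadata explicitly, so nothing is parsed from file names.
records = [
    ("sample001", "results/star_salmon/intermeds/sample001.bam"),
    ("sample002", "results/star_salmon/intermeds/sample002.bam"),
    ("sample003", "results/star_salmon/intermeds/sample003.bam"),
]

index_text = write_index(records)
lookup = read_index(index_text)
print(lookup["sample002"])
```

A downstream consumer only needs the lookup, which is the advantage over encoding something like a publish suffix in the path.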
The redirect to null simplifies the top-level publish def somewhat. The remaining rules could also be moved into the workflow defs since they only rename paths. It just might be more verbose since you would have to remap each channel instead of the target name.
It seems like the best delineation for what goes in the top-level publish block vs the workflow publish sections is, the workflows define what is published (including conditional logic) while the top-level publish def should define how things are being published (mode, whether to overwrite, content type, tags, etc). This is also good for modularity.
Note that some subworkflows are now using params which is an anti-pattern. For this I recommend passing those params as workflow inputs to keep things modular.
We have been trying to eliminate that when we see it
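The modularity argument can be shown with a small Python analogy (not Nextflow code): a function that reads a global behaves like a subworkflow that reads params, while a function that takes the value as an argument behaves like one that receives it as a workflow input. The `PARAMS` values, the function names, and the `.sf` file names are hypothetical.

```python
# Hypothetical parameter values, standing in for the pipeline's params.
PARAMS = {"aligner": "star_salmon", "pseudo_aligner": "salmon"}


def quantify_reading_params(samples):
    # Anti-pattern: the target folder is fixed by a global, so every call
    # publishes to the same place no matter who calls it.
    return [f"{PARAMS['aligner']}/{sample}.sf" for sample in samples]


def quantify(samples, publish_prefix):
    # Modular version: the caller decides where results go.
    return [f"{publish_prefix}/{sample}.sf" for sample in samples]


samples = ["sample001"]

# The same component can now be used twice with different destinations.
aligned = quantify(samples, PARAMS["aligner"])
pseudo = quantify(samples, PARAMS["pseudo_aligner"])
print(aligned, pseudo)
```

This is the same situation as a subworkflow that is included under two aliases: only the version that takes its destination as an input can serve both call sites.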