Workflow publish definition
This PR is a prototype for defining an output schema for a Nextflow pipeline. See https://github.com/nextflow-io/nextflow/issues/4669 and https://github.com/nextflow-io/nextflow/issues/4670 for original discussions.
The meta workflow that we are targeting is:
fetchngs -> rnaseq -> differentialabundance
In other words, we want to eliminate the manual curation of samplesheets between each pipeline. To do this, the output schema should "mirror" the params schema: it should describe the outputs as a collection of samplesheets.
Here is the tree of pipeline outputs for the fetchngs test profile:
/custom/user-settings.mkfg
/fastq/DRX024467_DRR026872.fastq.gz
/fastq/DRX026011_DRR028935_1.fastq.gz
/fastq/DRX026011_DRR028935_2.fastq.gz
/fastq/ERX1234253_ERR1160846.fastq.gz
/fastq/SRX10940790_SRR14593545_1.fastq.gz
/fastq/SRX10940790_SRR14593545_2.fastq.gz
/fastq/SRX11047067_SRR14709033.fastq.gz
/fastq/SRX17709227_SRR21711856.fastq.gz
/fastq/SRX17709228_SRR21711855.fastq.gz
/fastq/SRX6725035_SRR9984183.fastq.gz
/fastq/SRX9315476_SRR12848126_1.fastq.gz
/fastq/SRX9315476_SRR12848126_2.fastq.gz
/fastq/SRX9504942_SRR13055517_1.fastq.gz
/fastq/SRX9504942_SRR13055517_2.fastq.gz
/fastq/SRX9504942_SRR13055518_1.fastq.gz
/fastq/SRX9504942_SRR13055518_2.fastq.gz
/fastq/SRX9504942_SRR13055519_1.fastq.gz
/fastq/SRX9504942_SRR13055519_2.fastq.gz
/fastq/SRX9504942_SRR13055520_1.fastq.gz
/fastq/SRX9504942_SRR13055520_2.fastq.gz
/fastq/SRX9626017_SRR13191702_1.fastq.gz
/fastq/SRX9626017_SRR13191702_2.fastq.gz
/fastq/md5/DRX024467_DRR026872.fastq.gz.md5
/fastq/md5/DRX026011_DRR028935_1.fastq.gz.md5
/fastq/md5/DRX026011_DRR028935_2.fastq.gz.md5
/fastq/md5/ERX1234253_ERR1160846.fastq.gz.md5
/fastq/md5/SRX17709227_SRR21711856.fastq.gz.md5
/fastq/md5/SRX17709228_SRR21711855.fastq.gz.md5
/fastq/md5/SRX6725035_SRR9984183.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055517_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055517_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055518_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055518_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055519_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055519_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055520_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055520_2.fastq.gz.md5
/fastq/md5/SRX9626017_SRR13191702_1.fastq.gz.md5
/fastq/md5/SRX9626017_SRR13191702_2.fastq.gz.md5
/metadata/DRR026872.runinfo_ftp.tsv
/metadata/DRR028935.runinfo_ftp.tsv
/metadata/ERR1160846.runinfo_ftp.tsv
/metadata/GSE214215.runinfo_ftp.tsv
/metadata/GSM4907283.runinfo_ftp.tsv
/metadata/SRR12848126.runinfo_ftp.tsv
/metadata/SRR13191702.runinfo_ftp.tsv
/metadata/SRR14593545.runinfo_ftp.tsv
/metadata/SRR14709033.runinfo_ftp.tsv
/metadata/SRR9984183.runinfo_ftp.tsv
/samplesheet/id_mappings.csv
/samplesheet/multiqc_config.yml
/samplesheet/samplesheet.csv
From what I can tell, the samplesheet.csv contains all of the metadata, including the file paths, MD5 checksums, and id mappings. So the samplesheet and the fastq files comprise the essential outputs and everything else is duplication.
The initial output schema basically describes this samplesheet in a similar manner to the input_schema.json file. This particular output schema should closely resemble the input_schema.json for nf-core/rnaseq.
What I'd like to do from here is collect feedback on this approach: what else is needed to complete the output schema for this pipeline? Then we can think about how to operationalize it in Nextflow. Should Nextflow automatically generate the samplesheet from the schema? How does the schema interact with the publish mechanism? How do we collect metadata that normally can't be published directly but only through files?
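As a rough illustration of what "describing the samplesheet" could mean in practice, here is a minimal Python sketch that checks samplesheet rows against a tiny schema. The column names (`sample`, `fastq_1`, `fastq_2`) and the schema layout are assumptions for illustration, not the schema in this PR, and the sketch handles only `required` and `pattern` rather than full JSON Schema.

```python
import csv
import io
import re

# Hypothetical, simplified schema: only "required" and per-column "pattern"
# are supported. This is not the schema from this PR.
SAMPLESHEET_SCHEMA = {
    "required": ["sample", "fastq_1"],
    "properties": {
        "sample": {"pattern": r"^\S+$"},
        "fastq_1": {"pattern": r"^\S+\.f(ast)?q\.gz$"},
        "fastq_2": {"pattern": r"^\S+\.f(ast)?q\.gz$"},
    },
}


def validate_samplesheet(text, schema=SAMPLESHEET_SCHEMA):
    """Return a list of error strings; an empty list means the sheet is valid."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        for column in schema["required"]:
            if not row.get(column):
                errors.append(f"row {row_number}: missing required '{column}'")
        for column, rules in schema["properties"].items():
            value = row.get(column)
            # Empty optional values (e.g. fastq_2 for single-end) are allowed.
            if value and not re.fullmatch(rules["pattern"], value):
                errors.append(f"row {row_number}: invalid '{column}': {value}")
    return errors
```

A real implementation would more likely hand the parsed rows to an existing JSON Schema validator; the point here is only that the same schema could serve both the producing and the consuming pipeline.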
nf-core lint overall result: Passed ✅ ⚠️
Posted for pipeline commit b8dd9e2
- ✅ 154 tests passed
- ❔ 5 tests were ignored
- ❗ 5 tests had warnings
❗ Test warnings:
- files_exist - File not found: assets/multiqc_config.yml
- files_exist - File not found: conf/igenomes.config
- files_exist - File not found: .github/workflows/awstest.yml
- files_exist - File not found: .github/workflows/awsfulltest.yml
- nextflow_config - Config manifest.version should end in dev: 1.12.0
❔ Tests ignored:
- files_exist - File is ignored: conf/modules.config
- files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
- actions_ci - actions_ci
- actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/fetchngs/fetchngs/.github/workflows/awstest.yml
- multiqc_config - 'assets/multiqc_config.yml' not found
Run details
- nf-core/tools version 2.13
- Run at 2024-02-28 18:07:50
Clarifying: is the schema added in this PR the output equivalent of the samplesheet schema schema_input.json here? You have it as input_schema.json above.
I had considered this schema to be defining one of the inputs/outputs of a pipeline, whereas the nextflow_schema.json in the base directory of the repo defines all the possible inputs. Is that correct?
> From what I can tell, the samplesheet.csv contains all of the metadata, including the file paths, MD5 checksums, and id mappings. So the samplesheet and the fastq files comprise the essential outputs and everything else is duplication.
This is unique to fetchngs - I would just ignore it and pretend it doesn't exist.
It's not clear to me what this adds.
- Pipeline developer writes code to generate samplesheet in pipeline 1
- Pipeline developer writes code to read samplesheet in pipeline 2
- Optional: use a samplesheet schema with nf-validation
Where does this file fit in?
@evanfloden this output schema is like the nextflow_schema.json with the schema_input.json embedded. So it lists all of the outputs, but each samplesheet output has its own schema embedded instead of in a separate file, for simplicity.
@adamrtalbot At the very least, this output schema should be used to validate any samplesheets that are produced, and allow external tools like Seqera Platform to inspect a workflow's expected outputs e.g. for the purpose of chaining pipelines.
What isn't clear to me yet is whether the output schema can be used to automate the generation of the samplesheet.
> What isn't clear to me yet is whether the output schema can be used to automate the generation of the samplesheet.
Not clear to me either. fetchngs uses an exec process to do this, and I think that is quite an overhead to ask of every pipeline developer.
Perhaps something like this could work:
my_channel
.toSamplesheet(schema: 'output_schema.json', format: 'csv')
Although it's not clear how you go from channel contents to file contents.
I think that could work. As long as the channel emits maps (or records once we support record types properly), generating the samplesheet is trivial.
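To make the "trivial" part concrete, here is a minimal Python sketch of the idea (not Nextflow code): given a list of records such as a channel might emit, write a CSV whose columns come from the schema's `properties`. The function name `to_samplesheet` and the record fields are hypothetical.

```python
import csv
import io


def to_samplesheet(records, schema):
    """Render records (dicts) as CSV text using the schema's property order."""
    columns = list(schema["properties"])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only schema columns; fields missing from a record become empty cells.
        writer.writerow({column: record.get(column, "") for column in columns})
    return buffer.getvalue()
```

Taking the column order from the schema means the producer does not need to hard-code a header, and fields that are not in the schema are simply dropped.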
This looks like it's going in the right direction. One thing I find awful in the current schema is that the JSON schema is totally unreadable. I wonder if we should look into a different, more human-friendly system.
My biggest concern with this is how unwieldy that file is going to get when we go from defining an output schema for a very simple pipeline like fetchngs to defining one for rnaseq. This is why I was suggesting we try to incorporate the publishing logic and output file definition at the module/subworkflow/workflow level and then combine them somehow, rather than having one single massive file.
I also suspect there will still need to be some sort of "conversion" layer or plugin that can take this output schema file to generate custom csvs/jsons etc which can be used as input downstream for other pipelines. Ideally, this plugin can be invoked outside of the pipeline context.
> but each samplesheet output has its own schema embedded instead of in a separate file, for simplicity.
I don't think that we should do this: it breaks how JSON schema validation works. The beauty of using the standard is that very many platforms and libraries use the syntax in the same way. You have a parsed object in memory (be it params or the contents of a sample sheet, it doesn't really matter) and you validate it against a schema.
If we start merging the sample sheet schema into the output schema, we can no longer use it for validation as-is. We would have to validate the output files with subsets of the schema, and validate the list of output files with another subset of the schema. If you have to break the schema down to use it, it becomes custom and a lot less useful imho. Separate files are undoubtedly more verbose, but they are also much more portable.
This is why the nextflow_schema.json for params refers to the path to a separate schema file for any given files, rather than embedding that logic within.
@pditommaso YAML is fine (and Ben's YAML conversion here hopefully is a lot easier to read), but my strong preference is to stick as closely as possible to JSON Schema syntax.
To clarify, JSON Schema can be written in YAML (or TOML, or really any format), as long as it's laid out with the structure and keywords of JSON Schema. The benefit of using it is that there are about a bazillion different implementations, so it just works everywhere.
In contrast, the Yamale syntax you linked to seems to be a Python tool with its own schema syntax, so every part of our toolchain would need to build its own parser and validation library for that syntax.
The YAML Schema you linked to seems to still be valid JSON Schema, just in YAML format and with a couple of extra keys. That would still work with any JSON Schema implementation, so that'd be fine. But I'm not sure that we're doing anything complex enough to need those extra keywords to be honest.
nf-core lint overall result: Passed ✅ ⚠️
Posted for pipeline commit 9beb5ea
- ✅ 155 tests passed
- ❔ 5 tests were ignored
- ❗ 4 tests had warnings
❗ Test warnings:
- files_exist - File not found: assets/multiqc_config.yml
- files_exist - File not found: conf/igenomes.config
- files_exist - File not found: .github/workflows/awstest.yml
- files_exist - File not found: .github/workflows/awsfulltest.yml
❔ Tests ignored:
- files_exist - File is ignored: conf/modules.config
- files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
- actions_ci - actions_ci
- actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/fetchngs/fetchngs/.github/workflows/awstest.yml
- multiqc_config - 'assets/multiqc_config.yml' not found
Run details
- nf-core/tools version 2.13.1
- Run at 2024-03-01 18:00:10
I converted the JSON schema to YAML just to see what it looks like, and it is indeed much simpler. If the JSON schema can be used with YAML "schemas" just the same, that seems like the best approach to me, even for the nextflow_schema.json.
I also added a prototype for the workflow output DSL (see https://github.com/nextflow-io/nextflow/pull/4784). It allows you to define an arbitrarily nested directory structure with path(), then publish process outputs with select() using a process selector and the standard publish options. It is, in my opinion, stupid simple.
Another idea I considered was being able to select channels from the top-level workflow emits, but that is slightly more complicated to implement (and adds some boilerplate to the pipeline code) whereas I found I could get the job done with just the process outputs.
I thought about having some DSL method like index <source-channel> <filename> which could collect metadata records from a channel and write them to a file. It's actually pretty trivial to do with Groovy; fetchngs was just doing it in a roundabout way, so I simplified some things in the pipeline code.
At this point, the output DSL is concerned only with mapping process outputs to a directory structure. Where output schemas could come in is as an optional refinement to describe the structure of specific files:
select 'SRA_TO_SAMPLESHEET', pattern: 'samplesheet.csv', schema: 'schema_samplesheet.json'
select 'SRA_TO_SAMPLESHEET', pattern: 'id_mapping.csv', schema: 'schema_mapping.json'
So it's still up to the user to generate the output file, and they might even be able to use the same output schema to do it (like Adam's toSamplesheet() example). But the above definition can be used by external tools and users to understand the structure of workflow outputs without running the pipeline.
Given this example, I agree with @ewels that it makes more sense to keep the schema for each file separate. I'm imagining a nextflow command to generate some kind of global schema from this output definition (i.e. by the pipeline developer before a version release) for use by external tools.
See https://github.com/nf-core/rnaseq/pull/1227 for a similar prototype with rnaseq. It is not for the faint of heart.
I guess the schema would be optional right, a nice support? The learning curve for Nextflow is already steep for newcomers, and needing to understand JSON on top of that should be a nice-to-have rather than a must-have.
What would publishing using task variables and other workflow variables look like?
Does a workflow user have the ability to change these settings somehow, or are they fixed? Can they be augmented, for example if I want to publish some files also to another folder?
What if I don't have a samplesheet as input? As in https://github.com/nf-core/pangenome.
This was my concern, but you could generate a params.{json,yaml} as input and send it to the next pipeline. If your inputs aren't defined enough to do this I'm not sure what you could do?
I think the point here isn't to enforce a particular output format, but to allow pipeline developers to write any output, then validate it with a schema, similar to how nf-validation does it for inputs. If you don't want to use it, you can ignore it, but you'll miss out on the nice validation aspects.
Did I misunderstand and this is just for validating output, and not replacing publishDir?
I guess the schema would be optional right, a nice support?
Yes, the output schema is optional. The goal is primarily to replace publishDir and secondarily to provide a way to validate outputs.
What would publishing using task variables and other workflow variables look like?
The output DSL will be scoped to the script or workflow block (not sure which one yet), so it will be able to use any variables in that scope. Task variables aren't supported since the publishing is decoupled from the individual tasks.
Does a workflow user have the ability to change these settings somehow, or are they fixed? Can they be augmented, for example if I want to publish some files also to another folder?
Because the output definition is part of the pipeline code, it can't really be changed beyond the use of params. You can provide a param to toggle some output, but if the user wants to add something new, they'll need to fork the pipeline.
What if I don't have a samplesheet as input?
The output schema can be used with many different file formats (JSON, YAML, CSV), not just samplesheets. If your pipeline input can't be easily validated with JSON schema then you don't have to use it; it's optional.
I have a simple FASTA.gz as input. Should be easily doable, if I understand correctly.
If it's just a single input file then you can probably provide any metadata through params, so there would be no need for an index file (i.e. samplesheet) or schema to validate it.
What would publishing using task variables and other workflow variables look like?
The output DSL will be scoped to the script or workflow block (not sure which one yet), so it will be able to use any variables in that scope. Task variables aren't supported since the publishing is decoupled from the individual tasks.
Doesn't this take away a major feature like publishing files to a folder based on sample name? I think there are quite a few examples where some field of meta for example is used in the publish path.
Good point, being able to publish files as results/sample1/bam/sample1.bam is a requirement. Presumably this would work?
path( "results" ) {
    select 'SAMTOOLS_SORT', pattern: '*.bam', saveAs: { "${meta.id}/bam/${it}" }
}
For what it's worth, this is another good reason to drive publishing from channels rather than processes, because then the vals would be in scope. You can see that in action here: https://github.com/nf-core/fetchngs/pull/302/files
Doesn't this take away a major feature like publishing files to a folder based on sample name?
I think I have seen this pattern before, though I couldn't find an example of it in rnaseq.
It is a consequence of decoupling the publishing from the task execution. We might be able to recover it in #302 by allowing the path to reference channel items, e.g. given a channel of files with metadata, publish the file to a path based on the meta id, but not sure what that syntax would look like.
@adamrtalbot good point, with channel selectors we could do something like this:
path( "results" ) {
    select NFCORE_RNASEQ.out.bam, saveAs: { meta, bam -> "${meta.id}/bam/${bam.name}" }
}
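To spell out what the saveAs closure would compute for each channel item, here is a small Python sketch of the same mapping. The channel contents, the work directory paths and the plan_publish helper are hypothetical; only the "${meta.id}/bam/${bam.name}" layout comes from the snippet above.

```python
from pathlib import PurePosixPath


def save_as(meta, bam):
    """Mirror of the closure: '<meta.id>/bam/<bam file name>'."""
    return PurePosixPath(meta["id"]) / "bam" / PurePosixPath(bam).name


def plan_publish(base, channel_items, save_as):
    """Map each source file to its destination under the base publish directory."""
    return {
        str(source): str(PurePosixPath(base) / save_as(meta, source))
        for meta, source in channel_items
    }


# Hypothetical channel of (meta, bam) tuples with made-up work directory paths.
bam_channel = [
    ({"id": "sample1"}, "work/ab/123456/sample1.bam"),
    ({"id": "sample2"}, "work/cd/789abc/sample2.bam"),
]
targets = plan_publish("results", bam_channel, save_as)
```

Because the metadata travels with the file in the tuple, the sample id is available when the path is built, which is not the case when selecting by process name alone.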
The only thing is, I was imagining the selected channel would just provide paths, but if they provide tuples/records with files and metadata, it's not obvious how the file elements are being pulled out of the tuple.
it's not obvious how the file elements are being pulled out of the tuple.
Isn't this how it is now? Only path types are published. val, env, etc. are ignored.
I think I have seen this pattern before, though I couldn't find an example of it in rnaseq.
It's unusual in nf-core, but quite common elsewhere.
The only thing is, I was imagining the selected channel would just provide paths, but if they provide tuples/records with files and metadata, it's not obvious how the file elements are being pulled out of the tuple.
My thought was, capture all the contents of the channel, publish only the file-like objects. Then we can dump all the contents to a log of some description, similar to how nf-test does it in snapshots (snippet below for anyone who hasn't seen one).
{
  "with_umi": {
    "content": [
      [
        [
          {
            "id": "test",
            "single_end": true
          },
          "test.fastp.fastq.gz:md5,ba8c6c3a7ce718d9a2c5857e2edf53bc"
        ]
      ],
      [
        [
          {
            "id": "test",
            "single_end": true
          },
          "test.fastp.json:md5,d39c5c6d9a2e35fb60d26ced46569af6"
        ]
      ],
      // etc
    "meta": {
      "nf-test": "0.8.4",
      "nextflow": "23.10.1"
    },
    "timestamp": "2024-03-18T17:31:09.193212"
  }
}
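A minimal sketch of that idea, in Python rather than Nextflow: walk each channel item, treat only path-like elements as files to publish, and render the whole item in a snapshot-like form with name:md5 entries. The function names and the demo file are assumptions, and metadata maps are passed through untouched rather than searched for nested paths.

```python
import hashlib
import tempfile
from pathlib import Path


def split_files(item):
    """Recursively collect the Path objects found in nested tuples/lists."""
    files = []
    if isinstance(item, Path):
        files.append(item)
    elif isinstance(item, (list, tuple)):
        for element in item:
            files.extend(split_files(element))
    return files


def snapshot(item):
    """Render an item for the log: files become 'name:md5,<digest>', the rest is kept."""
    if isinstance(item, Path):
        digest = hashlib.md5(item.read_bytes()).hexdigest()
        return f"{item.name}:md5,{digest}"
    if isinstance(item, (list, tuple)):
        return [snapshot(element) for element in item]
    return item  # metadata such as the meta map is logged as-is


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file standing in for a real process output.
    fastq = Path(tmp) / "test.fastp.fastq.gz"
    fastq.write_bytes(b"demo")
    channel_item = ({"id": "test", "single_end": True}, fastq)
    to_publish = [path.name for path in split_files(channel_item)]
    logged = snapshot(channel_item)
```

Here to_publish holds only the file name, while logged keeps the meta map next to the file's checksum entry, much like the nf-test snapshot above.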