fetchngs Workflow publish definition

This PR is a prototype for defining an output schema for a Nextflow pipeline. See https://github.com/nextflow-io/nextflow/issues/4669 and https://github.com/nextflow-io/nextflow/issues/4670 for original discussions.

The meta workflow that we are targeting is:

fetchngs -> rnaseq -> differentialabundance

In other words, we want to eliminate the manual curation of samplesheets between each pipeline. To do this, the output schema should "mirror" the params schema: it should describe the outputs as a collection of samplesheets.

Here is the tree of pipeline outputs for the fetchngs test profile:

/custom/user-settings.mkfg
/fastq/DRX024467_DRR026872.fastq.gz
/fastq/DRX026011_DRR028935_1.fastq.gz
/fastq/DRX026011_DRR028935_2.fastq.gz
/fastq/ERX1234253_ERR1160846.fastq.gz
/fastq/SRX10940790_SRR14593545_1.fastq.gz
/fastq/SRX10940790_SRR14593545_2.fastq.gz
/fastq/SRX11047067_SRR14709033.fastq.gz
/fastq/SRX17709227_SRR21711856.fastq.gz
/fastq/SRX17709228_SRR21711855.fastq.gz
/fastq/SRX6725035_SRR9984183.fastq.gz
/fastq/SRX9315476_SRR12848126_1.fastq.gz
/fastq/SRX9315476_SRR12848126_2.fastq.gz
/fastq/SRX9504942_SRR13055517_1.fastq.gz
/fastq/SRX9504942_SRR13055517_2.fastq.gz
/fastq/SRX9504942_SRR13055518_1.fastq.gz
/fastq/SRX9504942_SRR13055518_2.fastq.gz
/fastq/SRX9504942_SRR13055519_1.fastq.gz
/fastq/SRX9504942_SRR13055519_2.fastq.gz
/fastq/SRX9504942_SRR13055520_1.fastq.gz
/fastq/SRX9504942_SRR13055520_2.fastq.gz
/fastq/SRX9626017_SRR13191702_1.fastq.gz
/fastq/SRX9626017_SRR13191702_2.fastq.gz
/fastq/md5/DRX024467_DRR026872.fastq.gz.md5
/fastq/md5/DRX026011_DRR028935_1.fastq.gz.md5
/fastq/md5/DRX026011_DRR028935_2.fastq.gz.md5
/fastq/md5/ERX1234253_ERR1160846.fastq.gz.md5
/fastq/md5/SRX17709227_SRR21711856.fastq.gz.md5
/fastq/md5/SRX17709228_SRR21711855.fastq.gz.md5
/fastq/md5/SRX6725035_SRR9984183.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055517_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055517_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055518_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055518_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055519_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055519_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055520_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055520_2.fastq.gz.md5
/fastq/md5/SRX9626017_SRR13191702_1.fastq.gz.md5
/fastq/md5/SRX9626017_SRR13191702_2.fastq.gz.md5
/metadata/DRR026872.runinfo_ftp.tsv
/metadata/DRR028935.runinfo_ftp.tsv
/metadata/ERR1160846.runinfo_ftp.tsv
/metadata/GSE214215.runinfo_ftp.tsv
/metadata/GSM4907283.runinfo_ftp.tsv
/metadata/SRR12848126.runinfo_ftp.tsv
/metadata/SRR13191702.runinfo_ftp.tsv
/metadata/SRR14593545.runinfo_ftp.tsv
/metadata/SRR14709033.runinfo_ftp.tsv
/metadata/SRR9984183.runinfo_ftp.tsv
/samplesheet/id_mappings.csv
/samplesheet/multiqc_config.yml
/samplesheet/samplesheet.csv

From what I can tell, the samplesheet.csv contains all of the metadata, including the file paths, MD5 checksums, and id mappings. So the samplesheet and the fastq files comprise the essential outputs and everything else is duplication.

The initial output schema basically describes this samplesheet in a similar manner to the input_schema.json file. This particular output schema should closely resemble the input_schema.json for nf-core/rnaseq.

What I'd like to do from here is collect feedback on this approach: what else is needed to complete the output schema for this pipeline? Then we can think about how to operationalize it in Nextflow. Should Nextflow automatically generate the samplesheet from the schema? How does the schema interact with the publish mechanism? How do we collect metadata which normally can't be published directly but only through files?

Feb 13 '24 21:02 bentsherman

`nf-core lint` overall result: Passed :white_check_mark: :warning:

Posted for pipeline commit b8dd9e2

+| ✅ 154 tests passed       |+
#| ❔   5 tests were ignored |#
!| ❗   5 tests had warnings |!

:heavy_exclamation_mark: Test warnings:

files_exist - File not found: assets/multiqc_config.yml
files_exist - File not found: conf/igenomes.config
files_exist - File not found: .github/workflows/awstest.yml
files_exist - File not found: .github/workflows/awsfulltest.yml
nextflow_config - Config manifest.version should end in dev: 1.12.0

:grey_question: Tests ignored:

files_exist - File is ignored: conf/modules.config
files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
actions_ci - actions_ci
actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/fetchngs/fetchngs/.github/workflows/awstest.yml
multiqc_config - 'assets/multiqc_config.yml' not found

:white_check_mark: Tests passed:

files_exist - File found: .gitattributes
files_exist - File found: .gitignore
files_exist - File found: .nf-core.yml
files_exist - File found: .editorconfig
files_exist - File found: .prettierignore
files_exist - File found: .prettierrc.yml
files_exist - File found: CHANGELOG.md
files_exist - File found: CITATIONS.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_exist - File found: nextflow_schema.json
files_exist - File found: nextflow.config
files_exist - File found: README.md
files_exist - File found: .github/.dockstore.yml
files_exist - File found: .github/CONTRIBUTING.md
files_exist - File found: .github/ISSUE_TEMPLATE/bug_report.yml
files_exist - File found: .github/ISSUE_TEMPLATE/config.yml
files_exist - File found: .github/ISSUE_TEMPLATE/feature_request.yml
files_exist - File found: .github/PULL_REQUEST_TEMPLATE.md
files_exist - File found: .github/workflows/branch.yml
files_exist - File found: .github/workflows/ci.yml
files_exist - File found: .github/workflows/linting_comment.yml
files_exist - File found: .github/workflows/linting.yml
files_exist - File found: assets/email_template.html
files_exist - File found: assets/email_template.txt
files_exist - File found: assets/sendmail_template.txt
files_exist - File found: assets/nf-core-fetchngs_logo_light.png
files_exist - File found: conf/test.config
files_exist - File found: conf/test_full.config
files_exist - File found: docs/images/nf-core-fetchngs_logo_light.png
files_exist - File found: docs/images/nf-core-fetchngs_logo_dark.png
files_exist - File found: docs/output.md
files_exist - File found: docs/README.md
files_exist - File found: docs/usage.md
files_exist - File found: main.nf
files_exist - File found: conf/base.config
files_exist - File found: modules.json
files_exist - File found: pyproject.toml
files_exist - File not found check: Singularity
files_exist - File not found check: parameters.settings.json
files_exist - File not found check: pipeline_template.yml
files_exist - File not found check: .nf-core.yaml
files_exist - File not found check: bin/markdown_to_html.r
files_exist - File not found check: conf/aws.config
files_exist - File not found check: .github/workflows/push_dockerhub.yml
files_exist - File not found check: .github/ISSUE_TEMPLATE/bug_report.md
files_exist - File not found check: .github/ISSUE_TEMPLATE/feature_request.md
files_exist - File not found check: docs/images/nf-core-fetchngs_logo.png
files_exist - File not found check: .markdownlint.yml
files_exist - File not found check: .yamllint.yml
files_exist - File not found check: lib/Checks.groovy
files_exist - File not found check: lib/Completion.groovy
files_exist - File not found check: lib/Workflow.groovy
files_exist - File not found check: lib/Utils.groovy
files_exist - File not found check: lib/WorkflowMain.groovy
files_exist - File not found check: lib/NfcoreTemplate.groovy
files_exist - File not found check: lib/WorkflowFetchngs.groovy
files_exist - File not found check: lib/nfcore_external_java_deps.jar
files_exist - File not found check: .travis.yml
nextflow_config - Config variable found: manifest.name
nextflow_config - Config variable found: manifest.nextflowVersion
nextflow_config - Config variable found: manifest.description
nextflow_config - Config variable found: manifest.version
nextflow_config - Config variable found: manifest.homePage
nextflow_config - Config variable found: timeline.enabled
nextflow_config - Config variable found: trace.enabled
nextflow_config - Config variable found: report.enabled
nextflow_config - Config variable found: dag.enabled
nextflow_config - Config variable found: process.cpus
nextflow_config - Config variable found: process.memory
nextflow_config - Config variable found: process.time
nextflow_config - Config variable found: params.outdir
nextflow_config - Config variable found: params.input
nextflow_config - Config variable found: params.validationShowHiddenParams
nextflow_config - Config variable found: params.validationSchemaIgnoreParams
nextflow_config - Config variable found: manifest.mainScript
nextflow_config - Config variable found: timeline.file
nextflow_config - Config variable found: trace.file
nextflow_config - Config variable found: report.file
nextflow_config - Config variable found: dag.file
nextflow_config - Config variable (correctly) not found: params.nf_required_version
nextflow_config - Config variable (correctly) not found: params.container
nextflow_config - Config variable (correctly) not found: params.singleEnd
nextflow_config - Config variable (correctly) not found: params.igenomesIgnore
nextflow_config - Config variable (correctly) not found: params.name
nextflow_config - Config variable (correctly) not found: params.enable_conda
nextflow_config - Config timeline.enabled had correct value: true
nextflow_config - Config report.enabled had correct value: true
nextflow_config - Config trace.enabled had correct value: true
nextflow_config - Config dag.enabled had correct value: true
nextflow_config - Config manifest.name began with nf-core/
nextflow_config - Config variable manifest.homePage began with https://github.com/nf-core/
nextflow_config - Config dag.file ended with .html
nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
nextflow_config - Config params.custom_config_version is set to master
nextflow_config - Config params.custom_config_base is set to https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Lines for loading custom profiles found
nextflow_config - nextflow.config contains configuration profile test
nextflow_config - Config default value correct: params.sample_mapping_fields= experiment_accession,run_accession,sample_accession,experiment_alias,run_alias,sample_alias,experiment_title,sample_title,sample_description
nextflow_config - Config default value correct: params.nf_core_rnaseq_strandedness= auto
nextflow_config - Config default value correct: params.download_method= ftp
nextflow_config - Config default value correct: params.custom_config_version= master
nextflow_config - Config default value correct: params.custom_config_base= https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Config default value correct: params.max_cpus= 16
nextflow_config - Config default value correct: params.max_memory= 128.GB
nextflow_config - Config default value correct: params.max_time= 240.h
nextflow_config - Config default value correct: params.publish_dir_mode= copy
nextflow_config - Config default value correct: params.validate_params= true
files_unchanged - .gitattributes matches the template
files_unchanged - .prettierrc.yml matches the template
files_unchanged - CODE_OF_CONDUCT.md matches the template
files_unchanged - LICENSE matches the template
files_unchanged - .github/.dockstore.yml matches the template
files_unchanged - .github/CONTRIBUTING.md matches the template
files_unchanged - .github/ISSUE_TEMPLATE/bug_report.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/config.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/feature_request.yml matches the template
files_unchanged - .github/PULL_REQUEST_TEMPLATE.md matches the template
files_unchanged - .github/workflows/branch.yml matches the template
files_unchanged - .github/workflows/linting_comment.yml matches the template
files_unchanged - .github/workflows/linting.yml matches the template
files_unchanged - assets/email_template.html matches the template
files_unchanged - assets/email_template.txt matches the template
files_unchanged - assets/nf-core-fetchngs_logo_light.png matches the template
files_unchanged - docs/images/nf-core-fetchngs_logo_light.png matches the template
files_unchanged - docs/images/nf-core-fetchngs_logo_dark.png matches the template
files_unchanged - docs/README.md matches the template
files_unchanged - .gitignore matches the template
files_unchanged - .prettierignore matches the template
files_unchanged - pyproject.toml matches the template
readme - README Nextflow minimum version badge matched config. Badge: 23.04.0, Config: 23.04.0
readme - README Zenodo placeholder was replaced with DOI.
pipeline_todos - No TODO strings found
pipeline_name_conventions - Name adheres to nf-core convention
template_strings - Did not find any Jinja template strings (147 files)
schema_lint - Schema lint passed
schema_lint - Schema title + description lint passed
schema_lint - Input mimetype lint passed: 'text/csv'
schema_params - Schema matched params returned from nextflow config
system_exit - No System.exit calls found
actions_schema_validation - Workflow validation passed: branch.yml
actions_schema_validation - Workflow validation passed: cloud_tests_small.yml
actions_schema_validation - Workflow validation passed: linting.yml
actions_schema_validation - Workflow validation passed: ci.yml
actions_schema_validation - Workflow validation passed: clean-up.yml
actions_schema_validation - Workflow validation passed: fix-linting.yml
actions_schema_validation - Workflow validation passed: linting_comment.yml
actions_schema_validation - Workflow validation passed: download_pipeline.yml
actions_schema_validation - Workflow validation passed: cloud_tests_full.yml
actions_schema_validation - Workflow validation passed: release-announcements.yml
merge_markers - No merge markers found in pipeline files
modules_json - Only installed modules found in modules.json
modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'

Run details

nf-core/tools version 2.13
Run at 2024-02-28 18:07:50

Feb 13 '24 21:02 github-actions[bot]

To clarify: is the schema added in this PR the output equivalent of the samplesheet schema schema_input.json here? You have it as input_schema.json above.

I had considered this schema to be defining one of the inputs/outputs of a pipeline, whereas the nextflow_schema.json in the base directory of the repo defines all the possible inputs. Is that correct?

Feb 14 '24 07:02 evanfloden

> From what I can tell, the samplesheet.csv contains all of the metadata, including the file paths, MD5 checksums, and id mappings. So the samplesheet and the fastq files comprise the essential outputs and everything else is duplication.

This is unique to fetchngs - I would just ignore it and pretend it doesn't exist.

Feb 14 '24 09:02 adamrtalbot

It's not clear to me what this adds.

1. Pipeline developer writes code to generate samplesheet in pipeline 1
2. Pipeline developer writes code to read samplesheet in pipeline 2
   - Optional: use a samplesheet schema with nf-validation

Where does this file fit in?

Feb 14 '24 09:02 adamrtalbot

@evanfloden this output schema is like the nextflow_schema.json with the schema_input.json embedded. So it lists all of the outputs, but each samplesheet output has its own schema embedded instead of in a separate file, for simplicity.

@adamrtalbot At the very least, this output schema should be used to validate any samplesheets that are produced, and allow external tools like Seqera Platform to inspect a workflow's expected outputs e.g. for the purpose of chaining pipelines.
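As a rough illustration of the validation use case, here is a minimal sketch in plain Python rather than Nextflow. The column names (`sample`, `fastq_1`, `fastq_2`) and their patterns are assumptions for the example, not the schema in this PR, and the hand-rolled check only mimics the `required` and `pattern` keywords instead of being a real JSON Schema validator.

```python
import csv
import io
import re

# Hypothetical, simplified schema for one samplesheet row. The column names
# and patterns are assumptions for illustration, not the schema in this PR.
ROW_SCHEMA = {
    "required": ["sample", "fastq_1"],
    "properties": {
        "sample": {"pattern": r"^\S+$"},
        "fastq_1": {"pattern": r"^\S+\.f(ast)?q\.gz$"},
        "fastq_2": {"pattern": r"^\S+\.f(ast)?q\.gz$"},
    },
}


def validate_samplesheet(text, schema=ROW_SCHEMA):
    """Return a list of error messages; an empty list means the sheet is valid."""
    errors = []
    # The header is line 1, so the first data row is reported as line 2.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        for name in schema["required"]:
            if not row.get(name):
                errors.append(f"line {line_no}: missing required field '{name}'")
        for name, rule in schema["properties"].items():
            value = row.get(name)
            # Empty optional fields are allowed; only non-empty values are checked.
            if value and not re.fullmatch(rule["pattern"], value):
                errors.append(f"line {line_no}: '{name}' has unexpected value '{value}'")
    return errors
```

In practice an off-the-shelf JSON Schema library would presumably do this job; the point is only that the check needs nothing but the published file and its schema.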

What isn't clear to me yet is whether the output schema can be used to automate the generation of the samplesheet.

Feb 14 '24 14:02 bentsherman

> What isn't clear to me yet is whether the output schema can be used to automate the generation of the samplesheet.

Not clear to me either. fetchngs uses an exec process to do this, and I think that is quite an overhead for every pipeline developer to take on.

Perhaps something like this could work:

my_channel
    .toSamplesheet(schema: 'output_schema.json', format: 'csv')

Although it's not clear how you go from channel contents to file contents.

Feb 14 '24 14:02 adamrtalbot

I think that could work. As long as the channel emits maps (or records once we support record types properly), generating the samplesheet is trivial.
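To make the "maps in, samplesheet out" step concrete, here is a minimal sketch in plain Python rather than Nextflow. The function name `to_samplesheet` borrows from Adam's example, but its signature and the column names in the usage note are assumptions for illustration.

```python
import csv
import io


def to_samplesheet(records, columns):
    """Render an iterable of dict records as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",             # keys missing from a record become empty cells
        extrasaction="ignore",  # keys not listed in `columns` are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

For example, `to_samplesheet([{"sample": "SRX6725035", "fastq_1": "SRX6725035_SRR9984183.fastq.gz"}], ["sample", "fastq_1", "fastq_2"])` yields a header line plus one row with an empty `fastq_2` cell. The column list could plausibly be taken from the property names in the output schema, which is where the schema would help with generation.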

Feb 14 '24 15:02 bentsherman

This looks like it's going in the right direction. One thing I find awful about the current schema is that JSON Schema is totally unreadable. I wonder if we should look into a different, more human-friendly system.

Feb 19 '24 12:02 pditommaso

Possible alternatives

Feb 19 '24 13:02 pditommaso

My biggest concern with this is how unwieldy that file is going to get when we go from defining an output schema for a very simple pipeline like fetchngs to defining one for rnaseq. This is why I was suggesting we try to incorporate the publishing logic and output file definition at the module/subworkflow/workflow level and then combine them somehow, rather than having one single massive file.

I also suspect there will still need to be some sort of "conversion" layer or plugin that can take this output schema file to generate custom csvs/jsons etc which can be used as input downstream for other pipelines. Ideally, this plugin can be invoked outside of the pipeline context.

Feb 21 '24 12:02 drpatelh

> but each samplesheet output has its own schema embedded instead of in a separate file for simplicity.

I don't think that we should do this, it breaks how JSON schema validation works. The beauty of using the standard is that very many platforms and libraries use the syntax in the same way. You have a parsed object in memory (be it params or the contents of a sample sheet, doesn't really matter) and you validate it against a schema.

If we start merging sample sheet schema inside output schema, we can no longer use this for validation. We would have to validate the output files with subsets of the schema, and validate the list of output files with a subset of the schema. If you have to break the schema down to use it, it becomes custom and a lot less useful imho. Separate files are undoubtedly more verbose, but they're also much more portable.

This is why the nextflow_schema.json for params refers to the path to a separate schema file for any given files, rather than embedding that logic within.
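A minimal sketch of that reference-by-path layout, in plain Python. The listing format (an `outputs` object whose entries carry a `schema` path) and the function name are hypothetical, invented for illustration. The only point is that each referenced file is loaded whole as a standalone schema, so it can be handed to any standard validator without slicing up a larger document.

```python
import json
from pathlib import Path


def load_file_schemas(listing_path):
    """Return {output file name: parsed schema} for entries that reference one."""
    listing_path = Path(listing_path)
    listing = json.loads(listing_path.read_text())
    schemas = {}
    for name, entry in listing["outputs"].items():
        ref = entry.get("schema")
        if ref is not None:
            # Resolve the reference relative to the listing, then load it as a
            # complete, standalone schema document.
            schemas[name] = json.loads((listing_path.parent / ref).read_text())
    return schemas
```

Outputs without a `schema` entry are simply skipped, which keeps per-file schemas optional.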

Feb 28 '24 18:02 ewels

@pditommaso YAML is fine (and Ben's YAML conversion here hopefully is a lot easier to read), but my strong preference is to stick with as-close-to-as-possible JSON Schema syntax.

To clarify, JSON Schema can be written in YAML (or TOML, or really any format), as long as it's laid out with the structure and keywords of JSON Schema. The benefit of using it is that there are about a bazillion different implementations so it just works everywhere.

In contrast, the Yamale syntax you linked to seems to be a Python tool with its own schema syntax, so every part of our toolchain would need to build its own parser and validation library for that syntax.

The YAML Schema you linked to seems to still be valid JSON Schema, just in YAML format and with a couple of extra keys. That would still work with any JSON Schema implementation, so that'd be fine. But I'm not sure that we're doing anything complex enough to need those extra keywords to be honest.

Feb 28 '24 18:02 ewels

`nf-core lint` overall result: Passed :white_check_mark: :warning:

Posted for pipeline commit 9beb5ea

+| ✅ 155 tests passed       |+
#| ❔   5 tests were ignored |#
!| ❗   4 tests had warnings |!

:heavy_exclamation_mark: Test warnings:

files_exist - File not found: assets/multiqc_config.yml
files_exist - File not found: conf/igenomes.config
files_exist - File not found: .github/workflows/awstest.yml
files_exist - File not found: .github/workflows/awsfulltest.yml

:grey_question: Tests ignored:

files_exist - File is ignored: conf/modules.config
files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
actions_ci - actions_ci
actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/fetchngs/fetchngs/.github/workflows/awstest.yml
multiqc_config - 'assets/multiqc_config.yml' not found

Run details

nf-core/tools version 2.13.1
Run at 2024-03-01 18:00:10

Feb 28 '24 21:02 github-actions[bot]

I converted the JSON schema to YAML just to see what it looks like, and it is indeed much simpler. If the JSON schema can be used with YAML "schemas" just the same, that seems like the best approach to me, even for the nextflow_schema.json.

Feb 28 '24 21:02 bentsherman

I also added a prototype for the workflow output DSL (see https://github.com/nextflow-io/nextflow/pull/4784). It allows you to define an arbitrarily nested directory structure with path(), then publish process outputs with select() using a process selector and the standard publish options. It is, in my opinion, stupid simple.

Another idea I considered was being able to select channels from the top-level workflow emits, but that is slightly more complicated to implement (and adds some boilerplate to the pipeline code) whereas I found I could get the job done with just the process outputs.

I thought about having some DSL method like index <source-channel> <filename> which could collect metadata records from a channel and write them to a file. It's actually pretty trivial to do with Groovy, fetchngs was just doing it in a roundabout way, so I simplified some things in the pipeline code.

Feb 28 '24 21:02 bentsherman

At this point, the output DSL is concerned only with mapping process outputs to a directory structure. Where output schemas could come in is as an optional refinement to describe the structure of specific files:

select 'SRA_TO_SAMPLESHEET', pattern: 'samplesheet.csv', schema: 'schema_samplesheet.json'
select 'SRA_TO_SAMPLESHEET', pattern: 'id_mapping.csv', schema: 'schema_mapping.json'

So it's still up to the user to generate the output file, and they might even be able to use the same output schema to do it (like Adam's toSamplesheet() example). But the above definition can be used by external tools and users to understand the structure of workflow outputs without running the pipeline.
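As a sketch of how an external tool might consume such a definition, the snippet below (plain Python) looks up which schema applies to a published file. Representing the two select statements as tuples and matching `pattern` as a shell-style glob are both assumptions for illustration; this is not how Nextflow stores or evaluates them.

```python
from fnmatch import fnmatch

# Rules transcribed from the two select statements above. The tuple layout
# (process selector, file pattern, schema path) is a hypothetical representation.
RULES = [
    ("SRA_TO_SAMPLESHEET", "samplesheet.csv", "schema_samplesheet.json"),
    ("SRA_TO_SAMPLESHEET", "id_mapping.csv", "schema_mapping.json"),
]


def schema_for(process, filename, rules=RULES):
    """Return the schema path declared for a process output, or None if undeclared."""
    for selector, pattern, schema in rules:
        if selector == process and fnmatch(filename, pattern):
            return schema
    return None
```

Here `schema_for("SRA_TO_SAMPLESHEET", "samplesheet.csv")` returns `"schema_samplesheet.json"`, and a file with no matching rule returns `None`, so a tool could answer "what does this pipeline emit, and in what shape?" from the definition alone.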

Given this example, I agree with @ewels that it makes more sense to keep the schema for each file separate. I'm imagining a nextflow command to generate some kind of global schema from this output definition (i.e. by the pipeline developer before a version release) for use by external tools.
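As a rough picture of what an external tool could do with a per-file schema, here is a minimal Python sketch that checks samplesheet rows against a schema. It hand-rolls only a tiny subset of JSON Schema (required and pattern) rather than using a real validator, and the schema contents and column names are assumptions for illustration, not the actual schema_samplesheet.json.

```python
import csv
import io
import re

# Hypothetical schema for one samplesheet row (small JSON Schema subset).
SAMPLESHEET_SCHEMA = {
    "type": "object",
    "required": ["sample", "fastq_1"],
    "properties": {
        "sample": {"type": "string", "pattern": r"^\S+$"},
        "fastq_1": {"type": "string", "pattern": r"^\S+\.f(ast)?q\.gz$"},
        "fastq_2": {"type": "string", "pattern": r"^\S+\.f(ast)?q\.gz$"},
    },
}


def validate_rows(csv_text, schema):
    """Return a list of (row_number, message) for rows that violate the schema."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(reader, start=1):
        for column in schema.get("required", []):
            if not row.get(column):
                errors.append((number, f"missing required column '{column}'"))
        for column, rules in schema.get("properties", {}).items():
            value = row.get(column)
            pattern = rules.get("pattern")
            # Empty optional cells are allowed; only check non-empty values.
            if value and pattern and not re.search(pattern, value):
                errors.append((number, f"'{column}' does not match {pattern}"))
    return errors


good = "sample,fastq_1,fastq_2\nS1,S1_1.fastq.gz,S1_2.fastq.gz\n"
bad = "sample,fastq_1,fastq_2\nS2,S2_1.bam,\n"

good_errors = validate_rows(good, SAMPLESHEET_SCHEMA)
bad_errors = validate_rows(bad, SAMPLESHEET_SCHEMA)
```

The point of the sketch is only that the check needs the file and its schema, not a pipeline run.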

Feb 28 '24 21:02 bentsherman

See https://github.com/nf-core/rnaseq/pull/1227 for a similar prototype with rnaseq. It is not for the faint of heart.

Feb 29 '24 04:02 bentsherman

I guess the schema would be optional right, a nice support? The learning curve for Nextflow is already steep for newcomers, and requiring an understanding of JSON on top should be a nice-to-have rather than a must-have.

How would publishing using task variables and other workflow variables look like?

Does a workflow user have the ability to change these settings some how, or are they fixed? Can they be augmented, for example if I want to publish some files also to another folder?

Mar 19 '24 09:03 mahesh-panchal

What if I don't have a samplesheet as input? As in https://github.com/nf-core/pangenome.

Mar 19 '24 09:03 subwaystation

What if I don't have a samplesheet as input? As in https://github.com/nf-core/pangenome.

This was my concern, but you could generate a params.{json,yaml} as input and send it to the next pipeline. If your inputs aren't defined enough to do this I'm not sure what you could do?

I think the point here isn't to enforce a particular output format, but to allow pipeline developers to write any output and then validate it with a schema, similar to how nf-validation does it for inputs. If you don't want to use it, you can ignore it, but you'll miss out on the nice validation aspects.

Mar 19 '24 11:03 adamrtalbot

Did I misunderstand and this is just for validating output, and not replacing publishDir?

Mar 19 '24 13:03 mahesh-panchal

I guess the schema would be optional right, a nice support?

Yes the output schema is optional. The goal is primarily to replace publishDir and secondarily provide a way to validate outputs

How would publishing using task variables and other workflow variables look like?

The output DSL will be scoped to the script or workflow block (not sure which one yet), so it will be able to use any variables in that scope. Task variables aren't supported since the publishing is decoupled from the individual tasks

Does a workflow user have the ability to change these settings some how, or are they fixed? Can they be augmented, for example if I want to publish some files also to another folder?

Because the output definition is part of the pipeline code, it can't really be changed beyond the use of params. You can provide a param to toggle some output, but if the user wants to add something new, they'll need to fork the pipeline

What if I don't have a samplesheet as input?

The output schema can be used with many different file formats (JSON, YAML, CSV), not just samplesheets, and if your pipeline input can't be easily validated with JSON schema then you don't have to use it, it's optional

Mar 19 '24 13:03 bentsherman

I have a simple FASTA.gz as input. Should be easily doable, if I understand correctly.

Mar 19 '24 14:03 subwaystation

If it's just a single input file then you can probably provide any metadata through params, so there would be no need for an index file (i.e. samplesheet) or schema to validate it

Mar 19 '24 14:03 bentsherman

How would publishing using task variables and other workflow variables look like?

The output DSL will be scoped to the script or workflow block (not sure which one yet), so it will be able to use any variables in that scope. Task variables aren't supported since the publishing is decoupled from the individual tasks

Doesn't this take away a major feature like publishing files to a folder based on sample name? I think there are quite a few examples where some field of meta for example is used in the publish path.

Mar 19 '24 14:03 mahesh-panchal

Doesn't this take away a major feature like publishing files to a folder based on sample name? I think there are quite a few examples where some field of meta for example is used in the publish path.

Good point, being able to publish files as results/sample1/bam/sample1.bam is a requirement. Presumably this would work?

path( "results" ) {
    select 'SAMTOOLS_SORT', pattern: '*.bam', saveAs: { "${meta.id}/bam/${it}" }
}

For what it's worth, this is another good argument for driving publishing from channels rather than processes, because then the vals would be in scope. You can see that in action here: https://github.com/nf-core/fetchngs/pull/302/files

Mar 19 '24 15:03 adamrtalbot

Doesn't this take away a major feature like publishing files to a folder based on sample name?

I think I have seen this pattern before, though I couldn't find an example of it in rnaseq.

It is a consequence of decoupling the publishing from the task execution. We might be able to recover it in #302 by allowing the path to reference channel items, e.g. given a channel of files with metadata, publish the file to a path based on the meta id, but not sure what that syntax would look like.

Mar 19 '24 15:03 bentsherman

@adamrtalbot good point, with channel selectors we could do something like this:

path( "results" ) {
    select NFCORE_RNASEQ.out.bam, saveAs: { meta, bam -> "${meta.id}/bam/${bam.name}" }
}

The only thing is, I was imagining the selected channel would just provide paths, but if they provide tuples/records with files and metadata, it's not obvious how the file elements are being pulled out of the tuple.
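In Python terms, the channel-selector idea amounts to applying a path-building function to each item of the channel. The sketch below assumes every item is exactly a (meta, file) pair, which sidesteps the tuple question rather than answering it. The names save_as and plan_publish and the work-directory paths are made up for illustration.

```python
from pathlib import PurePosixPath


def save_as(meta, bam):
    """Mirror of the saveAs closure above: <meta.id>/bam/<file name>."""
    return f"{meta['id']}/bam/{PurePosixPath(bam).name}"


def plan_publish(base_dir, channel_items, save_as):
    """Map each (meta, file) item to (source, destination) under base_dir.

    Hypothetical helper: it only computes the destinations and does not
    copy anything.
    """
    plan = []
    for meta, source in channel_items:
        destination = PurePosixPath(base_dir) / save_as(meta, source)
        plan.append((str(source), str(destination)))
    return plan


# Stand-in for a channel of (meta, bam) tuples; the work paths are made up.
items = [
    ({"id": "sample1"}, "/work/ab/123456/sample1.bam"),
    ({"id": "sample2"}, "/work/cd/789abc/sample2.bam"),
]

publish_plan = plan_publish("results", items, save_as)
```

With this shape, sample1.bam ends up at results/sample1/bam/sample1.bam, which is the layout asked for earlier in the thread.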

Mar 19 '24 15:03 bentsherman

it's not obvious how the file elements are being pulled out of the tuple.

Isn't this how it is now? Only path types are published. val, env, etc are ignored.

Mar 19 '24 15:03 mahesh-panchal

I think I have seen this pattern before, though I couldn't find an example of it in rnaseq.

It's unusual in nf-core, but quite common elsewhere.

The only thing is, I was imagining the selected channel would just provide paths, but if they provide tuples/records with files and metadata, it's not obvious how the file elements are being pulled out of the tuple.

My thought was, capture all the contents of the channel, publish only the file-like objects. Then we can dump all the contents to a log of some description, similar to how nf-test does it in snapshots (snippet below for anyone who hasn't seen one).

{
    "with_umi": {
        "content": [
            [
                [
                    {
                        "id": "test",
                        "single_end": true
                    },
                    "test.fastp.fastq.gz:md5,ba8c6c3a7ce718d9a2c5857e2edf53bc"
                ]
            ],
            [
                [
                    {
                        "id": "test",
                        "single_end": true
                    },
                    "test.fastp.json:md5,d39c5c6d9a2e35fb60d26ced46569af6"
                ]
            ],
            // etc
        ],
        "meta": {
            "nf-test": "0.8.4",
            "nextflow": "23.10.1"
        },
        "timestamp": "2024-03-18T17:31:09.193212"
    }
}
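The "capture everything, publish only the file-like objects" idea could look roughly like the Python sketch below. It assumes that "file-like" means a pathlib.Path instance and that anything else (dicts, strings, numbers) is metadata to be logged; how Nextflow would actually tell the two apart is not settled here, and split_item is a hypothetical name.

```python
from pathlib import Path


def split_item(item):
    """Split one channel item into (files, everything_else).

    Assumption for this sketch: "file-like" means a pathlib.Path instance;
    plain strings, dicts and numbers are treated as metadata.
    """
    files, other = [], []

    def visit(value):
        if isinstance(value, Path):
            files.append(value)
        elif isinstance(value, (list, tuple)):
            # Recurse into nested tuples/lists, as in the snapshot above.
            for element in value:
                visit(element)
        else:
            other.append(value)

    visit(item)
    return files, other


item = (
    {"id": "test", "single_end": True},
    [Path("test.fastp.fastq.gz"), Path("test.fastp.json")],
)

files, other = split_item(item)
```

Here files would be published and other would go to the log, so the metadata stays attached to the record of what was published.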

Mar 19 '24 15:03 adamrtalbot
