
Optional inputs for DSL2

Open illusional opened this issue 3 years ago • 18 comments

New feature

Pinging @rsuchecki @pditommaso (I couldn't find an issue with this, I hope it's okay that I open a new one).

Based on a short conversation on Gitter (1 - primary | 2), there's interest (a lot of it from me) in more direct support for optional inputs. This seems in line with DSL2's goal of producing reusable tool modules / interfaces.

Other workflow specifications have the concept of tool wrappers, which aim to be "write once, use in all of your workflows". The tool wrapper contains most (if not all) of the available configuration options, and the command line is then constructed dynamically from them. This allows the community to build and contribute high quality tool wrappers, for example: Common Workflow Library (CWLibrary#fastqc), BioWDL (BioWDL#fastqc). Other users can then use these tools directly, or they can be uploaded to stores like Dockstore or the Galaxy toolshed.

Projects like aCLImatise aim to generate tool wrappers automatically, as writing them is usually one of the most time-consuming parts of building workflows.

The DSL2 makes good strides towards this, and a stronger concept for optional inputs would take this further.

Relevant discussion:

Command line construction sidenote

I think it would be a bad idea to create a new syntax for building or interpolating command lines, but tool developers could use the groovy environment to build strings for each command option.
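As a rough illustration of that idea, plain Groovy is enough to turn a map of optional settings into a command-line fragment. This is only a sketch: the helper name `buildArgs` and the `--quiet` / `--threads` entries are hypothetical placeholders, and only `-c` comes from the fastqc example in this issue.

```groovy
// Hypothetical helper: turn optional settings into a command-line fragment.
// Entries whose value is null or false are skipped; `true` emits a bare flag.
def buildArgs(Map<String, Object> options) {
    options.findAll { flag, value -> value != null && !Boolean.FALSE.equals(value) }
           .collect { flag, value -> value instanceof Boolean ? flag : "${flag} ${value}" }
           .join(' ')
}

// The option names here are assumptions chosen for illustration only.
def fragment = buildArgs(['-c': 'contaminants.txt', '--quiet': true, '--threads': null])
assert fragment == '-c contaminants.txt --quiet'
```

The resulting string can be interpolated into a process `script:` block like any other Groovy variable.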

Usage scenario

Consider fastqc (eg: nf-core module definition), which might have the (simplified) command structure:

fastqc \
    [-c contaminant file] \
    [ ... other config options ] \
    seqfile1 .. seqfileN

I could build a process definition to encapsulate these ways to optionally configure the tool.

This process definition is purely hypothetical; it is just one way I could think of to do it.

process FASTQC {
    input:
        tuple val(name),
        Optional[path(contaminant)],
        path(reads)

    output:
        path("*.zip"), emit: zip

    script:
    contaminant_script = (contaminant != null) ? "--contaminant ${contaminant}" : ""
    reads_script = reads.join(' ')
    """
    fastqc \
        ${contaminant_script} \
        ${reads_script}
    """
}

But usage of imported modules in DSL2 in a workflow requires positional arguments, so you would have something like:

include { FASTQC as fastqc } from './tools/fastqc'

workflow {
    fastqc(params.name, null, params.reads)
}

Suggested implementation

As @rsuchecki noted in gitter:

Things are very flexible for val inputs, but understandably get more complex when files/paths are involved as they need to be staged. Tuples are nice and keep things organised but are still an extension of the same idea of positional inputs.

I'd hope to avoid the use of positional arguments, because you can't ascertain context for a variable.

illusional avatar Aug 04 '20 05:08 illusional

There are also some tools that can have multiple types of input files (actually any combination of those inputs). As such, none of them are mandatory, but you need at least one. For instance, if we look at read assemblers such as megahit, you can do either:

# Case 1: paired-end reads
megahit -1 sample1_R1.fastq.gz,sample2_R1.fastq.gz -2 sample1_R2.fastq.gz,sample2_R2.fastq.gz

# Case 2: paired-end, interleaved reads
megahit --12 sample1.fastq.gz,sample2.fastq.gz

# Case 3: single-end reads
megahit -r reads_single.fastq.gz 

# Case 4: multiple input types combined
megahit -1 sample1_paired_R1.fastq.gz,sample2_paired_R1.fastq.gz \
        -2 sample1_paired_R2.fastq.gz,sample2_paired_R2.fastq.gz \
        -r sample1_unpaired.fastq.gz,sample2_unpaired.fastq.gz

# And more...

Lately I had trouble handling this case with the DSL2 syntax in a clean way.
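One possible way to express the "any combination, but at least one" rule is a small Groovy helper that builds the command from possibly empty lists. This is a sketch under assumptions: the function name `megahitCommand` and its parameter names are hypothetical, it works on plain lists of file names rather than staged Nextflow paths, and only the flags shown in the cases above (`-1`, `-2`, `--12`, `-r`) are used.

```groovy
// Hypothetical helper: assemble a megahit command from optional read sets.
// Each argument is a (possibly empty) list of file names.
def megahitCommand(List r1 = [], List r2 = [], List interleaved = [], List single = []) {
    if (r1.size() != r2.size())
        throw new IllegalArgumentException('-1 and -2 need the same number of files')
    def parts = ['megahit']
    if (r1)          parts << "-1 ${r1.join(',')}" << "-2 ${r2.join(',')}"
    if (interleaved) parts << "--12 ${interleaved.join(',')}"
    if (single)      parts << "-r ${single.join(',')}"
    // None of the inputs is mandatory on its own, but at least one must be given.
    if (parts.size() == 1)
        throw new IllegalArgumentException('at least one read input is required')
    parts.join(' ')
}

assert megahitCommand([], [], [], ['reads_single.fastq.gz']) == 'megahit -r reads_single.fastq.gz'
assert megahitCommand(['s1_R1.fastq.gz'], ['s1_R2.fastq.gz']) == 'megahit -1 s1_R1.fastq.gz -2 s1_R2.fastq.gz'
```

This only covers string construction. Getting the optional files staged by the process is the part that DSL2 does not address cleanly.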

Puumanamana avatar Aug 12 '20 00:08 Puumanamana

I managed to find a solution (not as clean as I would have hoped). https://github.com/nf-core/sarek/blob/a7679b9b5c178351b1e96a3ffe7ee81ddf9aad06/main.nf#L226

Which I later use in a clean manner in a process: https://github.com/nf-core/sarek/blob/dsl2/modules/nf-core/software/qualimap_bamqc.nf

maxulysse avatar Aug 27 '20 08:08 maxulysse

Yep, this would be really nice. Using NO_FILE as suggested here doesn't work for optional inputs on AWS as @apeltzer found.

Another solution is to have a dummy file in the pipeline repo that you can stage if the actual file isn't required in the process e.g. initiated here and used here.

This also means you won't have to write anything to the results directory as suggested by @MaxUlysse.
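For readers who have not seen it, the placeholder-file approach looks roughly like the sketch below. It is not the nf-core implementation: the `assets/dummy_file.txt` location, the parameter names `params.reads` and `params.contaminant`, and the check on the file name are all assumptions made for illustration.

```groovy
nextflow.enable.dsl=2

// Hypothetical parameters: input reads and an optional contaminant file.
params.reads       = 'data/*.fastq.gz'
params.contaminant = null

process FASTQC {
    input:
    path reads
    path contaminant   // either the real file or the placeholder

    output:
    path '*.zip', emit: zip

    script:
    // Assumption: the placeholder is recognised purely by its file name.
    def contaminant_arg = contaminant.name != 'dummy_file.txt' ? "-c ${contaminant}" : ''
    """
    fastqc ${contaminant_arg} ${reads}
    """
}

workflow {
    // Hypothetical location of an empty placeholder committed to the pipeline repo.
    def placeholder = file("${projectDir}/assets/dummy_file.txt", checkIfExists: true)
    def contaminant = params.contaminant ? file(params.contaminant, checkIfExists: true) : placeholder
    FASTQC(Channel.fromPath(params.reads).collect(), contaminant)
}
```

Because the placeholder is a real file, it is staged like any other input. That is presumably why this variant avoids the staging problem reported for `NO_FILE` on AWS.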

drpatelh avatar Aug 29 '20 10:08 drpatelh

It looks like there are a couple of common workarounds:

  • Files without a value (so pure optional inputs) - placeholder file
  • Ability to pass some set of configuration options - seems a few people use val(meta)

It may also be worth recognising a few common argument patterns that tools require, to better wrap a "tool interface":

  • Mutually exclusive sets of arguments
  • Accepted values (a range or a set of values)
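Until something like this is supported natively, both patterns can be approximated with ordinary Groovy checks before a process is invoked. The sketch below is hypothetical throughout: `validate`, the option names, and the allowed values are invented for illustration and are not part of Nextflow.

```groovy
// Hypothetical validator for a tool wrapper's options.
// exclusiveGroups: lists of option names of which at most one may be set.
// allowed: option name -> collection (set or range) of accepted values.
def validate(Map opts, List<List<String>> exclusiveGroups, Map<String, Collection> allowed) {
    def errors = []
    exclusiveGroups.each { group ->
        def given = group.findAll { opts[it] != null }
        if (given.size() > 1) errors << "mutually exclusive: ${given.join(', ')}".toString()
    }
    allowed.each { name, values ->
        if (opts[name] != null && !(opts[name] in values))
            errors << "${name} is outside its accepted values".toString()
    }
    errors
}

def errors = validate(
    [format: 'bam', threads: 64, paired: true, interleaved: true],
    [['paired', 'interleaved']],
    [format: ['fastq', 'bam', 'sam'], threads: 1..32]
)
assert errors == ['mutually exclusive: paired, interleaved',
                  'threads is outside its accepted values']
```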

Just nudging @rsuchecki and @pditommaso to see if you guys have any thoughts.

illusional avatar Sep 04 '20 01:09 illusional

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 01 '21 04:02 stale[bot]

I would like to see such a feature. @drpatelh did you find any hack to make it work?

maxulysse avatar Feb 01 '21 08:02 maxulysse

I haven't explored some of the workarounds listed above, but I also agree that implementing some form of optional input syntax for DSL2 would be very useful.

bioinfomagician avatar Feb 01 '21 20:02 bioinfomagician

I haven't, I'm afraid. I have resorted to staging "dummy" files to bypass this. See discussion here. Maybe there is a better solution.

drpatelh avatar Feb 01 '21 21:02 drpatelh

Not ideal, but another workaround for using an optional input without staging a dummy file is to pass an empty list as the input path.

This script worked on AWS Batch:

nextflow.enable.dsl=2

process CAT_FILES {
  input:
    path files_to_cat // list of paths
    path optional // optional file

  output:
    path 'out.txt'

  script:
    def args = ['cat']
    files_to_cat.each { args.add(it) }
    if (optional) args.add(optional[0]) // or optional.each { args.add(it) }
    args.add("> out.txt")
    args.join(' ')
}

workflow {
  CAT_FILES(['file1.txt', 'file2.txt'], [])
}

An optional path is just a list of paths with size 1 or 0.

mjhipp avatar Mar 03 '21 00:03 mjhipp

Wanting to bump this - having clear syntax for optional inputs would be really helpful.

CharlotteAnne avatar Jun 09 '21 09:06 CharlotteAnne

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 06 '21 17:11 stale[bot]

Bump

pditommaso avatar Nov 09 '21 08:11 pditommaso

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 01:04 stale[bot]

Related https://github.com/nextflow-io/nextflow/pull/2710

pditommaso avatar Apr 16 '22 08:04 pditommaso

Coming back to bump again ;)

CharlotteAnne avatar Jul 12 '23 06:07 CharlotteAnne

I just encountered this in the kallisto quant module and had to change the module's main.nf (which I'd rather avoid). I totally support this issue!

DariiaVyshenska avatar Mar 29 '24 21:03 DariiaVyshenska