nextflow Inconsistent behaviour for glob process outputs

Inconsistent behaviour for glob process outputs

Open cjw85 opened this issue 2 years ago • 8 comments

Bug report

Expected behavior and actual behavior

According to, https://www.nextflow.io/docs/latest/process.html#multiple-output-files, a glob pattern can be used to emit a list of multiple items from a process into an output channel. What is not clear from the documentation is that if only a single file matches the glob, a length-1 list is not emitted but rather the plain value.

Returning different types from a function (process in this instance) is typically considered bad practice and burdens the caller with having to perform introspection. In the context of of Nextflow this means any operator on a channel first needs to check the returned type, possibly wrap the item thats assumed to be a list, and then safely perform operations such as mapping over the list.

Steps to reproduce the problem

nextflow.enable.dsl=2

process touchOne {
    input:
        val name
    output:
        path "item*"
    script:
        """
        echo "Hello $name"
        touch item1
        """
}

process touchTwo {
    input:
        val name
    output:
        path "item*"
    script:
        """
        echo "Hello $name"
        touch item1
        touch item2
        """
}

workflow {
    touchOne(1).view()
    touchTwo(2).view()
}

Program output

(Copy and paste here output produced by the failing execution. Please highlight it as a code block. Whenever possible upload the .nextflow.log file.)

Environment

Nextflow version: version 20.10.0 build 5430
Java version: OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~16.04-b08)
Operating system: Linux
Bash version: GNU bash, version 4.3.48(1)-release

Additional context

I can see there being an argument about making a breaking change to Nextflow by enforcing that all globs return a list. Despite Nextflow's eschewing of heavy pattern-matching of file artifacts, I suspect the globbing is obused a lot to emit single files such that a change to emitting lists will break much exisiting code.

However, I do think that at least optionally globbing outputs should be forceable to be lists. I suspect theres some corner cases around what a lenth-0 list might mean and how thats handled. Minimally the documentation needs updating to highlight that the returned type is not always a list.

Nov 01 '21 17:11 cjw85

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Mar 31 '22 20:03 stale[bot]

I'm not aware of any changes in consideration of this issue.

Jun 12 '22 10:06 cjw85

I agree that this can make workflows brittle, can this issue be re-opened?

Jun 15 '22 00:06 ryankelley-lm

I agree with your sentiment. It is poor form for a function to return a different type based on the outcome.

Perhaps we could add an option to path so that users can opt out if they want:

    output:
        path "item*", squeeze: false

The squeeze option should be true by default, but if users want a glob to always return a list then they can set the option to false.

Inspired by numpy.squeeze

Jun 21 '22 14:06 bentsherman

That might be a good compromise if you aren't willing to break backwards compatibility.

For my money I would break compatibility. I've not been in the Nextflow game long compared to many others, but I came across this bug the first time I tried to write a non-linear (but still fairly trivial) workflow with a process used to create dynamic inputs for downstream processing. This is the sort of dynamism that Nextflow espouses as a key benefit over say snakemake (where dynamic rules are historically rather clunky). It's almost a must in such a system to have well-typed functions, else we're endlessly boxing return values.

In trivial cases best practice is presumably not to use a glob but more tightly match a single output directly by name (either fixed or through a variable). In code review I would definitely question a:

output:
    path "item*"

as being lazy (or a potential error) and request a replacement to something more specific.

As soon as you are outside the realm of the trivial, it seems to me you will always want squeeze: false. The only use case I can immediately think of is some program outside of your control generating a (truly) random file name --- which is rather silly in itself, maybe timestamped files are a better example. I'd love to see examples of how these globs are used in practice and how many times they could/should be replaced with something more specific.

So for me a default of squeeze: false makes more sense, because what I perceive as a justifiable use case is to expect a multiplicity of values. This breaks backwards compatibility, but only in the sense that there are likely better ways to achieve what users were doing. Being able to set squeeze: true remains for edge cases like the curious examples above.

Jun 22 '22 09:06 cjw85

The thing is that Nextflow already has a pretty aggressive update schedule, and it is known as such. I think people tolerate it because it's easy to switch Nextflow versions, but even so, we try not to push people herder than we already do. If we implement the squeeze option, we may very well change the default to false after a few releases, but there has to be a transition period where users are made aware of a feature and given the opportunity to use it before it becomes the default.

Jun 22 '22 12:06 bentsherman

Duplicate of #1236

Jul 12 '22 14:07 bentsherman

Well spotted. Here the plan was/is to extend the output (and input) declaration with the cardinality of the expected file to be captured e.g. cardinality: 1 for just one element , cardinality: 2 exactly two, cardinality: 1..* one or more, etc.

This would serve both to resolve the ambiguity of the result type and to validate the number of files expected.

update: the cardinality is a mere example, a better naming should be found

Jul 12 '22 15:07 pditommaso

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jan 16 '23 01:01 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Aug 12 '23 06:08 stale[bot]

Fixed by https://github.com/nextflow-io/nextflow/commit/42504d3c83145e16179e22df83e9d35ecc995eca

Sep 11 '23 14:09 bentsherman

nextflow nextflow copied to clipboard

Inconsistent behaviour for glob process outputs

Bug report

Expected behavior and actual behavior

Steps to reproduce the problem

Program output

Environment

Additional context

nextflow
nextflow copied to clipboard