nextflow
nextflow copied to clipboard
Inconsistent behaviour for glob process outputs
Bug report
Expected behavior and actual behavior
According to, https://www.nextflow.io/docs/latest/process.html#multiple-output-files, a glob pattern can be used to emit a list of multiple items from a process into an output channel. What is not clear from the documentation is that if only a single file matches the glob, a length-1 list is not emitted but rather the plain value.
Returning different types from a function (process in this instance) is typically considered bad practice and burdens the caller with having to perform introspection. In the context of of Nextflow this means any operator on a channel first needs to check the returned type, possibly wrap the item thats assumed to be a list, and then safely perform operations such as mapping over the list.
Steps to reproduce the problem
nextflow.enable.dsl=2
process touchOne {
input:
val name
output:
path "item*"
script:
"""
echo "Hello $name"
touch item1
"""
}
process touchTwo {
input:
val name
output:
path "item*"
script:
"""
echo "Hello $name"
touch item1
touch item2
"""
}
workflow {
touchOne(1).view()
touchTwo(2).view()
}
Program output
(Copy and paste here output produced by the failing execution. Please highlight it as a code block. Whenever possible upload the .nextflow.log
file.)
Environment
- Nextflow version: version 20.10.0 build 5430
- Java version: OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~16.04-b08)
- Operating system: Linux
- Bash version: GNU bash, version 4.3.48(1)-release
Additional context
I can see there being an argument about making a breaking change to Nextflow by enforcing that all globs return a list. Despite Nextflow's eschewing of heavy pattern-matching of file artifacts, I suspect the globbing is obused a lot to emit single files such that a change to emitting lists will break much exisiting code.
However, I do think that at least optionally globbing outputs should be forceable to be lists. I suspect theres some corner cases around what a lenth-0 list might mean and how thats handled. Minimally the documentation needs updating to highlight that the returned type is not always a list.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm not aware of any changes in consideration of this issue.
I agree that this can make workflows brittle, can this issue be re-opened?
I agree with your sentiment. It is poor form for a function to return a different type based on the outcome.
Perhaps we could add an option to path
so that users can opt out if they want:
output:
path "item*", squeeze: false
The squeeze
option should be true by default, but if users want a glob to always return a list then they can set the option to false.
Inspired by numpy.squeeze
That might be a good compromise if you aren't willing to break backwards compatibility.
For my money I would break compatibility. I've not been in the Nextflow game long compared to many others, but I came across this bug the first time I tried to write a non-linear (but still fairly trivial) workflow with a process
used to create dynamic inputs for downstream processing. This is the sort of dynamism that Nextflow espouses as a key benefit over say snakemake (where dynamic rules are historically rather clunky). It's almost a must in such a system to have well-typed functions, else we're endlessly boxing return values.
In trivial cases best practice is presumably not to use a glob but more tightly match a single output directly by name (either fixed or through a variable). In code review I would definitely question a:
output:
path "item*"
as being lazy (or a potential error) and request a replacement to something more specific.
As soon as you are outside the realm of the trivial, it seems to me you will always want squeeze: false
. The only use case I can immediately think of is some program outside of your control generating a (truly) random file name --- which is rather silly in itself, maybe timestamped files are a better example. I'd love to see examples of how these globs are used in practice and how many times they could/should be replaced with something more specific.
So for me a default of squeeze: false
makes more sense, because what I perceive as a justifiable use case is to expect a multiplicity of values. This breaks backwards compatibility, but only in the sense that there are likely better ways to achieve what users were doing. Being able to set squeeze: true
remains for edge cases like the curious examples above.
The thing is that Nextflow already has a pretty aggressive update schedule, and it is known as such. I think people tolerate it because it's easy to switch Nextflow versions, but even so, we try not to push people herder than we already do. If we implement the squeeze
option, we may very well change the default to false
after a few releases, but there has to be a transition period where users are made aware of a feature and given the opportunity to use it before it becomes the default.
Duplicate of #1236
Well spotted. Here the plan was/is to extend the output (and input) declaration with the cardinality of the expected file to be captured e.g. cardinality: 1
for just one element , cardinality: 2
exactly two, cardinality: 1..*
one or more, etc.
This would serve both to resolve the ambiguity of the result type and to validate the number of files expected.
update: the cardinality
is a mere example, a better naming should be found
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Fixed by https://github.com/nextflow-io/nextflow/commit/42504d3c83145e16179e22df83e9d35ecc995eca