nextflow icon indicating copy to clipboard operation
nextflow copied to clipboard

Behaviour of `.collect()` and `.toList()`

Open cjw85 opened this issue 1 year ago • 11 comments

Bug report

The semantics of these two operators are poorly documented and counter intuitive. From the documentation its not clear what the different between .collect() and .toList(), as the examples given are identical. .toList() does no in fact return a list.

Expected behavior and actual behavior

.collect()s behaviour is to flatten a single level of structure in the input items. This is undocumented and seems arbitrary.

The reasonably expected behaviour would be simply to return a value channel with all items from the input queue channel as a single emission.

.toList() returns a channel where it might reasonably be expected to return a... um, well, list.

Steps to reproduce the problem

nextflow.enable.dsl=2

workflow {
    ch1 = Channel.of(
        [1,2],
        [3,4,[5,6]],
        7,
        8
    )
    ch1.view()
    ch1.collect().view()
}

Program output

N E X T F L O W  ~  version 22.04.4
Launching `temp1.nf` [distraught_shockley] DSL2 - revision: 021b10ff49
[1, 2]
[3, 4, [5, 6]]
7
8
[1, 2, 3, 4, [5, 6], 7, 8]

Expected output would be:

[[1,2], [3, 4, [5,6]], 7, 8]

which is what .toList() returns.

Environment

  • Nextflow version: 22.04.4
  • Java version: java version "17.0.2" 2022-01-18 LTS
  • Operating system: macOS
  • Bash version: N/A

Additional context

(Add any other context about the problem here)

cjw85 avatar Aug 08 '22 17:08 cjw85

I discovered the same thing recently. What I found was that collect and toList are almost identical, except that collect has two hidden options flat and sort, flat is true by default, and these options are in the operator docs but commented out. 🙁 I have uncommented them in #2816 as part of updating the docs to DSL2.

However I think the two operators are otherwise the same. They both emit a single list item. Every operator returns one or more channels due to their asynchronous nature.

By the way, thank you for raising these issues about language semantics. I tend to agree with you about the various inconsistencies in Nextflow. I think the language just developed organically and is in need of some cleanup. DSL2 was a huge step in that direction, but operators in particular need to be reviewed. In this case, I think we should keep either collect or toList and deprecate the other, and maybe deprecate toSortedList as well. Having all of these different but mostly similar operators just causes a lot of confusion.

bentsherman avatar Aug 08 '22 20:08 bentsherman

I noted similar behaviour previously with other operators: https://github.com/nextflow-io/nextflow/discussions/2437#discussion-3675155

This was originally and issue then got moved to a discussion for reasons unknown.

cjw85 avatar Aug 08 '22 20:08 cjw85

I guess discussions have better longevity because they don't get reaped by the stalebot. Better for threads that don't have specific action items from the start. I also find these three operators confusing. It seems like we could have a single operator with an option like type: 'inner' | 'outer'. Or at least combine cross and combine into a single operator.

bentsherman avatar Aug 08 '22 20:08 bentsherman

combine is a weird one because it has two (I would say) unrelated behaviours as illustrated in the docs: without a join key it's a Cartesian product (all pairwise combinations), with a join key it's an inner product.

To me cross is confusing because both the name and the format of the output are more reminiscent of a Cartesian product, and yet it is operating as an inner join.

It's also not clear from the docs what is the difference between join and the second form of combine. The example for the former uses unique keys whereas the latter has duplicate keys, but the descriptions are in essence identical.

Anyways, we're off topic. I'll end by saying that the situation is so bad that my group had contemplated writing our own library of operators with more rational names and semantics!

cjw85 avatar Aug 08 '22 21:08 cjw85

Well combine and cross both do outer products, but the matching key allows you to do a separate outer product for each key. If you look at the examples you'll see that within each matching key an outer product is happening. join is just doing an inner product but it's eerily similar to groupTuple. I think combine and cross are basically doing the same thing.

Anyway, I don't mind veering off because it's part of the same problem. And I'm not surprised! I'm thinking about documenting these inconsistencies in one place, so that we can find a holistic solution that considers the full library of operators.

bentsherman avatar Aug 08 '22 22:08 bentsherman

The example given for transpose is not anything I would recognise as a transpose from linear algebra. I'm hard pressed to even describe what it's doing.

I can see from the example what the description is getting at but it's just kinda weird the way the the a and b are spread across the rows of the transposed matrix; and relatedly how the transposed matrix for each singleton isn't returned as two 2-tuples. Again like the product operators there's some unintuitive flattening/expansion of nested tuples.

I think I've tried using it in the past and found it very clumsy to use in practice.

cjw85 avatar Aug 08 '22 22:08 cjw85

Okay I've added my thoughts on all of these operators to the mega-discussion (#3107). We'll see if someone shows up with a use case for transpose but I think we can probably just get rid of it.

bentsherman avatar Aug 10 '22 19:08 bentsherman

and break 400 pipelines https://github.com/search?q=transpose+extension%3Anf

pditommaso avatar Aug 10 '22 19:08 pditommaso

To give a bit more insight about the transpose, it was motivated by the need to have an operation in some extent opposite to groupBy. See here.

the name was taken from the corresponding groovy method for list. See here

Regarding toList and collect the latter was motivated by the need to have an operator that similarly to toList would produce all items in a single emission. However collect produce nothing, when there no input. Instead toList produces an empty list.

I agree it this difference should be highlighted (actually I was convinced it was)

pditommaso avatar Aug 10 '22 20:08 pditommaso

Thanks for clearing that up with transpose.

As for collect and toList, I don't see this difference noted in the docs, but I was able to reproduce the behavior you said. Still, it's the same issue as distinct and unique, they have slightly different behavior but you can't tell from the names which is which, it's kinda arbitrary. Think it would be better to express this difference with an option like ifEmpty instead of two different but nearly identical operators.

Or, deprecate toList and suggest collect().ifEmpty([]) instead.

bentsherman avatar Aug 10 '22 20:08 bentsherman

Regarding toList and collect the latter was motivated by the need to have an operator that similarly to toList would produce all items in a single emission. However collect produce nothing, when there no input. Instead toList produces an empty list.

This strikes at what is possibly a source of confusion, many of the operator are compound operators which could be split into more atomic parts. The one-level flattening of collect() seems like bug given your description. The fact it has a non-default flat=false option smells like a retrofit fix without wanting to change the established default.

After a further 20 minutes thought I can at least describe the transpose operator:

  • for each key
  • transpose the matrix of values
  • take each row of the matrix as a new emission
  • spread the key across the new emissions

The problem is, theres a far simpler alternative:

  • for each key
  • transpose the matrix

Which for the 2D case in the documentation example would give:

[a, [p, u], [q, v]]
[b, [s, x], [t, y]]

and a simple 1D case we go from a 1xn list to a nx1 list.

When viewed this way, transpose as implemented is a compound operator, it transposes the values, and spreads the result across a key. A simple inverse of groupTuple might be called plainly spread, defined as the first dimension of the value matrix being spread across the key.

Incidentally in the search you performed theres a number of examples where a transpose is immediately followed by a groupTuple, which (I think) results in my example above.

cjw85 avatar Aug 10 '22 21:08 cjw85

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 16 '23 01:01 stale[bot]

Closing since we clarified the docs for collect and toList in #3276 .

bentsherman avatar Jan 17 '23 15:01 bentsherman