nextflow
nextflow copied to clipboard
Modifying a mutable object (like Map) in a duplicated channel affects all instances
Bug report
Expected behavior and actual behavior
I don't know if this is a bug or just a huge gotcha. Either way, I've been affected by it multiple times and so have others. DSL2 is supposed to automatically copy/duplicate channels when they get re-used multiple times. In general, this works perfectly fine. However, when the channel contains a mutable objects like a Map
and those get modified in place, this affects all channel instances. I don't know the underlying code but I guess for memory reason channels get copied only shallowly, i.e., they reference the same objects. This can lead to more or less subtle bugs in pipelines and surprising behavior.
Steps to reproduce the problem
Consider the following workflow:
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
process ECHO {
input:
val meta
output:
val meta
script:
"""
echo '${meta.id}'
"""
}
process ECHO2 {
input:
val meta
output:
val meta
script:
"""
echo '${meta.id}'
"""
}
workflow {
def ch_input = Channel.of([id: 'test1', idx: 1], [id: 'test2', idx: 2])
ECHO(ch_input).view()
ECHO2(
ch_input.map { meta ->
meta.id = 'foo'
meta
}
).view()
}
This produces below output where all id
keys contain the 'foo'
value. If instead you clone the map, the output is more in line with what I expect.
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
process ECHO {
input:
val meta
output:
val meta
script:
"""
echo '${meta.id}'
"""
}
process ECHO2 {
input:
val meta
output:
val meta
script:
"""
echo '${meta.id}'
"""
}
workflow {
def ch_input = Channel.of([id: 'test1', idx: 1], [id: 'test2', idx: 2])
ECHO(ch_input).view()
ECHO2(
ch_input.map { meta ->
def dup = meta.clone()
dup.id = 'foo'
dup
}
).view()
}
Program output
N E X T F L O W ~ version 21.10.6
Launching `mutable.nf` [friendly_koch] - revision: 49b9711e9b
executor > local (4)
[98/ed1c1a] process > ECHO (1) [100%] 2 of 2 ✔
[a9/4e05a2] process > ECHO2 (2) [100%] 2 of 2 ✔
[id:foo, idx:2]
[id:foo, idx:1]
[id:foo, idx:2]
[id:foo, idx:1]
Output for the "cloned" version:
N E X T F L O W ~ version 21.10.6
Launching `mutable.nf` [romantic_bhabha] - revision: 6df7bba35e
executor > local (4)
[0a/3aed82] process > ECHO (1) [100%] 2 of 2 ✔
[c3/029147] process > ECHO2 (1) [100%] 2 of 2 ✔
[id:test2, idx:2]
[id:test1, idx:1]
[id:foo, idx:2]
[id:foo, idx:1]
Environment
- Nextflow version: 21.10.6
- Java version: openjdk 11.0.13 2021-10-19
- Operating system: Linux 5.15.15-76051515-generic #202201160435~1642693824~20.04~97db1bb-Ubuntu
- Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
I've hit this problem as well, is there a workaround?
The workaround is to clone the map as shown in the second example.
Ah sorry, clearly not paying enough attention! Using .clone()
has fixed it.
For what it's worth, here are my details
- Nextflow version: 21.10.6
- Java version: openjdk 1.8.0
- OS: Ubuntu 18.04.6 LTS
- BASH: 4.4.20
Good thing to add to the docs under a section on troubleshooting or gotchas.
Been thinking about this issue. Fundamentally the mutable object needs to be cloned, the question is whether Nextflow should do it automatically or the user should do it when they need it.
I am leaning towards the latter approach. For Nextflow to perform a deep copy every time a channel is referenced (i.e. implicit into
from DSL1) would add needless overhead in most cases, and in the worst case might copy some large in-memory object (e.g. a CSV file loaded into memory) that either takes a while or causes an OOM error. Also, mutating state in this way kinda goes against the functional paradigm employed by Nextflow.
So I think the best solution is to just use a functional style that doesn't require mutation. Here is your example reworked to produce a new map without mutating or using clone()
at all:
ch_input.map { meta ->
meta + [id: 'foo']
}
See this article
From the linked article:
The easiest way to merge two maps in Groovy is to use + operator. This method is straightforward - it creates a new map from the left-hand-side and right-hand-side maps.
that means it is equivalent to using clone and the modifying the state. On one hand, style-wise, I agree that your suggestion is nice since it can modify multiple keys in one statement. On the other hand, it may not be immediately obvious to every reader of the code that values of the LHS for keys in RHS get overwritten by RHS' values.
Personally, I would then probably prefer to implement a 'merge' function with explanatory docstring just as in the article and use that.
Hey guys, just want to note in this thread as well that .clone() does not work on maps that carry a groupKey.
This works:
#!/usr/bin/env nextflow
nextflow.enable.dsl = 1
/*
* Defines the pipeline inputs parameters
*/
Channel
.from( [ "sample_name":"sample_A" ], [ "sample_name":"sample_B" ], [ "sample_name":"sample_C" ] )
.set { samples_ch }
Channel
.from( ["scaffolds", "file_s"], ["contigs", "file_c"])
.set { genomes_ch }
samples_ch
.combine( genomes_ch )
.map { sample_info, genome_name, index ->
def tmp = sample_info.clone()
tmp.genomeName = genome_name
return [tmp, tuple(genome_name), index] }
.view()
This does not work:
#!/usr/bin/env nextflow
nextflow.enable.dsl = 1
/*
* Defines the pipeline inputs parameters
*/
Channel
.from( [ "sample_name":"sample_A" ], [ "sample_name":"sample_B" ], [ "sample_name":"sample_C" ] )
.set { samples_ch }
Channel
.from( ["scaffolds", "file_s"], ["contigs", "file_c"])
.set { genomes_ch }
samples_ch
.combine( genomes_ch )
.map { sample_info, genome_name, index -> tuple(groupKey(sample_info, 1), genome_name, index )}
.map { sample_info, genome_name, index ->
def tmp = sample_info.clone()
tmp.genomeName = genome_name
return [tmp, tuple(genome_name), index] }
.view()
@Midnighter I think the +
syntax will work great for most people, but it's yet another example of how I think the Nextflow docs just need to have more comprehensive docs on Groovy syntax, instead of only linking to other sources.
@diego-rt Would the +
syntax I showed above solve your problem without having to use clone()
?
Actually yes, the +
syntax seems to work just fine. It doesn't seem to preserve the groupKey of the object though.
Btw, also another thing that seems to be missing AFAIK is a way to remove the groupKey from an object.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
In order to keep this alive, just wanted to comment that I've started using the +
operator more and more @bentsherman . I've grown to like it a lot. It's very concise yet explicit about what's happening.