nextflow icon indicating copy to clipboard operation
nextflow copied to clipboard

Modifying a mutable object (like Map) in a duplicated channel affects all instances

Open Midnighter opened this issue 3 years ago • 5 comments

Bug report

Expected behavior and actual behavior

I don't know if this is a bug or just a huge gotcha. Either way, I've been affected by it multiple times and so have others. DSL2 is supposed to automatically copy/duplicate channels when they get re-used multiple times. In general, this works perfectly fine. However, when the channel contains a mutable objects like a Map and those get modified in place, this affects all channel instances. I don't know the underlying code but I guess for memory reason channels get copied only shallowly, i.e., they reference the same objects. This can lead to more or less subtle bugs in pipelines and surprising behavior.

Steps to reproduce the problem

Consider the following workflow:

#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

process ECHO {
    input:
    val meta

    output:
    val meta

    script:
    """
    echo '${meta.id}'
    """
}

process ECHO2 {
    input:
    val meta

    output:
    val meta

    script:
    """
    echo '${meta.id}'
    """
}

workflow {
    def ch_input = Channel.of([id: 'test1', idx: 1], [id: 'test2', idx: 2])

    ECHO(ch_input).view()

    ECHO2(
        ch_input.map { meta ->
            meta.id = 'foo'
            meta
        }
    ).view()
}

This produces below output where all id keys contain the 'foo' value. If instead you clone the map, the output is more in line with what I expect.

#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

process ECHO {
    input:
    val meta

    output:
    val meta

    script:
    """
    echo '${meta.id}'
    """
}

process ECHO2 {
    input:
    val meta

    output:
    val meta

    script:
    """
    echo '${meta.id}'
    """
}

workflow {
    def ch_input = Channel.of([id: 'test1', idx: 1], [id: 'test2', idx: 2])

    ECHO(ch_input).view()

    ECHO2(
        ch_input.map { meta ->
            def dup = meta.clone()
            dup.id = 'foo'
            dup
        }
    ).view()
}

Program output

N E X T F L O W  ~  version 21.10.6
Launching `mutable.nf` [friendly_koch] - revision: 49b9711e9b
executor >  local (4)
[98/ed1c1a] process > ECHO (1)  [100%] 2 of 2 ✔
[a9/4e05a2] process > ECHO2 (2) [100%] 2 of 2 ✔
[id:foo, idx:2]
[id:foo, idx:1]
[id:foo, idx:2]
[id:foo, idx:1]

Output for the "cloned" version:

N E X T F L O W  ~  version 21.10.6
Launching `mutable.nf` [romantic_bhabha] - revision: 6df7bba35e
executor >  local (4)
[0a/3aed82] process > ECHO (1)  [100%] 2 of 2 ✔
[c3/029147] process > ECHO2 (1) [100%] 2 of 2 ✔
[id:test2, idx:2]
[id:test1, idx:1]
[id:foo, idx:2]
[id:foo, idx:1]

Environment

  • Nextflow version: 21.10.6
  • Java version: openjdk 11.0.13 2021-10-19
  • Operating system: Linux 5.15.15-76051515-generic #202201160435~1642693824~20.04~97db1bb-Ubuntu
  • Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)

Midnighter avatar Feb 18 '22 17:02 Midnighter

I've hit this problem as well, is there a workaround?

adamrtalbot avatar Mar 01 '22 12:03 adamrtalbot

The workaround is to clone the map as shown in the second example.

Midnighter avatar Mar 01 '22 14:03 Midnighter

Ah sorry, clearly not paying enough attention! Using .clone() has fixed it. For what it's worth, here are my details

  • Nextflow version: 21.10.6
  • Java version: openjdk 1.8.0
  • OS: Ubuntu 18.04.6 LTS
  • BASH: 4.4.20

adamrtalbot avatar Mar 01 '22 16:03 adamrtalbot

Good thing to add to the docs under a section on troubleshooting or gotchas.

bentsherman avatar Jun 30 '22 22:06 bentsherman

Been thinking about this issue. Fundamentally the mutable object needs to be cloned, the question is whether Nextflow should do it automatically or the user should do it when they need it.

I am leaning towards the latter approach. For Nextflow to perform a deep copy every time a channel is referenced (i.e. implicit into from DSL1) would add needless overhead in most cases, and in the worst case might copy some large in-memory object (e.g. a CSV file loaded into memory) that either takes a while or causes an OOM error. Also, mutating state in this way kinda goes against the functional paradigm employed by Nextflow.

So I think the best solution is to just use a functional style that doesn't require mutation. Here is your example reworked to produce a new map without mutating or using clone() at all:

        ch_input.map { meta ->
            meta + [id: 'foo']
        }

See this article

bentsherman avatar Aug 10 '22 17:08 bentsherman

From the linked article:

The easiest way to merge two maps in Groovy is to use + operator. This method is straightforward - it creates a new map from the left-hand-side and right-hand-side maps.

that means it is equivalent to using clone and the modifying the state. On one hand, style-wise, I agree that your suggestion is nice since it can modify multiple keys in one statement. On the other hand, it may not be immediately obvious to every reader of the code that values of the LHS for keys in RHS get overwritten by RHS' values.

Personally, I would then probably prefer to implement a 'merge' function with explanatory docstring just as in the article and use that.

Midnighter avatar Aug 14 '22 17:08 Midnighter

Hey guys, just want to note in this thread as well that .clone() does not work on maps that carry a groupKey.

This works:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 1

/*
 * Defines the pipeline inputs parameters
 */

Channel
    .from( [ "sample_name":"sample_A" ], [ "sample_name":"sample_B" ], [ "sample_name":"sample_C" ] )
    .set { samples_ch }

Channel
    .from( ["scaffolds", "file_s"], ["contigs", "file_c"])
    .set { genomes_ch }

samples_ch
    .combine( genomes_ch )
    .map { sample_info, genome_name, index -> 
            def tmp = sample_info.clone()
            tmp.genomeName = genome_name
            return [tmp, tuple(genome_name), index] }
    .view()

This does not work:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 1

/*
 * Defines the pipeline inputs parameters
 */

Channel
    .from( [ "sample_name":"sample_A" ], [ "sample_name":"sample_B" ], [ "sample_name":"sample_C" ] )
    .set { samples_ch }

Channel
    .from( ["scaffolds", "file_s"], ["contigs", "file_c"])
    .set { genomes_ch }

samples_ch
    .combine( genomes_ch )
    .map { sample_info, genome_name, index -> tuple(groupKey(sample_info, 1), genome_name, index )}
    .map { sample_info, genome_name, index -> 
            def tmp = sample_info.clone()
            tmp.genomeName = genome_name
            return [tmp, tuple(genome_name), index] }
    .view()

diego-rt avatar Aug 15 '22 18:08 diego-rt

@Midnighter I think the + syntax will work great for most people, but it's yet another example of how I think the Nextflow docs just need to have more comprehensive docs on Groovy syntax, instead of only linking to other sources.

@diego-rt Would the + syntax I showed above solve your problem without having to use clone()?

bentsherman avatar Aug 16 '22 13:08 bentsherman

Actually yes, the + syntax seems to work just fine. It doesn't seem to preserve the groupKey of the object though.

Btw, also another thing that seems to be missing AFAIK is a way to remove the groupKey from an object.

diego-rt avatar Aug 18 '22 11:08 diego-rt

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 16 '23 01:01 stale[bot]

In order to keep this alive, just wanted to comment that I've started using the + operator more and more @bentsherman . I've grown to like it a lot. It's very concise yet explicit about what's happening.

Midnighter avatar Jan 16 '23 10:01 Midnighter