differentialabundance

multi-config resume behaviour

suzannejin opened this issue 5 months ago • 5 comments

Hello @pinin4fjords, @mirpedrol, @JoseEspinosa, @grst, @atrigila. I'm opening this issue to discuss how multi-config should be implemented without disrupting Nextflow's resume behaviour.

Basically, there are trade-offs for each implementation. After discussing with @mirpedrol, our current preference is alternative 1, but let me know your thoughts.

Current behaviour

If we have the following paramsheet:

set1: same params for modules before deseq2 -- same for deseq2 -- set 1 params for modules after deseq2
set2: same params for modules before deseq2 -- same for deseq2 -- set 2 params for modules after deseq2
set3: same params for modules before deseq2 -- same for deseq2 -- set 3 params for modules after deseq2

And we run all of them with --paramset all. We want to compute deseq2 and the steps before it only once.

To do so, our current implementation groups the channel with 3 entries into a channel with 1 entry. This channel carries a simplified meta, containing only the names of the sets that share the same configs, plus the configs up to the module being run (a sketch of this grouping follows the example below):

[
  paramset_names:[set1, set2, set3], 
  params: [all the params before and during deseq2]
]
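
For concreteness, here is a minimal sketch of that grouping as a standalone Nextflow script. The channel contents and the SHARED_KEYS list are hypothetical stand-ins, not the pipeline's actual identifiers:

def SHARED_KEYS = ['filtering', 'deseq2']   // params shared up to and including deseq2

Channel
    .of(
        [ paramset_name: 'set1', params: [filtering: 'a', deseq2: 'x', gsea: 'p'] ],
        [ paramset_name: 'set2', params: [filtering: 'a', deseq2: 'x', gsea: 'q'] ],
        [ paramset_name: 'set3', params: [filtering: 'a', deseq2: 'x', gsea: 'r'] ]
    )
    .map { meta -> [ meta.params.subMap(SHARED_KEYS), meta.paramset_name ] }
    .groupTuple()   // collapse entries sharing the same upstream params
    .map { shared, names -> [ paramset_names: names, params: shared ] }
    .view()         // [paramset_names:[set1, set2, set3], params:[filtering:a, deseq2:x]]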

Problem

Now suppose we want to run set1 with a slightly changed paramset after deseq2, plus a new set4:

set1: same params for modules before deseq2 -- deseq2 -- set 1 params for modules after deseq2, but slightly changed
set4: same params for modules before deseq2 -- deseq2 -- set 4 params for modules after deseq2

We will have the following meta:

[
  paramset_names:[set1, set4], 
  params: [all the params before and during deseq2]
]

Because the meta changed from [set1,set2,set3] to [set1,set4], Nextflow will not recognize the task as the same and will not resume the computation already performed for set1.

The same happens when we run --paramset set1 alone.

Alternative 1: a trade-off

Instead of avoiding recomputation of modules while running multiple configs, we ensure resume works well for each individual config.

This means we will not group the channels. So, for the case of running set1,set2,set3, we will have the following meta:

[paramset_name:set1, params: [all the params before and during deseq2]]
[paramset_name:set2, params: [all the params before and during deseq2]]
[paramset_name:set3, params: [all the params before and during deseq2]]

Pros: this ensures individual sets are properly resumed, just like the standard behaviour of Nextflow. When we run --paramset set1 or --paramset set1,set2, -resume will properly recapitulate each of the individual sets that were run before.

Cons: when running multiple different paramsets in parallel, modules sharing the same configurations will be computed multiple times.

Alternative 2

Here we do not provide the paramset name to the module at all. We only provide the params before and during the module as meta. Resume then works well, and modules sharing the same params are computed only once.

However, there is a problem: how do we store the output? Currently the outputs are stored as outdir/paramset/id.csv. If the module does not know the name of the paramset, it cannot do that.

A possible solution is to hash the params provided to the module and use this hash as part of the output name, such as outdir/id_methodxx_hashxxx.csv, and then save the hash information to a table:

set1: hash1,hash2,hash3
set2: hash1,hash2,hash4
...

But this makes it more difficult for the user to find the results they want.
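
For illustration, here is one way such a hash could be derived in plain Groovy. The function name and param values are hypothetical; the only requirement is that the hash is deterministic for a given set of params:

import java.security.MessageDigest

def paramsHash(Map params, int length = 8) {
    // Sort by key so the hash does not depend on map insertion order
    def canonical = params.sort().toString()
    MessageDigest.getInstance('MD5')
        .digest(canonical.bytes)
        .encodeHex()
        .toString()
        .take(length)
}

def hash = paramsHash([deseq2: 'x', filtering: 'a'])
println "outdir/id_deseq2_${hash}.csv"   // e.g. outdir/id_deseq2_5f2b9c1d.csv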

suzannejin avatar Jun 10 '25 10:06 suzannejin

I think this partly depends on the frequency of each use case, and I would not necessarily be in favour of reverting the effort we made to prevent re-computation across paramsets to support the use case of repeated edits and re-runs.

Why don't we just remove paramset_names from the meta before we pass it to the process? We can store the mapping to sample sets and re-associate it after.

pinin4fjords avatar Jun 19 '25 14:06 pinin4fjords

Why don't we just remove paramset_names from the meta before we pass it to the process? We can store the mapping to sample sets and re-associate it after.

How would you re-associate it after? I was thinking about using hashes, but naming the files with hashes hurts readability.

An alternative is to not store them immediately after running the modules. These files will still be produced as id_hashxxx.csv, but then we can gather them in a final module that simply maps the hashes to the paramset names and renames the files. The downside is that the files will be accessible to the user only at the end of the pipeline run, not during it.
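
A rough sketch of what such a final module could look like. The process name and the table format are hypothetical, and for simplicity this version only groups files into per-paramset directories rather than stripping the hash from the name:

process REASSOCIATE_NAMES {
    publishDir params.outdir, mode: 'copy'

    input:
    path results      // all collected id_<hash>.csv files
    path hash_table   // two-column CSV: paramset,hash

    output:
    path '*/*.csv'

    script:
    """
    while IFS=, read -r paramset hash; do
        mkdir -p "\$paramset"
        for f in *"\$hash"*.csv; do
            [ -e "\$f" ] && cp "\$f" "\$paramset/"
        done
    done < ${hash_table}
    """
}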

suzannejin avatar Jun 20 '25 08:06 suzannejin

It is possible to build Groovy objects that have properties which are ignored by Nextflow caching. Would it work to put the paramset names in such an object?

In lib/DataContainer.groovy:

class DataContainer {
    // Fields
    def a   // included in the cache hash
    def b   // included in the cache hash
    def c   // ignored by the cache hash

    DataContainer(a, b, c) {
        this.a = a
        this.b = b
        this.c = c
    }

    // Nextflow relies on hashCode for resuming.
    // By not including `c` here, we make Nextflow ignore this property during resume.
    @Override
    int hashCode() {
        return Objects.hash(a, b)
    }

    // Keep equals consistent with hashCode: objects that hash the same compare equal.
    @Override
    boolean equals(Object other) {
        other instanceof DataContainer && a == other.a && b == other.b
    }
}
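
For illustration, a quick check of the intended behaviour (the field contents here are hypothetical):

def meta1 = new DataContainer([filtering: 'a'], [deseq2: 'x'], ['set1', 'set2', 'set3'])
def meta2 = new DataContainer([filtering: 'a'], [deseq2: 'x'], ['set1', 'set4'])

// Same params, different paramset groupings: the hashes match,
// so the grouping change should not invalidate a resume.
assert meta1.hashCode() == meta2.hashCode()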

grst avatar Jun 20 '25 13:06 grst

@grst's idea isn't a bad one! Worth a try; it might be the tidiest way of doing this if it works.

How would you re-associate it after? I was thinking about using hashes, but naming the files with hashes hurts readability.

I meant that we could have a channel that just stores the mapping between paramset_names and [all the params before and during deseq2]. Then we could delete paramset_names from the map before the differential process, and re-add it after. But yes, that doesn't solve the issue of the storage identifier. That being the case I don't really see a way around storing files by some group ID, and the hash isn't a bad call.
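
For concreteness, a rough sketch of that strip-and-re-associate pattern. The process and channel names are hypothetical; the operators are standard Nextflow:

// Side channel: shared params (the key) -> paramset names
ch_mapping = ch_input.map { meta -> [ meta.params, meta.paramset_names ] }

// Strip the names so caching only sees the params
ch_stripped = ch_input.map { meta -> meta.findAll { k, v -> k != 'paramset_names' } }

DIFFERENTIAL( ch_stripped )

// Join the names back using the same params key
DIFFERENTIAL.out.results
    .map { meta, results -> [ meta.params, results ] }
    .join( ch_mapping )
    .map { key, results, names -> [ [params: key, paramset_names: names], results ] }
    .set { ch_annotated }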

If we don't go the Groovy object way, I actually wonder if the new channel-driven output syntax could be useful here? https://www.nextflow.io/docs/latest/workflow.html#workflow-outputs

pinin4fjords avatar Jun 23 '25 16:06 pinin4fjords

Actually, the channel-based way of saving output could still be interesting even if the Groovy objects solve the caching issue. It also provides a way to save the files more consistently.

Let's say we have [paramset1, paramset2], where some of the steps overlap and others don't. If we save everything based on channels, with the corresponding paramset id, then we can structure the output directory with something like:

/paramset1/table/deseq2/differential_result.csv
/paramset2/table/deseq2/differential_result.csv   (the same result, published under both sets)
/paramset1/table/gsea/gsea_result.csv
/paramset2/table/gprofiler2/gprofiler2_result.csv

Then users will have all the outputs for the same paramset in one place.

Instead of the current version in the multi-config PR:

/table/deseq2/paramset1,paramset2/differential_result.csv
/table/gsea/paramset1/gsea_result.csv
/table/gprofiler2/paramset2/gprofiler2_result.csv
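
To make the idea concrete, a rough sketch using the preview workflow-outputs syntax linked above. This feature is still evolving, so the exact syntax depends on the Nextflow version, and all names here (DESEQ2, ch_input, meta.method) are hypothetical:

nextflow.preview.output = true

workflow {
    main:
    ch_tables = DESEQ2(ch_input)   // channel of [meta, csv] tuples

    publish:
    ch_tables >> 'tables'
}

output {
    tables {
        // Route each file under its paramset, e.g.
        // paramset1/table/deseq2/differential_result.csv
        path { meta, csv -> "${meta.paramset_name}/table/${meta.method}" }
    }
}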

suzannejin avatar Oct 24 '25 11:10 suzannejin