MultiQC not resuming

Open apeltzer opened this issue 1 year ago • 11 comments

apeltzer avatar Aug 08 '24 13:08 apeltzer

unclear if this is intended or not --> verify

apeltzer avatar Aug 08 '24 13:08 apeltzer

Ask @fmalmeida what he had to do to make this work :)

apeltzer avatar Aug 11 '24 08:08 apeltzer

I thought there was a "don't cache" setting somewhere, and it was intended, but there's not. It happens on every nf-core pipeline...

@ewels Any thoughts on where this is coming from?

Might be better to move this to tools.

edmundmiller avatar Aug 11 '24 16:08 edmundmiller

Thought the same initially, but it's not been set here. Not a major problem here anyway (and negligible runtime too, considering how much $$$ goes into demuxing an entire flowcell ;-)).

apeltzer avatar Aug 12 '24 07:08 apeltzer

I wouldn't say the runtime is negligible... on a recent large flow cell, MultiQC ran for ~1h (not sure how much of that time was wasted on staging in files, though).

I also never got why one would intentionally not resume multiqc...

grst avatar Aug 12 '24 07:08 grst

Hey hey hey, two things stop the MultiQC module from caching: the cache = false that is sometimes added, as @edmundmiller mentioned, and, more importantly, the fact that a lot of run-specific metadata is added to the MultiQC summary map. That makes this input map of metadata different for every run, so the task is never cached. See here:

https://github.com/nf-core/demultiplex/blob/master/lib/NfcoreTemplate.groovy#L72-L95
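
To illustrate the effect, here is a minimal Python sketch. It is a toy model, not Nextflow's actual hashing: task_key and both summary maps are made up for illustration, and the only assumption is that a cache key is derived from everything a task receives as input.

```python
import hashlib
import json


def task_key(inputs):
    """Toy stand-in for a task cache key: a hash over all input values."""
    canonical = json.dumps(inputs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical summary maps for two runs of the same pipeline on the same data.
run_a = {
    "pipeline": "nf-core/demultiplex",
    "runName": "angry_darwin",
    "start": "2024-08-12T07:00:00",
}
run_b = {
    "pipeline": "nf-core/demultiplex",
    "runName": "happy_euler",
    "start": "2024-08-12T09:30:00",
}

# Only the run-specific values differ, yet the key changes, so nothing is reused.
keys_differ = task_key(run_a) != task_key(run_b)
print(keys_differ)  # True
```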

fmalmeida avatar Aug 12 '24 07:08 fmalmeida

This means that it's not so easy to adapt this without changing what is ingested into workflow_summary_mqc.yaml and methods_description_mqc.yaml, as some of the variables contain timestamps and are therefore updated on every resume. To be more explicit, let's close this ticket, set cache = false in conf/modules.config for MultiQC (so that users get what they think they will get), and leave it as is. If we decide to take this on at some point, we can still do it in a next / patch release. Thanks for your points @fmalmeida :)

apeltzer avatar Aug 12 '24 13:08 apeltzer

I assessed this in the current dev branch (commit id: 892b9d8cc5beade252777428bd6df440dd874468). The main conflicting channel is ch_multiqc_files, which contains two files that differ with each execution: workflow_summary_mqc.yaml and methods_description_mqc.yaml.

These files are modified with each execution because they contain data such as the execution timestamp and runName, among others. In order to have MultiQC resume, we would need to:

  1. Change the collect operator for ch_multiqc_files to add "sort: true".
  2. Update the content of the workflow_summary_mqc.yaml file to remove runName, or develop a rule so that it reuses the runName of the previous execution if every other process was run from cache.
  3. Update the methods_description_mqc.yaml file so that it doesn't contain runName, timestamp, or any other value that changes between executions, or use a similar rule as for workflow_summary_mqc.yaml.
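
A rough Python sketch of what these three steps are meant to achieve. This is a toy model and not Nextflow code: inputs_key, strip_volatile, VOLATILE_KEYS, and the file names and metadata values are all hypothetical, chosen only to show that a sorted file list plus run-independent metadata gives the same key across executions.

```python
import hashlib
import json

# Assumption: the keys that change on every execution (illustrative names).
VOLATILE_KEYS = {"runName", "timestamp"}


def strip_volatile(summary):
    """Steps 2 and 3: drop values that differ between executions."""
    return {key: value for key, value in summary.items() if key not in VOLATILE_KEYS}


def inputs_key(file_names, summary):
    """Toy cache key over a list of file names and a summary map."""
    payload = {
        "files": sorted(file_names),         # step 1: emission order no longer matters
        "summary": strip_volatile(summary),  # steps 2-3: run-specific values removed
    }
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical executions: same files in a different order, different run metadata.
first = inputs_key(
    ["fastp.json", "falco.txt"],
    {"runName": "angry_darwin", "timestamp": "2024-08-12T07:00:00", "outdir": "results"},
)
second = inputs_key(
    ["falco.txt", "fastp.json"],
    {"runName": "happy_euler", "timestamp": "2024-08-12T09:30:00", "outdir": "results"},
)

print(first == second)  # True
```

Parameters that really do change the result (outdir in this sketch) still change the key, so a genuine change of inputs would still trigger a re-run.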

nschcolnicov avatar Aug 12 '24 13:08 nschcolnicov

Thanks for the analysis... If this is to be changed, then it should happen at the pipeline template level in nf-core/tools.

grst avatar Aug 12 '24 13:08 grst

Added it: https://github.com/nf-core/demultiplex/pull/239

nschcolnicov avatar Aug 12 '24 14:08 nschcolnicov

I will file an issue there and x-ref this ticket, so we can take it up once the wider community has reached a decision :) See this one: https://github.com/nf-core/tools/issues/3110

apeltzer avatar Aug 12 '24 14:08 apeltzer