differentialabundance icon indicating copy to clipboard operation
differentialabundance copied to clipboard

unification of paramsheet and profile

Open suzannejin opened this issue 7 months ago • 14 comments

Description of feature

With the implementation of the multi-config paramsheet in #423 #422 #464, the profiles such as rnaseq, affy, etc become redundant. Also, it might be confusing to have -profile rnaseq and --paramset_name deseq2_rnaseq on another side from the user perspective.

Here I open a thread to discuss the best way to deal with this.

suzannejin avatar May 29 '25 08:05 suzannejin

Hey @pinin4fjords, as you mentioned in https://github.com/nf-core/differentialabundance/pull/443#discussion_r2091378271 it will be nice to start thinking about coherent ways of implementation that don't lead to confusion and redundance. And polish this aspect before the release.

One way could be to deprecate those profiles in favuor of the paramsheet. But we could discuss it here.

Also CC @mirpedrol, @JoseEspinosa, @grst

suzannejin avatar May 29 '25 08:05 suzannejin

Alternatively, we can do the following:

  1. Define each row in the current default paramsheet as a yaml config in conf.
  2. Define a function in utils that load these yaml configs into ch_paramset, when chosen by the user through --paramset config1,config2
  3. In the case when user provide a custom paramset, load this into ch_paramset instead.
  4. The rest will work the same as in #443

Then from the user perpective, changes will be minimal:

  • default config files still stay inside conf
  • easily call parallel runs from default confs through --paramset deseq2_rnaseq,limma_rnaseq
  • easily run all rows from custom paramsheet through --paramsheet_custom
  • cmd running from default confs only change from -profile docker,rnaseq to -profile docker --paramset deseq2_rnaseq

Also loop in @atrigila

suzannejin avatar May 30 '25 07:05 suzannejin

Thanks @suzannejin !

I thought a lot about this when you were prototyping your big PR. In an ideal world I would have wanted the whole thing to be implemented with profiles, but even after trying quite hard I couldn't figure out how to iterate internally over profiles.

I definitely thing we need to kill the technology-wise profiles. We can easily port the same settings to a line in the default paramsets so people will just do nextflow run ... --paramset rnaseq rather than nextflow run ... -profile rnaseq.

Open to persuasion, but I don't see any value in confusing the picture by retaining the config file set alongside the paramsets. The next release is going to be very 'breaking' anyway, so we shouldn't retain things unless they have clear value.

pinin4fjords avatar Jun 02 '25 09:06 pinin4fjords

Hey @pinin4fjords, I totally agree with you to not have config file set alongside the paramsets.

I was just suggesting on a possible format that would keep the default configs (currently rows in default paramsheet) organized in a way similar to the rest of nf-core pipelines. Of course, maybe not for now.

So, now we have one paramsheet.csv containing all the configs. Instead, we can have one yaml file per config inside conf. In other words, each row in the current default paramsheet will be one yaml file inside conf. Then, the paramsheet parser should be able to parse each of those yaml files and stack them together internally, in contrast to now reading one default paramsheet. And only when the user provide a custom paramsheet, we read it in the current way, all the rows.

So we will still have to do --paramset rnaseq_deseq,rnaseq_limma instead of -profile but, this should improve readability and some structural coherence with the rest of pipelines.

suzannejin avatar Jun 02 '25 13:06 suzannejin

OK, now I'm understanding better. So we ditch paramsheet.csv entirely? I like that. We'd basically be treating paramsets almost like profiles, which is a good thing.

One issue would be that someone wanting to supply their own paramsets would have to supply a bunch of paramset files.

I wonder if it would be easier to have one YAML:

setA:
  paramA: foo

setB:
  paramB: bar

Then, a user could just supply a single paramsets yml file to override the default we have in conf/. I get what you're trying to do to get closer to other workflows, but we're already way outside those bounds in your whole project, I don't think we gain much by trying to get closer artificially if it gives us more overheads.

pinin4fjords avatar Jun 02 '25 14:06 pinin4fjords

OK, now I'm understanding better. So we ditch paramsheet.csv entirely? I like that. We'd basically be treating paramsets almost like profiles, which is a good thing.

One issue would be that someone wanting to supply their own paramsets would have to supply a bunch of paramset files.

I was thinking having multiple default single yaml inside conf, and one multi-config if custom provided by user. But it is true that it may introduce unnecesary confusion too.

I wonder if it would be easier to have one YAML:

setA:
  paramA: foo

setB:
  paramB: bar

Then, a user could just supply a single paramsets yml file to override the default we have in conf/. I get what you're trying to do to get closer to other workflows, but we're already way outside those bounds in your whole project, I don't think we gain much by trying to get closer artificially if it gives us more overheads.

And yes, one single yaml will be the simplest solution. Then, it will be basically #464 and deprecate current single nextflow configs

suzannejin avatar Jun 02 '25 16:06 suzannejin

Hello @pinin4fjords @grst @mirpedrol @JoseEspinosa @caraiz2001 , I am finally coming back from a long stop due to thesis writing.

So, before we introduced some large changes to enable iteration of configs by passing the params through channel meta. This works, but then conf profiles should be removed to avoid redundancy and create confusion for users. I tried a pair of things to do so at #470.

Below you have a summary of how it works. Let me know what you think, and if it looks okay, I would start fixing the snapshots and other details to formalize the PR.

Default paramsheet in substitution of profiles

The default paramsheet.yaml file will be stored in the conf/ folder and will behave just like the current default profiles. And the included configs can be called by the user through nextflow run main.nf --paramset_name deseq2_rnaseq [...] for example. Or run in parallel different configs on the same data through nextflow run main.nf --paramset_name deseq2_rnaseq,limma_rnaseq [...]. This will enable the user to try multiple methods on their data as they prefer.

Image

Note that I tried to have the structure as similar as possible to the current config, with also the possibility to use include to add upstream parameters from another paramset, just like how includeConfig enables including upstream config parameters. Perhaps, I can name include to includeParamset for clarity, but you get the idea. This include functionality is enabled within the function that parses the paramsheet. It goes and search paramsheet file from the available scope: conf/ folder and the user provided --paramsheet

Custom paramsheet if needed

One can also provide a custom paramsheet. This is exemplified by the test_paramsheet.yaml:

Image

Note that the testdata is just a yaml file inside conf/ to avoid copy pasting the same paths all the time. Anyway, input files are part of the params scope that can be iterated now.

Then the user can run multiple custom paramsets in parallel, if they want, by:

nextflow run main.nf \
  --paramsheet conf/test_paramsheet.yaml \
  --paramset_name deseq2_rnaseq_gsea,deseq2_rnaseq_gsea_complex \
  [...]

or all for everything in their paramsheet:

nextflow run main.nf \
  --paramsheet conf/test_paramsheet.yaml \
  --paramset_name all \
  [...]

suzannejin avatar Oct 06 '25 14:10 suzannejin

Thank you @suzannejin, this all sounds pretty good. In an ideal world we'd just be iterating profiles, but since that's not a thing, making something that looks as close to that as possible is a good plan.

I'd say go for it and we'll have a full discussion on the PR when it's finalised.

pinin4fjords avatar Oct 10 '25 13:10 pinin4fjords

I like the suggestion!

I'm wondering if we should even support the parameter inheritance with include. On the one hand, it can be used to make the configuration modular, on the other hand

  • it makes it harder for users to reason about a paramset. For instance, in this example Image you have (at least!) three places you need to check to understand what parameters you might have in the resolved config.

  • it can become tricky to implement and a potential source of bugs. For instance, some possible edge cases:

    • long hierarchies: you inherit from one paramset that inherits from another
    • multi-inheritance: you inherit from multiple different profiles, what will take precedence?
    • to make it even more complex, say A inherits from B, D and B inherits from D. Now what takes precedence?
    • simple key-value pairs are straightforward to override, but maybe there could be lists or nested objects? Entire libraries have been written for that! e.g. hiyapyco.

grst avatar Oct 10 '25 14:10 grst

Hi @grst I see your concern. I wonder if there is an alternative though, maybe by limiting the scope of the include function or something? Cuz, without any include, I think it would be a hell for the users to copy paste all the parameters from the default "profiles" in this new paramsheet into their custom paramsheet.

suzannejin avatar Oct 22 '25 12:10 suzannejin

I think it would be a hell for the users to copy paste all the parameters from the default "profiles" in this new paramsheet into their custom paramsheet.

Is it so bad though? I'd imagine we'd have a predefined profile for, say, "RNAseq with DESeq2". If someone wants to adjust that they'd have to copy from a single entry only.

grst avatar Oct 23 '25 05:10 grst

I really think the inclusion functionality is important, otherwise you end up duplicating a bunch of stuff.

pinin4fjords avatar Oct 23 '25 13:10 pinin4fjords

DRY isn't absolute, I really think it needs to be balanced against additional complexity.

Can we come up with an example what would such a config file realistically look like with or without inclusion mechanism? And if it turns out to be a big benefit can there be constraints, e.g. only key-value pairs supported, no multiple inheritance, etc.?

grst avatar Oct 23 '25 15:10 grst

Hello @pinin4fjords @grst @SusiJo @mirpedrol

So #470 has all the main implentations and changes needed for it to work, and enable parallel running of multiple analytical paths. All the tests pass now.

I have also tried in #548 to simplify the paramsheet include mechanism to remove multi-inheritance, but long hierarchies are still allows. I think this is fine.

In #547 there is the version without include functionality at all. But myself I found it hard to work copy-pasting all the time. Indeed, I ended up breaking some tests, because when copy pasting I may copied from the wrong side, etc.

We could further discuss about this during the hackathon :)

suzannejin avatar Oct 27 '25 13:10 suzannejin