
Add nextdenovo to nf-core modules.

elmedjadjirayane opened this issue 6 months ago • 31 comments

PR checklist

Closes #XXX

  • [ ] This comment contains a description of changes (with reason).
  • [ ] If you've fixed a bug or added code that should be tested, add tests!
  • [x] If you've added a new tool - have you followed the module conventions in the contribution docs
  • [ ] If necessary, include test data in your PR.
  • [ ] Remove all TODO statements.
  • [x] Emit the versions.yml file.
  • [ ] Follow the naming conventions.
  • [ ] Follow the parameters requirements.
  • [ ] Follow the input/output options guidelines.
  • [ ] Add a resource label
  • [x] Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • [x] nf-core modules test <MODULE> --profile docker
      • [ ] nf-core modules test <MODULE> --profile singularity
      • [x] nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • [ ] nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • [ ] nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • [ ] nf-core subworkflows test <SUBWORKFLOW> --profile conda

elmedjadjirayane avatar May 16 '25 12:05 elmedjadjirayane

I'm concerned that this test is failing on docker only

maxulysse avatar May 26 '25 15:05 maxulysse

Maybe it is failing in docker because docker is somehow limiting memory for the parallel tasks?

nschan avatar Jun 10 '25 08:06 nschan

Hello @elmedjadjirayane, this seems to be somewhat stale, are there things blocking you that you need help resolving?

nschan avatar Jun 18 '25 09:06 nschan

Hello @elmedjadjirayane, this seems to be somewhat stale, are there things blocking you that you need help resolving?

Hello, I was away these last few days. For now there is the issue with Docker that I don't understand: when I test locally the test passes, but here it fails. I'm not sure how to debug this, to be honest.

elmedjadjirayane avatar Jun 18 '25 10:06 elmedjadjirayane

I would suggest modifying the test config to run fewer tasks in parallel and see if that helps with the OOM error in the Docker tests.
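
For instance, a minimal sketch (the process selector and values are illustrative, not the final test config) of capping resources in the module's tests/nextflow.config so that fewer NextDenovo jobs run in parallel on the CI runner:

process {
    withName: 'NEXTDENOVO' {
        cpus   = 2
        memory = 6.GB
    }
}

Since the module writes parallel_jobs = ${task.cpus} into its config, lowering cpus here also lowers the number of parallel jobs.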

nschan avatar Jun 18 '25 11:06 nschan

@nschan thanks! I think it worked?

elmedjadjirayane avatar Jun 23 '25 07:06 elmedjadjirayane

Nice, that seems to have solved the test problem.

nschan avatar Jun 23 '25 07:06 nschan

It has been a while since this started, and I am not sure what we decided on in the end; I think we had agreed that configs for the module are to be provided by the pipeline / person using the module. Is that correct?

nschan avatar Jun 23 '25 07:06 nschan

It has been a while since this started, and I am not sure what we decided on in the end; I think we had agreed that configs for the module are to be provided by the pipeline / person using the module. Is that correct?

Yes, I think we had agreed to that!

elmedjadjirayane avatar Jun 23 '25 07:06 elmedjadjirayane

Ok, maybe it would be good to provide some example config. To me it seems like generating a config correctly (e.g. putting in the right number of CPUs) is some effort, and if you have already solved that, maybe it would be good to have it included somewhere?

nschan avatar Jun 23 '25 07:06 nschan

Where do we need to provide that? Would it be the test config we are using, or do you mean something else?

elmedjadjirayane avatar Jun 23 '25 07:06 elmedjadjirayane

I am also not completely sure what would be the best way. There are other tools with configs; for example, https://github.com/nf-core/modules/tree/master/modules/nf-core/getorganelle seems to have a process to generate configs. My main concern (but maybe I am wrong) is that for something like ${task.cpus} to actually have the correct value, the file will need to be echo'ed or similar from within the process. The other concern is to avoid making every person using the module hunt for the default config to copy somewhere into their folder. I think you could also use $moduleDir to place a config inside the module dir, or some subdirectory of it.

nschan avatar Jun 23 '25 08:06 nschan

@mahesh-panchal suggested essentially dumping a YAML from parameters, which seems like an easy and transparent solution to me (see https://github.com/nf-core/modules/blob/master/modules/nf-core/quartonotebook/main.nf). Let me know if that makes sense to you.

nschan avatar Jun 23 '25 09:06 nschan

Do you mean that whoever uses the module would not have to provide that file? I don't understand how we would generate it automatically. I think the idea behind this file is to avoid typing a lot of parameters on the command line and to group them in one file that we can modify. Sorry, maybe I am not seeing it from the right angle.

elmedjadjirayane avatar Jun 24 '25 07:06 elmedjadjirayane

The issue that I see is: how can resources be adjusted dynamically upon retries if the file is provided by the user? If the file contains ${task.cpus}, I don't think this will be substituted with the variable's value. What you could do (I am not very good at Groovy, there is probably a smarter way) is something like this:

script:
    nextdenovo_parameters = [
        job_type: "local",
        job_prefix: "nextDenovo",
        task: "all",
        rewrite: "yes",
        deltmp: "yes",
        parallel_jobs: "${task.cpus}",
        input_type: "raw",
        read_type: "ont",
        input_fofn: "input.fofn",
        workdir: "${meta.id}_work",
        read_cutoff: 1000,
        genome_size: "3g",
        sort_options: "-m 50g -t 30",
        minimap2_options_raw: "-t 8",
        pa_correction: 5,
        correction_options: "-p 30",
        minimap2_options_cns: "-t 8",
        nextgraph_options: "-a 1",
    ]
    def yamlBuilder = new groovy.yaml.YamlBuilder()
    yamlBuilder(nextdenovo_parameters)
    def yaml_content = yamlBuilder.toString().tokenize('\n').join("\n    ").replace(":", " =").replace('"','')
    """
    cat <<- END_YAML_PARAMS > nextdenovo.yml
    ${yaml_content}
    END_YAML_PARAMS
    """

this creates a file with this content (meta.id was "test"):

---
job_type = local
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 1
input_type = raw
read_type = ont
input_fofn = input.fofn
workdir = test_work
read_cutoff = 1000
genome_size = 3g
sort_options = -m 50g -t 30
minimap2_options_raw = -t 8
pa_correction = 5
correction_options = -p 30
minimap2_options_cns = -t 8
nextgraph_options = -a 1

which allows the config parameters to be edited directly in the script.

I think this is still not optimal; ideally all of these would go into params for this module, so they can be edited without having to modify the module file. If this is hard-coded in the script, everyone would need to patch main.nf for this module in their pipeline. I think this could go into a nextflow.config in the module dir:

params {
   nextDenovo_job_type = "local"
   nextDenovo_job_prefix = "nextDenovo"
   ...
}

and the script would become:

script:
    nextdenovo_parameters = [
        job_type: params.nextDenovo_job_type,
        job_prefix: params.nextDenovo_job_prefix,
        ...
    ]
    def yamlBuilder = new groovy.yaml.YamlBuilder()
    yamlBuilder(nextdenovo_parameters)
    def yaml_content = yamlBuilder.toString().tokenize('\n').join("\n    ").replace(":", " =").replace('"','')
    """
    cat <<- END_YAML_PARAMS > nextdenovo.yml
    ${yaml_content}
    END_YAML_PARAMS
    """

Does this make sense?

nschan avatar Jun 24 '25 08:06 nschan

Thank you for those clarifications, I see what you mean. So we would have a config in the module dir, and if users want to change the parameters they would have to change them in the module dir, instead of providing the file as an input each time they run it?

elmedjadjirayane avatar Jun 24 '25 08:06 elmedjadjirayane

If we want to be able to dynamically adjust resources for failed tasks, the user cannot provide a new config file between attempts, so the number of parallel tasks has to be set dynamically from task.cpus in the script.

nschan avatar Jun 24 '25 08:06 nschan

So, what works would be this in a file nextdenovo.config in the module dir:

params {
        nextDenovo_job_type = "local"
        nextDenovo_job_prefix = "nextDenovo"
        nextDenovo_task = "all"
        nextDenovo_rewrite = "yes"
        nextDenovo_deltmp = "yes"
        nextDenovo_input_type = "raw"
        nextDenovo_read_type = "ont"
        nextDenovo_input_fofn = "input.fofn"
        nextDenovo_read_cutoff = 1000
        nextDenovo_genome_size = "3g"
        nextDenovo_sort_options = "-m 50g -t 30"
        nextDenovo_minimap2_options_raw = "-t 8"
        nextDenovo_pa_correction = 5
        nextDenovo_correction_options = "-p 30"
        nextDenovo_minimap2_options_cns = "-t 8"
        nextDenovo_nextgraph_options = "-a 1"
}

this in the pipeline nextflow.config:

includeConfig 'modules/nf-core/nextdenovo/nextdenovo.config'

and this in the script:

nextdenovo_parameters = [
        job_type: params.nextDenovo_job_type,
        job_prefix: params.nextDenovo_job_prefix,
        task: params.nextDenovo_task,
        rewrite: params.nextDenovo_rewrite,
        deltmp: params.nextDenovo_deltmp,
        parallel_jobs: "${task.cpus}",
        input_type: params.nextDenovo_input_type,
        read_type: params.nextDenovo_read_type,
        input_fofn: params.nextDenovo_input_fofn,
        workdir: "${meta.id}_nxtd_work",
        read_cutoff: params.nextDenovo_read_cutoff,
        genome_size: params.nextDenovo_genome_size,
        sort_options: params.nextDenovo_sort_options,
        minimap2_options_raw: params.nextDenovo_minimap2_options_raw,
        pa_correction: params.nextDenovo_pa_correction,
        correction_options: params.nextDenovo_correction_options,
        minimap2_options_cns: params.nextDenovo_minimap2_options_cns,
        nextgraph_options: params.nextDenovo_nextgraph_options
    ]
    def yamlBuilder = new groovy.yaml.YamlBuilder()
    yamlBuilder(nextdenovo_parameters)
    def yaml_content = yamlBuilder.toString().tokenize('\n').join("\n    ").replace(":", " =").replace('"','')
    """
    cat <<- END_YAML_PARAMS > nextdenovo.yml
    ${yaml_content}
    END_YAML_PARAMS
    """

Things like ${task.cpus} cannot be stuffed into params that way, I think, because they are assigned later, when the params have already been resolved.
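
For context, here is a hedged sketch (process name, error strategy and values are illustrative, not part of this PR) of the kind of retry-based resource scaling that only works if parallel_jobs is derived from task.cpus inside the script rather than baked into a params file:

process {
    withName: 'NEXTDENOVO' {
        errorStrategy = 'retry'
        maxRetries    = 2
        cpus          = { 4     * task.attempt }
        memory        = { 16.GB * task.attempt }
    }
}

On each retry task.cpus and task.memory grow, and the script block picks the new values up automatically.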

nschan avatar Jun 24 '25 08:06 nschan

If we want to be able to dynamically adjust resources for failed tasks, the user cannot provide a new config file between attempts, so the number of parallel tasks has to be set dynamically from task.cpus in the script.

Ah yes, you are right, the issue is with the parallel_jobs parameter then, which we set here: echo "parallel_jobs = ${task.cpus}" >> conf.cfg. I think we can come up with a solution that only handles this parameter in the script, if it is the only one causing issues; the file can have a lot of optional parameters and I am not sure whether it's right to include all of them in the script.

elmedjadjirayane avatar Jun 24 '25 08:06 elmedjadjirayane

I think there are more parameters that need to be set during the pipeline run: genome_size seems important, read_type too? If nextdenovo uses sensible defaults when there is no config file, it might be ok to only echo some values into the config, but I think with assemblers there are many parameters that may need to be adjusted somehow. How would you adjust those during a pipeline run (with heterogeneous samples) if you can only provide one config file?

nschan avatar Jun 24 '25 08:06 nschan

I see, parameters are specific to a sample. Almost all parameters have default values, but those defaults aren't always suitable. I am not really experienced with Nextflow, so I think I will take your advice on this, haha.

elmedjadjirayane avatar Jun 24 '25 09:06 elmedjadjirayane

I am not sure my advice is the best strategy; please take a look at some modules in this repo that have complex configurations and think about the best way to go for this particular tool. The modules should be written in such a way that they can be used as broadly as possible. In many modules this is done via command-line args, which can become quite complex.
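
For comparison, a generic sketch of that command-line-args pattern (the tool name and options are hypothetical, this is not the proposed nextdenovo interface): the pipeline sets task.ext.args in its modules config and the module simply forwards it.

// in the pipeline's conf/modules.config
process {
    withName: 'SOME_TOOL' {
        ext.args = '--genome-size 1g --read-type ont'
    }
}

// in the module's script block
def args   = task.ext.args   ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
"""
some_tool $args --threads ${task.cpus} --out ${prefix}.out input.file
"""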

nschan avatar Jun 24 '25 09:06 nschan

Groovy has a TomlBuilder class btw: https://groovy-lang.org/processing-toml.html

You could have it such that some defaults are supplied from a file or a map (a param). The content is converted to a map outside the process, which can then be augmented with the meta map outside the process if necessary, and with the resource parameters within the module. The map is then written out using the TomlBuilder.

So the process input expects a val object which should be a map of settings. In the process, the cpus etc. get added/overwritten, and then it's written out to a TOML.

mahesh-panchal avatar Jun 24 '25 10:06 mahesh-panchal

Thanks @mahesh-panchal, using the correct builder is a much better approach than trying to use replace to turn the YAML into a TOML. However, I am unable to use groovy.toml.TomlBuilder() in Nextflow (cause: unable to resolve class groovy.toml.TomlBuilder). Have you done that before?

nschan avatar Jun 24 '25 12:06 nschan

You're right. It doesn't appear to be available. It seems it's still an incubating feature of Groovy 4.

The tidiest solution might be to write out a yaml and use something like dasel to convert it to toml. That way the files are still human readable and debuggable.

mahesh-panchal avatar Jun 24 '25 13:06 mahesh-panchal

That would add a dependency on dasel, I guess. Based on your suggestion of using a meta map, here is what I came up with:

script:
    nextdenovo_parameters = [
        job_type: meta.nextDenovo.job_type ?: "local",
        job_prefix: meta.nextDenovo.job_prefix ?: "nextDenovo",
        task: meta.nextDenovo.task ?: "all",
        rewrite: meta.nextDenovo.rewrite ?: "yes",
        deltmp: meta.nextDenovo.deltmp ?: "yes",
        parallel_jobs: meta.nextDenovo.parallel_jobs ?: "${task.cpus}",
        input_type: meta.nextDenovo.input_type ?: "raw",
        read_type: meta.nextDenovo.read_type ?: error('Please provide the read type for nextDenovo'),
        input_fofn: meta.nextDenovo.input_fofn ?: "input.fofn",
        workdir: meta.nextDenovo.workdir ?: "${meta.id}_nxtd_work",
        read_cutoff: meta.nextDenovo.read_cutoff ?: "1k",
        genome_size: meta.nextDenovo.genome_size ?: "1g",
        sort_options: meta.nextDenovo.sort_options ?: "-m ${task.memory.toGiga()}g -t ${task.cpus}",
        minimap2_options_raw: meta.nextDenovo.minimap2_options_raw ?: "-t ${task.cpus}",
        pa_correction: meta.nextDenovo.pa_correction ?: "3",
        correction_options: meta.nextDenovo.correction_options ?: "-p 15",
        minimap2_options_cns: meta.nextDenovo.minimap2_options_cns ?: "-t ${task.cpus}",
        nextgraph_options: meta.nextDenovo.nextgraph_options ?: "-a 1"
    ]
    def yamlBuilder = new groovy.yaml.YamlBuilder()
    yamlBuilder(nextdenovo_parameters)
    def yaml_content = yamlBuilder.toString().tokenize('\n').join("\n    ").replace(":", " =").replace('"','').replace("---\n    ", "")
    """
    cat <<- END_YAML_PARAMS > ${meta.id}_nextdenovo.yml
    ${yaml_content}
    END_YAML_PARAMS
    """

This means that a nextDenovo map in meta is required (even if empty). As it currently is, meta.nextDenovo also needs to contain read_type. I tested this with this channel:

Channel.of([[id: "test_1", nextDenovo:[read_type: "ont"]], "test_1"], [[id: "test_2", nextDenovo: [read_type: "hifi"]], "test_2"])

and that works, filling in defaults for everything. It errors if nextDenovo.read_type is not provided, which I think it should.

What do you think @elmedjadjirayane ?

Note: I don't know if nextDenovo is strict about the TOML sections / headers; if it is, that would need to be added, I guess.

nschan avatar Jun 24 '25 13:06 nschan

I would avoid that giant map of meta parameters. Ideally, set some minimal defaults like Quartonotebook does. Then add the input map to the default map.

    input:
    ...
    val cfg_params // Map of settings

    script:
    def nextdenovo_parameters = [
        job_type: "local",
        job_prefix: "nextDenovo",
        task: "all",
        rewrite: "yes",
        deltmp: "yes",
        parallel_jobs: "${task.cpus}",
        input_type: "raw",
        read_type: "hifi",
        input_fofn: "input.fofn",
        workdir: "${meta.id}_nxtd_work",
        read_cutoff: "1k",
        genome_size: "1g",
        sort_options: "-m ${task.memory.toGiga()}g -t ${task.cpus}",
        minimap2_options_raw: "-t ${task.cpus}",
        pa_correction: "3",
        correction_options: "-p 15",
        minimap2_options_cns: "-t ${task.cpus}",
        nextgraph_options: "-a 1"
    ] + cfg_params

Anything that's duplicated in cfg_params will overwrite what's in the defaults.
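
As a quick illustration of that map addition (values are just examples):

def defaults   = [read_type: "hifi", genome_size: "1g", pa_correction: "3"]
def cfg_params = [read_type: "ont"]
assert (defaults + cfg_params) == [read_type: "ont", genome_size: "1g", pa_correction: "3"]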

Also, it's against the guidelines to define any meta fields in modules other than id and single_end, in order to give developers flexibility.

mahesh-panchal avatar Jun 24 '25 14:06 mahesh-panchal

Also, if something is optional, leave it out of the defaults, but add an assertion for fields that are required in some combination. It should then come in through the input map.

mahesh-panchal avatar Jun 24 '25 14:06 mahesh-panchal

That's likely a better approach; I guess I misunderstood regarding the meta map for those params. It makes sense to supply them as a separate channel. What about something that is required (read_type maybe)?

nschan avatar Jun 24 '25 14:06 nschan

You can either add it as a mandatory input, and/or test that it's defined in the script:

input:
...
val read_type

script:
...
    read_type: read_type 
...
if( !nextdenovo_parameters.read_type ){
    error ...
}

That'll make it explicit, but it could be part of the cfg_params and described in the meta.yml.

mahesh-panchal avatar Jun 24 '25 14:06 mahesh-panchal