Add nextdenovo to nf-core modules.
PR checklist
Closes #XXX
- [ ] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [x] If you've added a new tool - have you followed the module conventions in the contribution docs
- [ ] If necessary, include test data in your PR.
- [ ] Remove all TODO statements.
- [x] Emit the versions.yml file.
- [ ] Follow the naming conventions.
- [ ] Follow the parameters requirements.
- [ ] Follow the input/output options guidelines.
- [ ] Add a resource label
- [x] Use BioConda and BioContainers if possible to fulfil software requirements.
- Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
- For modules:
  - [x] nf-core modules test <MODULE> --profile docker
  - [ ] nf-core modules test <MODULE> --profile singularity
  - [x] nf-core modules test <MODULE> --profile conda
- For subworkflows:
  - [ ] nf-core subworkflows test <SUBWORKFLOW> --profile docker
  - [ ] nf-core subworkflows test <SUBWORKFLOW> --profile singularity
  - [ ] nf-core subworkflows test <SUBWORKFLOW> --profile conda
I'm concerned that this test is failing on docker only
Maybe it is failing in docker because docker is somehow limiting memory for the parallel tasks?
Hello @elmedjadjirayane, this seems to be somewhat stale, are there things blocking you that you need help resolving?
Hello, I was away these last days. For now there is the issue with docker that I don’t understand: when I test locally the test passes, but here it fails; not sure how to debug this, to be honest.
I would suggest modifying the test config to run fewer tasks in parallel and see if that helps with the OOM error in the docker tests.
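Something along these lines in the module's test config could work (just a sketch; the selector name and the numbers are placeholders):
process {
    withName: 'NEXTDENOVO' {
        // capping cpus also caps parallel_jobs, since the module sets it from task.cpus
        cpus   = 2
        memory = 6.GB
    }
}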
@nschan thanks! I think it worked?
Nice, that seems to have solved the test problem.
It has been a while since this started, and I am not sure what we decided on in the end; I think we had agreed that configs for the module are to be provided by the pipeline / person using the module, is that correct?
Yes, I think we had agreed to that!
Ok, maybe it would be good to provide some example config. To me it seems like correctly generating a config (e.g. putting in the right number of cpus) is some effort, and if you have solved that already maybe it would be good to have that included somewhere?
Where do we need to provide that? Would it be the test config we are using or maybe you mean something else?
I am also not completely sure what would be the best way; there are other tools with configs, for example https://github.com/nf-core/modules/tree/master/modules/nf-core/getorganelle seems to have a process to generate configs. My main concern (but maybe I am wrong) is that for something like ${task.cpus} to actually have the correct value, the file will need to be echoed or something? The other concern is to avoid making every person using that module go hunt for the default config to copy somewhere into the folder. I think you could also use $moduleDir to place a config inside the module dir, or some subdirectory of it.
It was suggested by @mahesh-panchal to essentially dump a yaml from parameters, which seems like an easy and also transparent solution to me (see here https://github.com/nf-core/modules/blob/master/modules/nf-core/quartonotebook/main.nf). Let me know if that makes sense to you.
Do you mean that the entity using the module will not have to provide that file? I don't understand how we would generate it automatically. I think the idea behind this file is to avoid typing a lot of parameters in a command line and regroup them in one file that we can modify. I am sorry maybe I am not seeing it from the right angle.
The issue that I see is: How can resources be adjusted dynamically upon retries, if the file is provided? If the file contains ${task.cpus} I don't think that this will be substituted with the variable value. What you could do (I am not very good at groovy, there is probably a smarter way) is something like this:
script:
nextdenovo_parameters = [
job_type: "local",
job_prefix: "nextDenovo",
task: "all",
rewrite: "yes",
deltmp: "yes",
parallel_jobs: "${task.cpus}",
input_type: "raw",
read_type: "ont",
input_fofn: "input.fofn",
workdir: "${meta.id}_work",
read_cutoff: 1000,
genome_size: "3g",
sort_options: "-m 50g -t 30",
minimap2_options_raw: "-t 8",
pa_correction: 5,
correction_options: "-p 30",
minimap2_options_cns: "-t 8",
nextgraph_options: "-a 1",
]
def yamlBuilder = new groovy.yaml.YamlBuilder()
yamlBuilder(nextdenovo_parameters)
def yaml_content = yamlBuilder.toString().tokenize('\n').join("\n ").replace(":", " =").replace('"','')
"""
cat <<- END_YAML_PARAMS > nextdenovo.yml
${yaml_content}
END_YAML_PARAMS
"""
this creates a file with this content (meta.id was "test"):
---
job_type = local
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 1
input_type = raw
read_type = ont
input_fofn = input.fofn
workdir = test_work
read_cutoff = 1000
genome_size = 3g
sort_options = -m 50g -t 30
minimap2_options_raw = -t 8
pa_correction = 5
correction_options = -p 30
minimap2_options_cns = -t 8
nextgraph_options = -a 1
which would allow the config parameters to be edited directly in the script.
I think this is still not optimal; ideally all of these would go into params for this module, so they can be edited without having to modify the module file. If this is part of the script block, everyone would need to patch main.nf for this script in their pipeline.
I think this could go into nextflow.config in the module dir:
params {
nextDenovo_job_type = "local"
nextDenovo_job_prefix = "nextDenovo"
...
}
and the script would become:
script:
nextdenovo_parameters = [
job_type: params.nextDenovo_job_type,
job_prefix: params.nextDenovo_job_prefix,
.
]
def yamlBuilder = new groovy.yaml.YamlBuilder()
yamlBuilder(nextdenovo_parameters)
def yaml_content = yamlBuilder.toString().tokenize('\n').join("\n ").replace(":", " =").replace('"','')
"""
cat <<- END_YAML_PARAMS > nextdenovo.yml
${yaml_content}
END_YAML_PARAMS
"""
Does this make sense?
Thank you for those clarifications. I see what you mean. So we would have a config in the module dir, and if the user wants to change the parameters they will have to change that in the module dir, instead of giving it as an input each time they run it?
If we want to be able to dynamically adjust resources for failed tasks, the user is unable to provide a new config file between attempts and the setting of the number of tasks has to be done somewhat dynamically based on task.cpus in the script.
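To illustrate what I mean, dynamic adjustment on the pipeline side would typically look like this (just a sketch, the numbers are placeholders):
process {
    withName: 'NEXTDENOVO' {
        errorStrategy = 'retry'
        maxRetries    = 2
        // cpus and memory are re-evaluated on every attempt, so the module has to read task.cpus at run time
        cpus          = { 4 * task.attempt }
        memory        = { 16.GB * task.attempt }
    }
}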
So, what works would be this in a file nextdenovo.config in the module dir:
params {
nextDenovo_job_type = "local"
nextDenovo_job_prefix = "nextDenovo"
nextDenovo_task = "all"
nextDenovo_rewrite = "yes"
nextDenovo_deltmp = "yes"
nextDenovo_input_type = "raw"
nextDenovo_read_type = "ont"
nextDenovo_input_fofn = "input.fofn"
nextDenovo_read_cutoff = 1000
nextDenovo_genome_size = "3g"
nextDenovo_sort_options = "-m 50g -t 30"
nextDenovo_minimap2_options_raw = "-t 8"
nextDenovo_pa_correction = 5
nextDenovo_correction_options = "-p 30"
nextDenovo_minimap2_options_cns = "-t 8"
nextDenovo_nextgraph_options = "-a 1"
}
this in the pipeline nextflow.config:
includeConfig 'modules/nf-core/nextdenovo/nextdenovo.config'
and this in the script:
nextdenovo_parameters = [
job_type: params.nextDenovo_job_type,
job_prefix: params.nextDenovo_job_prefix,
task: params.nextDenovo_task,
rewrite: params.nextDenovo_rewrite,
deltmp: params.nextDenovo_deltmp,
parallel_jobs: "${task.cpus}",
input_type: params.nextDenovo_input_type,
read_type: params.nextDenovo_read_type,
input_fofn: params.nextDenovo_input_fofn,
workdir: "${meta.id}_nxtd_work",
read_cutoff: params.nextDenovo_read_cutoff,
genome_size: params.nextDenovo_genome_size,
sort_options: params.nextDenovo_sort_options,
minimap2_options_raw: params.nextDenovo_minimap2_options_raw,
pa_correction: params.nextDenovo_pa_correction,
correction_options: params.nextDenovo_correction_options,
minimap2_options_cns: params.nextDenovo_minimap2_options_cns,
nextgraph_options: params.nextDenovo_nextgraph_options
]
def yamlBuilder = new groovy.yaml.YamlBuilder()
yamlBuilder(nextdenovo_parameters)
def yaml_content = yamlBuilder.toString().tokenize('\n').join("\n ").replace(":", " =").replace('"','')
"""
cat <<- END_YAML_PARAMS > nextdenovo.yml
${yaml_content}
END_YAML_PARAMS
"""
Things like ${task.cpus} cannot be stuffed into params that way, I think, because they only get assigned later, when the params have already been resolved.
If we want to be able to dynamically adjust resources for failed tasks, the user is unable to provide a new config file between attempts and the setting of the number of tasks has to be done somewhat dynamically based on task.cpus in the script.
Ah yes, you are right, the issue is with the parallel_jobs parameter then? Which we set here: echo "parallel_jobs = ${task.cpus}" >> conf.cfg. I think we can come up with a solution that only includes this parameter, if it is the only one causing issues; the file can have a lot of optional parameters and I am not sure whether it's right to include all of them in the script?
I think there are more parameters that need to be set during the pipeline run: genome_size seems important, read_type too? If nextdenovo uses some defaults when there is no config file it might be ok to only echo some into the config, but I think with assemblers there are many parameters that may need to be adjusted somehow. How would you adjust those during a pipeline run (with heterogeneous samples) if you can only provide one config file?
I see, parameters are specific to a sample. Almost all parameters have default values but those default values aren't always suitable. I am not really experienced with nextflow so I think I will take your advice on this haha.
I am not sure if my advice is the best strategy, please take a look at some modules in this repo that have complex configurations and think about which is the best way to go for this particular tool. The modules should be written in such a way that they can be used as broadly as possible. In many modules this is done via command line args, and can become quite complex.
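For reference, the usual way this is handled in many modules is task.ext.args, configured from the pipeline's modules.config; a rough sketch (the process and option names below are made up):
In the module script:
script:
def args = task.ext.args ?: ''
"""
some_tool ${args} input.fofn
"""
and in the pipeline's modules.config:
process {
    withName: 'SOME_TOOL' {
        ext.args = '--genome-size 1g'
    }
}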
Groovy has a TomlBuilder class btw: https://groovy-lang.org/processing-toml.html
You could have it such that some defaults could be supplied from a file or map (param). The content is converted to a map outside the process, which can then be augmented with the meta map outside the process if necessary, and within the module for the resource parameters. The map is then written out using the TomlBuilder.
So the process input expects a val object which should be a map of settings. In the process, the cpus, etc get added/overwritten and then it's written to a toml.
Thanks @mahesh-panchal, using the correct builder is a much better approach than trying to use replace to make the yaml into a toml; however, I am unable to use groovy.toml.TomlBuilder() in nextflow (cause: unable to resolve class groovy.toml.TomlBuilder). Have you done that before?
You're right. It doesn't appear to be available. It seems it's still an incubating feature of Groovy 4.
The tidiest solution might be to write out a yaml and use something like dasel to convert it to toml. That way the files are still human readable and debuggable.
That would add a dependency for dasel I guess.
Based on your suggestion of using a meta map, here is what I came up with:
script:
nextdenovo_parameters = [
job_type: meta.nextDenovo.job_type ?: "local",
job_prefix: meta.nextDenovo.job_prefix ?: "nextDenovo",
task: meta.nextDenovo.task ?: "all",
rewrite: meta.nextDenovo.rewrite ?: "yes",
deltmp: meta.nextDenovo.deltmp ?: "yes",
parallel_jobs: meta.nextDenovo.parallel_jobs ?: "${task.cpus}",
input_type: meta.nextDenovo.input_type ?: "raw",
read_type: meta.nextDenovo.read_type ?: error('Please provide the read type for nextDenovo'),
input_fofn: meta.nextDenovo.input_fofn ?: "input.fofn",
workdir: meta.nextDenovo.workdir ?: "${meta.id}_nxtd_work",
read_cutoff: meta.nextDenovo.read_cutoff ?: "1k",
genome_size: meta.nextDenovo.genome_size ?: "1g",
sort_options: meta.nextDenovo.sort_options ?: "-m ${task.memory.toGiga()}g -t ${task.cpus}",
minimap2_options_raw: meta.nextDenovo.minimap2_options_raw ?: "-t ${task.cpus}",
pa_correction: meta.nextDenovo.pa_correction ?: "3",
correction_options: meta.nextDenovo.correction_options ?: "-p 15",
minimap2_options_cns: meta.nextDenovo.minimap2_options_cns ?: "-t ${task.cpus}",
nextgraph_options: meta.nextDenovo.nextgraph_options ?: "-a 1"
]
def yamlBuilder = new groovy.yaml.YamlBuilder()
yamlBuilder(nextdenovo_parameters)
def yaml_content = yamlBuilder.toString().tokenize('\n').join("\n ").replace(":", " =").replace('"','').replace("---\n ", "")
"""
cat <<- END_YAML_PARAMS > ${meta.id}_nextdenovo.yml
${yaml_content}
END_YAML_PARAMS
"""
This means that a nextDenovo map in meta is required (even if empty). As it currently is, meta.nextDenovo also needs to contain read_type. I tested this with this channel:
Channel.of([[id: "test_1", nextDenovo:[read_type: "ont"]], "test_1"], [[id: "test_2", nextDenovo: [read_type: "hifi"]], "test_2"])
and that works, filling in defaults for everything. It errors if nextDenovo.read_type is not provided, which I think it should.
What do you think @elmedjadjirayane ?
Note: I don't know if nextDenovo is strict about the toml sections / headers; if it is, that would need to be added I guess.
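If it turns out sections are needed, the run.cfg layout from the NextDenovo docs is roughly the one below, so the heredoc would have to group the parameters accordingly (section names taken from the docs, worth double-checking):
[General]
job_type = local
read_type = ont
input_fofn = input.fofn
...
[correct_option]
read_cutoff = 1k
genome_size = 1g
...
[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1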
I would avoid that giant map of meta parameters. Ideally, set some minimal defaults like Quartonotebook does. Then add the input map to the default map.
input:
...
val cfg_params // Map:
script:
def nextdenovo_parameters = [
job_type: "local",
job_prefix: "nextDenovo",
task: "all",
rewrite: "yes",
deltmp: "yes",
parallel_jobs: "${task.cpus}",
input_type: "raw",
read_type: "hifi",
input_fofn: "input.fofn",
workdir: "${meta.id}_nxtd_work",
read_cutoff: "1k",
genome_size: "1g",
sort_options: "-m ${task.memory.toGiga()}g -t ${task.cpus}",
minimap2_options_raw: "-t ${task.cpus}",
pa_correction: "3",
correction_options: "-p 15",
minimap2_options_cns: "-t ${task.cpus}",
nextgraph_options: "-a 1"
] + cfg_params
Anything that's duplicated in cfg_params will overwrite what's in the defaults.
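For example, in plain Groovy (just to show the merge semantics):
// map addition: keys from the right-hand map win, everything else keeps the default
def defaults   = [read_type: 'hifi', parallel_jobs: 4]
def cfg_params = [read_type: 'ont']
assert defaults + cfg_params == [read_type: 'ont', parallel_jobs: 4]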
Also, it's against the guidelines to define any meta fields in modules other than id and single_end, to allow developers flexibility.
Also, if something is optional, then leave it out of the defaults, but make an assertion for fields that are required in some combination. It should then come in through the input map.
That's likely a better approach; I guess I misunderstood regarding the meta map for those params. It makes sense to supply them as a separate channel. What about something that is required (read_type maybe)?
You can either add it as a mandatory input, and/or test it's defined in the script:
input:
...
val read_type
script:
...
read_type: read_type
...
if( !nextdenovo_parameters.read_type ){
error ...
}
That'll make it explicit, but it could be part of the cfg_params and described in the meta.yml.
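If read_type comes in through cfg_params, the meta.yml entry could look roughly like this (sketch only; the exact layout should follow the current module template):
input:
  - cfg_params:
      type: map
      description: Map of NextDenovo settings. Must contain read_type; any other recognised key overrides the module defaults.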