flowcraft
Add optional extra params to processes
We need an option for passing extra params that the process does not currently cover, since it would let users customise the process.
A possible solution is to add these options to the self.params attribute of each process:
self.params = {
    "krakenDB": {
        "default": "'minikraken_20171013_4GB'",
        "description": "Specifies kraken database."
    },
    "krakenExtra": {
        "default": "",
        "description": "Add any extra params"
    }
}
Note that the name of the param will have to indicate which process it refers to. A way to overcome this is to create a new attribute, self.extraParam, but it might overcomplicate things. The solution presented above is simple, although a bit laborious.
What is the difference between self.params and self.extraParam? It is not clear to me why we would need a different attribute instead of just adding any additional parameters.
Some executions may require more parameters than the ones described in the flowcraft process. For instance, bowtie2 has a lot of parameters that aren't available in our current flowcraft processes. When the program has default values for each parameter the issue is reduced, but when the parameter is just true or false with no default behavior this gets more complicated.
For example:
my_fancy.py --foo 0
but now I want a second parameter to be added:
my_fancy.py --foo 0 --bar
The --bar param is not defined in the flowcraft process, so I will not be able to use it unless I edit the generated nextflow script. With some kind of extra parameters option, either the one that is currently possible or the one suggested by @cimendes, it will be possible to pass a string containing all my extra parameters through that nextflow param.
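To make the idea concrete, here is a minimal Python sketch of appending a free-form extra-params string to the arguments a process already defines. The build_command helper and its arguments are hypothetical and not part of flowcraft; it only illustrates the string-passing approach described above.

```python
import shlex


def build_command(executable, defined_args, extra_params=""):
    """Append a free-form extra-params string to a command's known arguments.

    Hypothetical helper for illustration only; not flowcraft API.
    """
    # shlex.split tokenises the string the way a POSIX shell would,
    # so a quoted value such as --name "a b" stays a single argument.
    return [executable, *defined_args, *shlex.split(extra_params)]


# The process only knows about --foo; --bar arrives through the extra string.
cmd = build_command("my_fancy.py", ["--foo", "0"], "--bar")
print(shlex.join(cmd))
```

An empty extra string leaves the command untouched, so the default of "" shown for krakenExtra would be a no-op under this scheme.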
I agree that this is something that needs to be addressed and added to flowcraft. Not only the ability to provide extra inputs, but also to specify different parameters for the same component when it exists multiple times in a pipeline (e.g., in a fork).
Currently, the problem is that if two or more components have the same parameter, this parameter will appear only once in params.config and it will be the same for all components that use it (unlike process directives, which are always process-specific). This problem goes all the way back to the nextflow templates themselves, where params are used like:
// abricate template
input:
set sample_id, file(assembly) from {{ input_channel }}
each db from params.abricateDatabases
If there are two or more abricate components in the pipeline, they will all use the same params.abricateDatabases.
A simple solution would be to use the pid placeholder in the parameters:
// abricate template
input:
set sample_id, file(assembly) from {{ input_channel }}
each db from params.abricateDatabases_{{pid}}
Which would result in a params.config like:
params {
abricateDatabases_2_3 = (...)
abricateDatabases_2_4 = (...)
}
However, we would have to change most of the existing templates, and add a new rule for using parameters when building new processes. This system would then be used both for regular params and for extra params.
If you have any other suggestions, please feel free to share them.
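As a rough illustration of the pid placeholder suggested above, the Python sketch below fills in one template line for two abricate components. Plain string replacement stands in for the real template engine, and the render function is a hypothetical name used only for this example.

```python
# One line from the abricate template, with the proposed pid placeholder.
TEMPLATE = "each db from params.abricateDatabases_{{pid}}"


def render(template, pid):
    """Fill in the pid placeholder (stand-in for the real template engine)."""
    return template.replace("{{pid}}", pid)


# Two abricate components in the same pipeline now read different params.
for pid in ("2_3", "2_4"):
    print(render(TEMPLATE, pid))
```

Each component ends up reading its own parameter, which is what makes separate entries in params.config possible.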
Recently I've run into a similar issue where remove_host and patlasMapping share the same parameter name and input channel. Currently, flowcraft overrides the remove_host parameter with the patlas one. @tiagofilipe12 and I have been discussing the best solution to fix both this issue and the extra parameters one. The only way we see to avoid conflicts when assigning a parameter to a process is to include both the process and its pid, like what @ODiogoSilva suggests above: {{process}}_paramName_{{pid}}
This is an urgent change, as right now all parameters must have a unique name, even when they relate to the same software.
I think this could be done using only the existing {{pid}} variable in the templates, but with a small change in the structure of params.config and the output of the --help option.
Instead of the parameters appearing as:
params {
adapters_1_4 = ...
abricateDatabases_2_8 = ...
}
which could be hard to read in larger pipelines with many parameters, we could simply group them under a comment naming the component:
params {
// FastQC 1_4
adapters_1_4 = ...
// Abricate 2_8
abricateDatabases_2_8 = ...
}
And a similar structure in the --help output. With this configuration, the only changes we would need to make are to append the {{pid}} to each parameter in the templates and to change the engine in the _get_params_string and _update_secondary_inputs methods to use <param>_<pid>.
It requires the user to consult the DAG to know which process the pid refers to, but the same would happen for processes with the same name (in a fork, for example).
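A minimal Python sketch of the grouped layout, assuming each component can be described by a label, a pid and a dict of params. The render_grouped_config function and the parameter values are hypothetical placeholders; this is not the actual _get_params_string implementation.

```python
def render_grouped_config(components):
    """Render a params block with one comment header per component.

    `components` is a list of (label, pid, params) tuples, a hypothetical
    stand-in for the pipeline's process objects.
    """
    lines = ["params {"]
    for label, pid, params in components:
        # Comment header so the reader can tell which component owns the pid
        lines.append(f"    // {label} {pid}")
        for name, value in params.items():
            lines.append(f"    {name}_{pid} = {value}")
    lines.append("}")
    return "\n".join(lines)


# Parameter values below are illustrative placeholders only.
components = [
    ("FastQC", "1_4", {"adapters": "'None'"}),
    ("Abricate", "2_8", {"abricateDatabases": '["resfinder"]'}),
]
print(render_grouped_config(components))
```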
I have added the system for the unique parameters per component (unique_params branch) and the corresponding documentation.
The biggest changes are:
- Addition of the {{ param_id }} placeholder in the nextflow templates.
- The deprecation of the secondary_inputs attribute (Channel creation from params can be done directly in the nextflow template).
- A new build option, --merge-params.
By default, new pipelines will now be built with independent parameters for each component. If the --merge-params option is used when building the pipeline, identical parameters across components will be merged into a single parameter (which is the parameter system we currently have).
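The difference between the two modes can be sketched in Python as follows. The collect_params function is hypothetical, and the choice that the first component to declare a name keeps its value when merging is an assumption made for this example, not something confirmed for the unique_params branch.

```python
def collect_params(components, merge_params=False):
    """Collect parameters either per component or merged by name.

    `components` is a list of (pid, params) tuples (hypothetical structure).
    """
    collected = {}
    for pid, params in components:
        for name, value in params.items():
            key = name if merge_params else f"{name}_{pid}"
            # Assumption: when merging, the first declaration of a name wins.
            collected.setdefault(key, value)
    return collected


# Parameter values below are illustrative placeholders only.
components = [
    ("2_3", {"abricateDatabases": '["resfinder"]'}),
    ("2_4", {"abricateDatabases": '["card"]'}),
]
print(collect_params(components))                     # independent (default)
print(collect_params(components, merge_params=True))  # merged
```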