nextflow
nextflow copied to clipboard
Config parser (and loader)
Close #2723
This PR explores one possible solution to the config overhaul by implementing a custom parser for config files, using a similar approach as #4613.
The linked issue initially proposed adopting a new config format. However, most of the proposed alternatives do not have all of the capabilities that we need out of the box or have their own downsides. I think the actual problem is that the config file has been stretched to cover use cases for which it wasn't originally designed.
Between this PR, the module config (#4510), and moving the params definition from the config file to the params schema, I think this will relieve enough pressure from the config file that we can keep it.
This custom parser enforces a much simpler syntax:
- Only three statements are allowed: assignments, blocks, and config inclusion.
- The left-hand side of an assignment can only be a path expression (
foo.bar.baz ...
). - The right-hand side can be an arbitrary expression, including closures.
- Within a closure expression, arbitrary Groovy code is still allowed, although right now there is only a minimal subset in the parser.
As a result, the config parser is much simpler and the code is easier to read. There is no more dependence on metaclasses, instead there is a simple builder DSL for executing a config file and producing a map.
The vast majority of existing configuration is still supported, with only a few minor changes needed:
-
Include statements are sometimes wrapped in a try-catch to prevent a missing config file from failing the pipeline. For this we can either provide both
includeConfig
andrequireConfig
, or demote the error to a warning when the config file is remote -
Custom functions like
check_max
are no longer supported. Instead, I suggest that we import thelib
directory into config files as we do for scripts, so that these helper functions can simply be defined there -
Conditional statements are no longer supported. With module config they should no longer be needed
-
Params should be defined only in
nextflow_schema.json
to prevent issues like #2662 where a param is referenced before params are fully resolved. More of a best practice than a hard requirement. We can still support params defined in the config (likely with a warning), and overriding param default values in a profile is still a valid use case.
All of these issues can be resolved independently of this PR, in case we need to smooth the migration process.
It is still possible to execute arbitrary code with side effects, just not as easily as before. We can't really avoid this if we want to support dynamic expressions:
foo.bar = "${println 'pwned'}"
TODO:
- [ ] add custom lazy resolution of params
- [ ] review subset of Groovy that is supported within closures
- [ ] support
lib
directory for using helper functions in config files - [ ] explore Java security manager to prevent side effects in dynamic strings
Deploy Preview for nextflow-docs-staging ready!
Name | Link |
---|---|
Latest commit | a6d89cdeaa31b12df54bb7eb44ca419943c4b68a |
Latest deploy log | https://app.netlify.com/sites/nextflow-docs-staging/deploys/66157b72ef7761000824c06f |
Deploy Preview | https://deploy-preview-4744--nextflow-docs-staging.netlify.app |
Preview on mobile | Toggle QR Code...Use your smartphone camera to open QR code link. |
To edit notification comments on pull requests, go to your Netlify site configuration.
Research updates:
- Groovy compiler provides an interface to use a custom parser (in order to ease the transition from Antlr2 to Antlr4), this can be used to inject our custom parser without much hassle
- Config schema effort can be pursued independently of the parser, see #4201
- Might be able to use the Java security manager to prevent execution of unsafe code in the config file such as file I/O, subprocesses
Here's an example to illustrate how the config parser "executes" the config file. Given this config:
foo.bar = 'hello'
foo {
bar = 'hello'
}
includeConfig 'foobar.config'
The parser transforms this code into the following:
assign(['foo', 'bar'], 'hello')
block('foo', {
assign(['bar'], 'hello')
}
includeConfig('foobar.config')
Basically each statement is translated to a DSL method (see ConfigDsl
). This "hidden DSL" allows us to control how each statement is executed. In other words, the config parser + hidden DSL implement a simple interpreter for the Nextflow config language, separating the config language from the config execution.
I think this is a good model for how we might separate the Nextflow language from the runtime. Instead of hooking directly into Groovy methods and relying on Groovy MOP, metaclass, etc, we can introduce a hidden DSL which allows us to better control how a given statement (e.g. invoking a process in a workflow) is executed by the runtime.
Java security manager is being removed. I will keep an eye out for other sandboxing techniques, but in the meantime, a simple solution is to provide a "safe" execution mode where assignment expressions are wrapped in string literals.
Really like the way that this is going, great work 👏🏻