nextflow icon indicating copy to clipboard operation
nextflow copied to clipboard

Config parser (and loader)

Open bentsherman opened this issue 5 months ago • 7 comments

Close #2723

This PR explores one possible solution to the config overhaul by implementing a custom parser for config files, using a similar approach as #4613.

The linked issue initially proposed adopting a new config format. However, most of the proposed alternatives do not have all of the capabilities that we need out of the box or have their own downsides. I think the actual problem is that the config file has been stretched to cover use cases for which it wasn't originally designed.

Between this PR, the module config (#4510), and moving the params definition from the config file to the params schema, I think this will relieve enough pressure from the config file that we can keep it.

This custom parser enforces a much simpler syntax:

  • Only three statements are allowed: assignments, blocks, and config inclusion.
  • The left-hand side of an assignment can only be a path expression (foo.bar.baz ...).
  • The right-hand side can be an arbitrary expression, including closures.
  • Within a closure expression, arbitrary Groovy code is still allowed, although right now there is only a minimal subset in the parser.

As a result, the config parser is much simpler and the code is easier to read. There is no more dependence on metaclasses, instead there is a simple builder DSL for executing a config file and producing a map.

The vast majority of existing configuration is still supported, with only a few minor changes needed:

  • Include statements are sometimes wrapped in a try-catch to prevent a missing config file from failing the pipeline. For this we can either provide both includeConfig and requireConfig, or demote the error to a warning when the config file is remote

  • Custom functions like check_max are no longer supported. Instead, I suggest that we import the lib directory into config files as we do for scripts, so that these helper functions can simply be defined there

  • Conditional statements are no longer supported. With module config they should no longer be needed

  • Params should be defined only in nextflow_schema.json to prevent issues like #2662 where a param is referenced before params are fully resolved. More of a best practice than a hard requirement. We can still support params defined in the config (likely with a warning), and overriding param default values in a profile is still a valid use case.

All of these issues can be resolved independently of this PR, in case we need to smooth the migration process.

It is still possible to execute arbitrary code with side effects, just not as easily as before. We can't really avoid this if we want to support dynamic expressions:

foo.bar = "${println 'pwned'}"

TODO:

  • [ ] add custom lazy resolution of params
  • [ ] review subset of Groovy that is supported within closures
  • [ ] support lib directory for using helper functions in config files
  • [ ] explore Java security manager to prevent side effects in dynamic strings

bentsherman avatar Feb 14 '24 22:02 bentsherman

Deploy Preview for nextflow-docs-staging ready!

Name Link
Latest commit a6d89cdeaa31b12df54bb7eb44ca419943c4b68a
Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/66157b72ef7761000824c06f
Deploy Preview https://deploy-preview-4744--nextflow-docs-staging.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Feb 14 '24 22:02 netlify[bot]

Research updates:

  • Groovy compiler provides an interface to use a custom parser (in order to ease the transition from Antlr2 to Antlr4), this can be used to inject our custom parser without much hassle
  • Config schema effort can be pursued independently of the parser, see #4201
  • Might be able to use the Java security manager to prevent execution of unsafe code in the config file such as file I/O, subprocesses

bentsherman avatar Mar 05 '24 09:03 bentsherman

Here's an example to illustrate how the config parser "executes" the config file. Given this config:

foo.bar = 'hello'

foo {
  bar = 'hello'
}

includeConfig 'foobar.config'

The parser transforms this code into the following:

assign(['foo', 'bar'], 'hello')

block('foo', {
  assign(['bar'], 'hello')
}

includeConfig('foobar.config')

Basically each statement is translated to a DSL method (see ConfigDsl). This "hidden DSL" allows us to control how each statement is executed. In other words, the config parser + hidden DSL implement a simple interpreter for the Nextflow config language, separating the config language from the config execution.

I think this is a good model for how we might separate the Nextflow language from the runtime. Instead of hooking directly into Groovy methods and relying on Groovy MOP, metaclass, etc, we can introduce a hidden DSL which allows us to better control how a given statement (e.g. invoking a process in a workflow) is executed by the runtime.

bentsherman avatar Mar 08 '24 05:03 bentsherman

Java security manager is being removed. I will keep an eye out for other sandboxing techniques, but in the meantime, a simple solution is to provide a "safe" execution mode where assignment expressions are wrapped in string literals.

bentsherman avatar Mar 11 '24 13:03 bentsherman

Really like the way that this is going, great work 👏🏻

ewels avatar Apr 07 '24 07:04 ewels