Proposal: Static types
This PR is a showcase of the proposed syntax for static types in Nextflow.
While I started with the goal of simply adding type annotations and type checking, I realized that many aspects of the language needed to be re-thought in order to provide a consistent developer experience. Some of these changes could be made now, but I suspect they would be more difficult without static types, so I have tried to show them in their "best form" in this PR.
Changes
- **Update to Nextflow 25.04.** See #347 for details.

- **Type annotations.** The following declarations can be annotated with a type:

  - workflow params/outputs
  - workflow takes/emits
  - process inputs/outputs
  - function parameters/return
  - local variables (generally not needed)

  Nextflow will use these type annotations to infer the type of every value in the workflow and make sure they are valid (see the sketch after this list).

  The main built-in types are:

  - `Integer`, `Float`, `Boolean`, `String`: primitive types
  - `Path`: file or directory
  - `List<E>`, `Set<E>`, `Bag<E>`: collections with various constraints on ordering and uniqueness
  - `Map<K,V>`: map of key-value pairs
  - `Channel<E>`: channel (i.e. queue channel)
  - `Value<V>`: dataflow value (i.e. value channel)
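For illustration, here is a minimal sketch of what annotated declarations could look like (the workflow and function names are made up, and the `->` return-type placement is my approximation rather than settled syntax):

```nextflow
// sketch: typed workflow takes/emits
workflow ALIGN {
  take:
  samples: Channel<Sample>    // queue channel of Sample records
  reference: Value<Path>      // value channel holding a single file

  main:
  let bams = samples.map(BWA) // type inferred as Channel<Path>

  emit:
  bams: Channel<Path>
}

// sketch: typed function parameters and return type
fn sample_id(sample: Sample) -> String {
  return sample.meta.id
}
```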
- **Records, optional types.** Types can be composed in several ways to facilitate domain modeling:

  - records: a combination of named values, e.g. a sample is a meta map AND some files: `record Sample { meta: Map ; files: List<Path> }`
  - optionals: any type can be suffixed with `?` to denote that it can be null (e.g. `String?`); otherwise it should never be null
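As a small sketch, a record with an optional field might be declared and used like this (the `strandedness` field and `describe` function are hypothetical):

```nextflow
record Sample {
  meta: Map
  files: List<Path>
  strandedness: String?   // optional: may be null
}

fn describe(sample: Sample) -> String {
  // an optional value should be handled before use, e.g. with a default
  let strand = sample.strandedness ?: 'unknown'
  return "${sample.meta.id}: ${sample.files.size()} files (${strand})"
}
```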
- **Define pipeline params in the main script.** Each param has a type. Complex types can be composed from collections, records, and enums. Rather than specifying a particular input format for input files, simply specify a type and Nextflow will use the type like a schema to transparently load from any source (CSV/JSON/etc). Config params are defined separately in the main config. `nextflow_schema.json` remains unchanged but will be partially generated from the main script / config.
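As a sketch, typed params in the main script might look something like this (the exact block syntax and enum declaration are my approximation):

```nextflow
// sketch: an enum and typed params declared in the main script
enum Strandedness {
  FORWARD, REVERSE, UNSTRANDED
}

params {
  samples: List<Sample>               // loaded transparently from CSV/JSON/etc
  strandedness: Strandedness          // enum param
  outdir: Path
  save_intermediates: Boolean = false // param with a default value
}
```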
- **Only use params in the entry workflow.** Params are not known outside the entry workflow; pass params into processes and workflows as explicit inputs instead.
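For example, a minimal sketch of an entry workflow that forwards params as explicit inputs (the `RNASEQ` workflow is illustrative):

```nextflow
workflow {
  main:
  // params are referenced only here, in the entry workflow
  let samples = channel.fromList(params.samples)
  // downstream workflows receive explicit inputs, never params directly
  RNASEQ(samples, params.strandedness)
}
```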
- **Processes are just functions.** Instead of calling a process directly with channels, use operators and supply the process name in place of an operator closure:

```nextflow
// execute FASTQC in parallel on each input file
channel.fromPath( "inputs/*.fastq" ).map(FASTQC)

// execute ACCUMULATE sequentially on each input file
// (replaces experimental recursion)
channel.fromPath( "inputs/*.txt" ).reduce(ACCUMULATE)
```
- **Simple operators.** Use a simple and composable set of operators:

  - `collect`: collect channel elements into a collection (i.e. bag)
  - `cross`: cross product of two channels
  - `filter`: filter a channel based on a condition
  - `flatMap`: nested scatter
  - `groupTuple`: nested gather
  - `join`: relational join of two channels (i.e. horizontal)
  - `map`: transform a channel
  - `mix`: concatenate multiple channels (i.e. vertical)
  - `reduce`: accumulate each channel element into a single value
  - `scan`: like `reduce` but emit each intermediate value
  - `subscribe`: invoke a function for each channel element
  - `view`: print each channel element
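A short sketch of how these operators compose (process and file names are illustrative):

```nextflow
channel.fromPath('inputs/*.fastq')
    .map(FASTQC)                              // run FASTQC on each file
    .filter { report -> report.size() > 0 }   // drop empty reports
    .collect()                                // gather all reports into a bag
    .view()                                   // print the result
```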
Benefits
- **Well-defined workflow inputs.** Workflow inputs are explicitly defined alongside the entry workflow as a set of name-type pairs (i.e. a record type). Complex params can be loaded transparently from any source (file, database, API, etc) as long as the runtime supports it. The JSON schema of a param is inferred from the param's type.

- **Well-defined workflow outputs.** Workflow outputs are explicitly defined as a set of name-type pairs (i.e. a record type). Each output can create an index file, which is essentially a serialization of a channel to external storage (file, database, API, etc), and each output can define how its published files are organized in a directory tree. The JSON schema of an output is inferred from the output's type.
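As a sketch, a typed output with an index file might look like this (modeled loosely on the existing workflow output DSL; the details here are assumptions, not the final syntax):

```nextflow
output {
  bams: Channel<Path> {
    path 'alignments'         // how published files are organized
    index {
      path 'alignments.csv'   // serialize the channel to an index file
    }
  }
}
```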
- **Make the pipeline import-able.** Separating the "core" workflow (i.e. `SRA`) from params and publishing makes it easy to import the pipeline into larger pipelines. See https://github.com/bentsherman/fetchngs2rnaseq for a more complete example.
- **Simpler dataflow logic.** Processes are called like an operator closure, generally with a single channel of maps where the map keys correspond to the process inputs. Additional inputs can be provided as named args. As a result, the amount of boilerplate in the workflow logic and process definition is significantly reduced.
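For example, a sketch of this calling convention (the named-arg form is my reading of the proposal, and `genome_index` is an illustrative value):

```nextflow
// one channel of records whose fields match the process inputs
let samples: Channel<Sample> = channel.fromList(params.samples)

// extra inputs are supplied as named args alongside the record
samples.map { sample -> ALIGN(sample, index: genome_index) }
```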
- **Simpler operator library.** With a minimal set of operators, users can easily determine which operator to use based on their needs. The operators listed above are statically typed and pertain only to stream operations.
- **Simpler process inputs/outputs.** Process inputs/outputs are declared in the same way as workflow takes/emits and pipeline params/outputs, instead of the old custom type qualifiers. Inputs of type `Path` are automatically staged. Thanks to the simplified dataflow logic described above, tuples are generally not needed.
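A sketch of a process declared this way (the output binding syntax is my approximation):

```nextflow
process ALIGN {
  input:
  sample: Sample   // record input; its Path fields are staged automatically
  index: Path      // staged into the task directory

  output:
  bam: Path = file("${sample.meta.id}.bam")   // approximate binding syntax

  script:
  """
  align --index ${index} ${sample.files.join(' ')} -o ${sample.meta.id}.bam
  """
}
```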
Extra Notes
This proposed syntax will be enabled by the following internal improvements:
- New script/config parser, which enables us to evolve the Nextflow language into whatever we want, without being constrained by Groovy syntax (though it still must compile to Groovy AST).

- Static analysis, which can infer the type of every value based on the declared types of pipeline/workflow/process inputs.

- Automatic generation of JSON schemas for workflow params and outputs based on type annotations. This preserves support for external tools like Seqera Platform and lays the groundwork to transparently support different connectors (CSV/JSON file, HTTP API, SQL database, etc).
FWIW, I have a small piece of feedback that came up after the previous discussion (which I wasn't aware of):

The `fair` keyword describes what is, to my knowledge, very often called "FIFO" (First-In, First-Out) in other contexts, and might have been a clearer name? (That said, perhaps not worth the change...)
@samuell I would say, just submit an issue for that, it is more of an API change than a syntax change
Reading through the suggestion in more detail now, I'm a little concerned about this one:
> The DAG will be constructed at compile-time instead of run-time, which will allow the DAG to be more comprehensive -- include params and how they connect to processes, include conditional pipeline code (e.g. if-else statements), allow `nextflow inspect` to list every container that might possibly be used, etc
In my experience, there are some use cases that require run-time generated DAGs, for example when initiating pipeline structure based on values extracted as part of the workflow.
This is common e.g. in machine learning, where you might run hyper-parameter tuning, which generates values that are sent to initialize downstream processes, but that might also influence how the DAG is generated downstream.
I've been writing about it before: https://bionics.it/posts/dynamic-workflow-scheduling
Not sure how well this applies here, but want to raise the flag about it, since it is a real limitation we have been running into with other pipeline systems (Luigi).
EDIT: Actually, I guess since we are almost definitely talking about the DAG of processes and not the DAG of tasks, a compile-time DAG would still not rule out all dynamic scheduling (since the dataflow paradigm of Nextflow does dynamic task scheduling inherently). Still, it seems some cases of dynamic scheduling might be affected: those that require the process DAG structure to be defined based on outcomes of previous computations.
@samuell yes, I'm talking about the process (i.e. "abstract") DAG, which Nextflow already constructs before executing the pipeline. But it has to execute the script in order to do this, which limits its usefulness.
How come there are new keywords (`let`, `fn`, etc.)? What's the difference from `def`?
Just another idea to consider. With a formal grammar, we don't have to adhere so closely to Groovy; we can make whatever syntax we want, as long as it can be translated to Groovy AST. So as a demonstration I have replaced `def` with more specific keywords: `fn` for function definition, `let` for a variable that can't be reassigned, `var` for a variable that can be reassigned (essentially `def` vs `final` in Groovy).
Notice I also changed how types are specified: `<name>: <type>` instead of `<type> <name>`, which I personally like because it emphasizes the semantic name over the type, which is optional.
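To illustrate the keywords and type placement together (a minimal sketch; the return-type placement is assumed):

```nextflow
fn greet(name: String) -> String {    // fn: function definition
  let prefix = 'Hello'                // let: cannot be reassigned (like final)
  var message = "${prefix}, ${name}"  // var: can be reassigned (like def)
  message = message + '!'
  return message
}
```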
Is it going to be problematic if people combine Groovy and this grammar? For example in `exec:` blocks.
It would apply to all Nextflow code, including `exec:` blocks
I understood it would apply to all, but my question was really whether people could mix grammars, and if so, what would happen, e.g.:

```nextflow
exec:
let some_var = do_stuff
def another_thing = do_other_stuff
```
We would either drop `def` in the next DSL version (a hard cut-off) or support it temporarily with a compiler warning