
Proposal: Static types

Open bentsherman opened this issue 1 year ago • 15 comments

This PR is a showcase of the proposed syntax for static types in Nextflow.

While I started with the goal of simply adding type annotations and type checking, I realized that many aspects of the language needed to be re-thought in order to provide a consistent developer experience. Some of these things can be done now, but I suspect they will be more difficult without static types, so I have tried to show them in their "best form" in this PR.

Changes

  • Update to Nextflow 25.04. See #347 for details.

  • Type annotations. The following declarations can be annotated with a type:

    • workflow params/outputs
    • workflow takes/emits
    • process inputs/outputs
    • function parameters/return
    • local variables (generally not needed)

    Nextflow will use these type annotations to infer the type of every value in the workflow and make sure they are valid (see the sketch after the list of built-in types below).

    The main built-in types are:

    • Integer, Float, Boolean, String: primitive types
    • Path: file or directory
    • List<E>, Set<E>, Bag<E>: collections with various constraints on ordering and uniqueness
    • Map<K,V>: map of key-value pairs
    • Channel<E>: channel (i.e. queue channel)
    • Value<V>: dataflow value (i.e. value channel)
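
    For illustration, a minimal sketch of how these annotations might look on a subworkflow. The workflow, channel names, and exact emit syntax here are hypothetical:

    // hypothetical subworkflow with typed takes and emits
    workflow ALIGN {
        take:
            reads: Channel<Path>     // queue channel of read files
            reference: Value<Path>   // value channel holding one reference file

        emit:
            bams: Channel<Path>      // annotated emit
    }
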
  • Records, optional types. Types can be composed in several ways to facilitate domain modeling (see the sketch after this list):

    • records: a combination of named values, e.g. a sample is a meta map AND some files: record Sample { meta: Map ; files: List<Path> }
    • optionals: any type can be suffixed with ? to denote that it can be null (e.g. String?), otherwise it should never be null
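
    As a concrete illustration of both (the field names are invented for the example):

    // a record composing sample metadata with its files
    record Sample {
        meta: Map<String,String>   // sample metadata
        files: List<Path>          // associated read files
        strandedness: String?      // optional: may be null when unknown
    }
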
  • Define pipeline params in the main script. Each param has a type. Complex types can be composed from collections, records, and enums. Rather than specifying a particular input format for input files, simply specify a type and Nextflow will use the type like a schema to transparently load from any source (CSV/JSON/etc). Config params are defined separately in the main config. nextflow_schema.json remains unchanged but will be partially generated from the main script / config.
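
    A sketch of what typed params might look like in the main script; the params block syntax shown here is an assumption based on this proposal, not final syntax:

    params {
        input: List<Sample>    // loaded from any source (CSV/JSON/etc) using the type as schema
        outdir: String
        max_samples: Integer?  // optional param, null when not provided
    }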

  • Only use params in entry workflow. Params are not known outside the entry workflow. Pass params into processes and workflows as explicit inputs instead.

  • Processes are just functions. Instead of calling a process directly with channels, use operators and supply the process name in place of an operator closure:

    // execute FASTQC in parallel on each input file
    channel.fromPath( "inputs/*.fastq" ).map(FASTQC)
    
    // execute ACCUMULATE sequentially on each input file
    // (replaces experimental recursion)
    channel.fromPath( "inputs/*.txt" ).reduce(ACCUMULATE)
    
  • Simple operators. Use a simple and composable set of operators (a sketch combining several of them follows this list):

    • collect: collect channel elements into a collection (i.e. bag)
    • cross: cross product of two channels
    • filter: filter a channel based on a condition
    • flatMap: nested scatter
    • groupTuple: nested gather
    • join: relational join of two channels (i.e. horizontal)
    • map: transform a channel
    • mix: concatenate multiple channels (i.e. vertical)
    • reduce: accumulate each channel element into a single value
    • scan: like reduce but emit each intermediate value
    • subscribe: invoke a function for each channel element
    • view: print each channel element
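
    For example, several of these operators can be composed into a small pipeline (TRIM and FASTQC stand in for real processes):

    channel.fromPath( "inputs/*.fastq" )
        .filter { fastq -> fastq.size() > 0 }   // drop empty files
        .map(TRIM)                              // run TRIM on each element in parallel
        .map(FASTQC)                            // then FASTQC on each result
        .collect()                              // gather all results into a bag
        .view()                                 // print the final collection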

Benefits

  • Well-defined workflow inputs. Workflow inputs are explicitly defined alongside the entry workflow as a set of name-type pairs (i.e. a record type). Complex params can be loaded transparently from any source (file, database, API, etc) as long as the runtime supports it. The JSON schema of a param is inferred from the param's type.

  • Well-defined workflow outputs. Workflow outputs are explicitly defined as a set of name-type pairs (i.e. a record type). Each output can create an index file, which is essentially a serialization of a channel to external storage (file, database, API, etc), and each output can define how its published files are organized in a directory tree. The JSON schema of an output is inferred from the output's type.

  • Make pipeline import-able. Separating the "core" workflow (i.e. SRA) from params and publishing makes it easy to import the pipeline into larger pipelines. See https://github.com/bentsherman/fetchngs2rnaseq for a more complete example.

  • Simpler dataflow logic. Processes are called like an operator closure, generally with a single channel of maps, where the map keys correspond to the process inputs. Additional inputs can be provided as named args. As a result, the amount of boilerplate in the workflow logic and process definition is significantly reduced.
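
    A hypothetical sketch of this calling convention; the process name, map keys, and named-arg syntax are all assumptions for illustration:

    // ALIGN declares inputs 'meta', 'reads', and 'index'; the channel supplies
    // maps with 'meta' and 'reads', while 'index' is passed once as a named arg
    samples
        .map { sample -> [ meta: sample.meta, reads: sample.files ] }
        .map(ALIGN, index: star_index)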

  • Simpler operator library. With a minimal set of operators, users can easily determine which operator to use based on their needs. The operators listed above are statically typed and pertain only to stream operations.

  • Simpler process inputs/outputs. Process inputs/outputs are declared in the same way as workflow takes/emits and pipeline params/outputs, instead of the old custom type qualifiers. Inputs of type Path are automatically staged. Thanks to the simplified dataflow logic described above, tuples are generally not needed.
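
    A sketch of a process declared in this style; the output binding shown is a guess for illustration:

    process FASTQC {
        input:
            meta: Map<String,String>
            reads: Path                    // Path inputs are staged automatically

        output:
            html: Path = file('*.html')    // assumed syntax for binding an output file

        script:
            """
            fastqc ${reads}
            """
    }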

Extra Notes

This proposed syntax will be enabled by the following internal improvements:

  • New script/config parser, which enables us to evolve the Nextflow language into whatever we want, without being constrained by Groovy syntax (though it still must compile to Groovy AST).

  • Static analysis, which can infer the type of every value based on the declared types of pipeline/workflow/process inputs.

  • Automatic generation of JSON schemas for workflow params and outputs based on type annotations. Preserves support for external tools like Seqera Platform, and lays the groundwork to transparently support different connectors (CSV/JSON file, HTTP API, SQL database, etc).
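
    For example, given the Sample record shown earlier, a param declared as input: List<Sample> might yield a JSON schema fragment roughly like the following (illustrative only; the generated schema may differ):

    {
      "input": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "meta": { "type": "object" },
            "files": {
              "type": "array",
              "items": { "type": "string", "format": "file-path" }
            }
          }
        }
      }
    }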

bentsherman avatar May 01 '24 04:05 bentsherman

FWIW, I have a tiny piece of feedback that came up after the previous discussion (which I wasn't aware of):

The fair keyword describes what is, to my knowledge, very often called "FIFO" (First-In, First-Out) in other contexts; might that have been a clearer name? (That said, perhaps it's not worth the change...)

samuell avatar May 02 '24 13:05 samuell

@samuell I would say just submit an issue for that; it is more of an API change than a syntax change.

bentsherman avatar May 02 '24 14:05 bentsherman

Reading through the suggestion in more detail now, I'm a little concerned about this one:

The DAG will be constructed at compile-time instead of run-time, which will allow the DAG to be more comprehensive -- include params and how they connect to processes, include conditional pipeline code (e.g. if-else statements), allow nextflow inspect to list every container that might possibly be used, etc

In my experience, there are some use cases that require run-time generated DAGs, for example when initiating pipeline structure based on values extracted as part of the workflow.

This is common, e.g., in machine learning, where you might run hyper-parameter tuning, which generates values that are sent to initialize downstream processes but might also influence how the DAG is generated downstream.

I've been writing about it before: https://bionics.it/posts/dynamic-workflow-scheduling

I'm not sure how well this applies here, but I want to raise a flag about it, since it is a real limitation we have run into with other pipeline systems (Luigi).

EDIT: Actually, since we are almost certainly talking about the DAG of processes and not the DAG of tasks, a compile-time DAG would still not rule out all dynamic scheduling (the dataflow paradigm of Nextflow does dynamic task scheduling inherently). Still, some cases of dynamic scheduling might be affected: those that require the process DAG structure to be defined based on the outcomes of previous computations.

samuell avatar May 03 '24 08:05 samuell

@samuell yes, I'm talking about the process (i.e. "abstract") DAG, which Nextflow already constructs before executing the pipeline. But it has to execute the script in order to do this, which limits its usefulness.

bentsherman avatar May 03 '24 14:05 bentsherman

How come there are new keywords (let, fn, etc.)? What's the difference from def?

mahesh-panchal avatar May 20 '24 12:05 mahesh-panchal

Just another idea to consider. With a formal grammar, we don't have to adhere so closely to Groovy; we can make whatever syntax we want, as long as it can be translated to Groovy AST. So as a demonstration I have replaced def with more specific keywords: fn for a function definition, let for a variable that can't be reassigned, and var for a variable that can be reassigned (essentially def vs final in Groovy).

Notice I also changed how types are specified: <name>: <type> instead of <type> <name>, which I personally like because it emphasizes the semantic name over the type, which is optional.
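
For reference, a small sketch of these keywords together (the return-type placement here assumes the same <name>: <type> convention and is not confirmed syntax):

    fn meanReadLength(lengths: List<Integer>): Float {
        let total = lengths.sum()     // let: cannot be reassigned
        var count = lengths.size()    // var: can be reassigned
        return total / count
    }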

bentsherman avatar May 20 '24 13:05 bentsherman

Is it going to be problematic if people combine Groovy and this grammar? For example, in exec: blocks.

mahesh-panchal avatar May 20 '24 14:05 mahesh-panchal

It would apply to all Nextflow code, including exec: blocks

bentsherman avatar May 20 '24 14:05 bentsherman

I understood it would apply to all, but my question was really whether people could mix grammars, and if so, what would happen, e.g.:

    exec:
    let some_var = do_stuff
    def another_thing = do_other_stuff

mahesh-panchal avatar May 20 '24 14:05 mahesh-panchal

We would either drop def in the next DSL version (a hard cut-off) or support it temporarily with a compiler warning

bentsherman avatar May 20 '24 19:05 bentsherman