
Task chaining

Open bentsherman opened this issue 2 years ago • 3 comments

New feature

This idea was inspired in part by this blog post as well as my own work with the GEMmaker workflow.

Usage Scenario

Consider a pipeline, such as a typical RNAseq pipeline, that processes many samples in parallel, but each sample goes through a sequence of processes. This setup has several pain points -- provisioning separate jobs / VMs for each process, moving data from process to process, using local scratch storage effectively, downloading container images, etc.

As the above blog post demonstrates, these pain points can be alleviated by isolating each individual sub-workflow (processing a single sample) to a single node. This way, only one job / VM is provisioned per sample, local scratch can be used in a way that significantly reduces traffic to shared storage, and it's easier to control the level of parallelism. Also, if a single container image is used for the entire sub-workflow, then it can be pulled once and cached.

Suggested Implementation

Basically there are three ways to implement this sort of data locality. Each approach is syntax sugar atop the previous approach:

  1. Condense the sub-workflow into a single Nextflow process. You then have to put all the steps into a single container, you can't track usage of individual processes anymore, and you lose the ability to recover individual processes on a resume.

  2. Split the workflow into two workflows -- the first defines the sub-workflow, and the second runs each sub-workflow as a standalone Nextflow pipeline with the local executor so that each sub-workflow is executed on a single node. The processes are kept separate as they were, but the sub-workflow runs (and their caches) are not integrated with the main workflow run.

  3. Extend Nextflow to do (2) natively when a sub-workflow is annotated accordingly (with a directive or something). Nextflow should be able to take an arbitrary workflow, provision a task with a "nextflow-in-nextflow" task script that runs nextflow on the sub-workflow.
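
To make (2) concrete, here is a rough sketch of what the outer workflow might look like. This is only an illustration, not the blog post's actual code: the script name `per_sample.nf`, the `--reads` / `--outdir` parameters, and the `samples/*.fastq.gz` glob are all made-up placeholders, and it assumes `nextflow` is available on the node (or in the container) where the task runs.

```nextflow
// Hypothetical outer workflow for approach (2).
// All file names and parameters below are assumptions for illustration.
nextflow.enable.dsl = 2

params.samples = 'samples/*.fastq.gz'            // hypothetical input glob
params.inner   = "${projectDir}/per_sample.nf"   // hypothetical sub-workflow script

process RUN_SAMPLE {
    // One job / VM per sample. The task runs in node-local scratch, so the
    // inner run's intermediate files never touch shared storage.
    scratch true

    input:
    path reads
    path inner_script

    output:
    path 'results'

    script:
    // The inner run uses the local executor (Nextflow's default), so every
    // step of the sub-workflow executes on the node that hosts this task.
    // Assumes the inner pipeline writes its final outputs to --outdir.
    """
    nextflow run ${inner_script} --reads ${reads} --outdir results
    """
}

workflow {
    RUN_SAMPLE(Channel.fromPath(params.samples), file(params.inner))
}
```

Note that the inner run keeps its own work directory and cache inside the task's scratch space, so the outer run can only resume at the granularity of a whole sample. That is the gap (3) would need to close.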

My e2e branch of GEMmaker implemented (1). The blog post implemented (2). In my mind, (3) is what we actually need in both cases.

The details on this idea still need to be worked out. But I think this feature would improve performance for a lot of Nextflow pipelines. Would love to hear from other users on whether this idea makes sense, how it should be implemented, etc.

bentsherman, Dec 27 '21 20:12