
Ability to pipe stdout -> stdin between processes

Open ewels opened this issue 1 month ago • 15 comments

This suggestion / request has come up several times, so I wanted to collect the latest thread into a GitHub issue (cc @nh13 @kmhernan @mahesh-panchal @adamrtalbot @muffato)

Unix pipes are a powerful way to stream data from one process to another without needing to write intermediate data to disk. Currently this is not possible between processes in Nextflow. Current practice is either to write intermediate files, or to put multiple tools into a single process and pipe within that single script block.
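At the shell level, the two current workarounds produce the same result but differ in I/O cost. A minimal sketch with generic commands (not from the thread):

```shell
# Workaround 1: intermediate file, as with separate Nextflow processes today.
printf 'line1\nline2\n' > intermediate.txt
grep line2 intermediate.txt > result_file.txt

# Workaround 2: a Unix pipe inside a single script block - nothing hits
# disk between the two tools.
printf 'line1\nline2\n' | grep line2 > result_pipe.txt

# Both produce the same output.
cmp -s result_file.txt result_pipe.txt && echo identical
```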

Support in Nextflow is not a new request, but is technically challenging for several reasons:

  • Making this work with distributed (cloud) clusters
    • May be possible with task batching? https://github.com/nextflow-io/nextflow/pull/3909
  • Figuring out how publishing + retry would work

Piping output between containers does work:

Singularity

singularity exec img1.sif cmd1 | singularity exec img2.sif cmd2

Docker

docker run ubuntu printf "line1\nline2\n" | docker run -i ubuntu grep line2 | docker run -i ubuntu sed 's/line2/line3/g'
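A related wrinkle for the retry question above (my note, not from the thread): by default a shell pipeline's exit status is that of its last command, so a failure in an upstream container can go unnoticed unless `pipefail` is set. A minimal bash illustration:

```shell
#!/bin/bash
# Default: the pipeline's status is the last command's status only.
false | true
echo "default status: $?"    # prints 0 - the upstream failure is hidden

# With pipefail, any failing stage fails the whole pipeline.
set -o pipefail
false | true
echo "pipefail status: $?"   # prints 1
```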

@mahesh-panchal has written a minimal example demo using named pipes:

And this was the demo I wrote for using named pipes. The issue there was clean-up and potential process deadlock. It would likely work with containers too, though. So just for completeness: one can send a pipe, but cleaning up (i.e. removing the pipe afterwards) is not simple because of the working-directory isolation.

workflow {
    MKFIFO()
    SENDER( params.message, MKFIFO.out.pipe )
    RECEIVER( MKFIFO.out.pipe ) | view
}

process MKFIFO {
    output:
    path "mypipe", emit: pipe

    script:
    """
    mkfifo mypipe
    """
}

process SENDER {
    input:
    val message
    path pipename

    output:
    path pipename

    script:
    """
    echo $message > $pipename
    """
}

process RECEIVER {
    input:
    path pipename

    output:
    stdout

    script:
    """
    cat $pipename
    """
}

And this is bad practice, as it could easily lead to a process deadlock. The reason for this structure is that named pipes block until they're read from (i.e. they stop the process from completing).
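To see that blocking behaviour outside Nextflow, here is a plain-shell sketch: opening a FIFO for writing blocks until a reader opens the other end, which is why SENDER and RECEIVER must run concurrently, and why a missing reader deadlocks the writer:

```shell
dir=$(mktemp -d)
mkfifo "$dir/mypipe"

# Start the reader in the background first; a write to the FIFO would
# otherwise block forever waiting for a reader.
cat "$dir/mypipe" > "$dir/received.txt" &

# Now the write unblocks as soon as the reader is attached.
echo "hello" > "$dir/mypipe"
wait

cat "$dir/received.txt"   # prints "hello"

# Clean-up is trivial here, but hard across Nextflow's isolated work dirs.
rm -r "$dir"
```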

ewels avatar May 27 '24 06:05 ewels