
[New Feature] Evaluate Closure for every input file

Open Lehmann-Fabian opened this issue 2 years ago • 5 comments

Until now, Nextflow has evaluated the closure used to stage multiple input files only once. As a result, it cannot produce individual staging names for the different files in one channel item or task. However, evaluating the closure for every file can be helpful, as requested here: https://github.com/nextflow-io/nextflow/discussions/1998. This PR solves the problem without changing the original logic. If a closure produces the same name for several files, an increasing counter is added to those names. I also considered applying this to the current logic, for the case where you stage in as * and multiple files have the same name. However, that would suppress the collision warnings that some users may expect and use for debugging.

For example, the following code shows how to keep folder structures for inputs.

fasta = Channel.fromPath( "/root/*/*.fa" ).buffer(size:10, remainder: true)
process blastThemAll {

    input:
    file {"${sourceObj.parent}/${sourceObj.name}.fa"} from fasta

    """
    find . -name "*"
    """

}
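
The per-file evaluation and the collision counter described above can be modelled outside Nextflow. The Python sketch below is only an illustration, not the PR's Groovy implementation: `stage_names`, `name_for`, and the `_<n>` suffix format are hypothetical, because the thread does not say how the counter is attached to a name.

```python
from pathlib import PurePosixPath
from typing import Callable, Iterable, List


def stage_names(sources: Iterable[str],
                name_for: Callable[[PurePosixPath], str]) -> List[str]:
    """Call name_for once per source file and make the results unique.

    Assumption: a colliding name gets a "_<n>" suffix; the real suffix
    format used by the PR is not stated in the thread.
    """
    used = set()
    result = []
    for src in sources:
        base = name_for(PurePosixPath(src))
        name, counter = base, 0
        # Increase the counter until the candidate name is unused.
        while name in used:
            counter += 1
            name = f"{base}_{counter}"
        used.add(name)
        result.append(name)
    return result


if __name__ == "__main__":
    files = ["/root/a/x.fa", "/root/b/x.fa", "/root/a/y.fa"]
    # Keeping the parent folder in the name avoids collisions entirely.
    print(stage_names(files, lambda p: f"{p.parent.name}/{p.name}"))
    # Using only the file name triggers the counter for the second x.fa.
    print(stage_names(files, lambda p: p.name))
```

With a naming function that includes the parent folder, no counter is needed; the counter only appears when the function maps two different files to the same name.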

Lehmann-Fabian avatar Feb 03 '22 08:02 Lehmann-Fabian

For datacube-structured Earth Observation datasets, this PR would be extremely helpful!

davidfrantz avatar May 31 '22 14:05 davidfrantz

:warning: 7 God Classes were detected by Lift in this project. Visit the Lift web console for more details.

sonatype-lift[bot] avatar Sep 28 '22 09:09 sonatype-lift[bot]

Hi @pditommaso, I am reaching out regarding this PR that has been open for over a year without any action but is still of great interest. This PR allows you to dynamically name files if you stage a list of files into a process. This is particularly helpful if you want to create a dynamic folder structure.

To give you an example of why this PR is needed: we have been asked to port our Rangeland workflow to nf-core. In this workflow, we use FORCE, a tool that organizes files in folder structures, which Nextflow does not support out of the box. As a result, we had to manually rename files in some instances, such as in the code snippet provided here.

I would greatly appreciate it if you could take some time to review this PR and provide feedback on any changes that could be made to improve it.

Lehmann-Fabian avatar Mar 20 '23 16:03 Lehmann-Fabian

Can you please remind me what you are trying to solve? Nextflow already supports dynamic file name resolution. For example, given this

» tree data/
data/
├── one
│   └── file.txt
├── three
│   └── file.txt
└── two
    └── file.txt

and using this script

process foo {
  debug true
  input: 
  tuple val(name), path("$name/*")

  '''
  tree .
  '''
}

workflow {
  channel.fromPath('data/**/*.txt').map { tuple(it.parent.name, it) } | foo 
}

It returns

.
└── three
    └── file.txt -> /Users/pditommaso/demo/data/three/file.txt

1 directory, 1 file

.
└── two
    └── file.txt -> /Users/pditommaso/demo/data/two/file.txt

1 directory, 1 file

.
└── one
    └── file.txt -> /Users/pditommaso/demo/data/one/file.txt

pditommaso avatar Mar 28 '23 14:03 pditommaso

Thank you very much for getting back to me on this. Sure: I extended the case in your example so that it also works for more than one file. With this change, you can pass multiple files into a single task while keeping their original directory structure. With the input path("$name/*"), the name is fixed for all files if the task has more than one input file.

Let me extend your input:

tree data/
├── one
│   ├── file1.txt
│   ├── file2.txt
│   └── file3.txt
├── three
│   ├── file1.txt
│   ├── file2.txt
│   └── file3.txt
└── two
    ├── file1.txt
    ├── file2.txt
    └── file3.txt

Now, in your Nextflow script, I group the files by name: all file1 together, all file2 together, and so on.

workflow {
  channel.fromPath('/execution/data/**/*.txt').map { tuple(it.name, it) }.groupTuple().map{ it[1] } | foo 
}
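
To make the grouping step concrete, here is a small Python model of what the `map`, `groupTuple`, `map` chain above computes. It is a sketch, not Nextflow code: `group_by_name` is a hypothetical helper, and it returns groups in first-seen order, whereas the order in which Nextflow emits and runs the grouped items is not guaranteed.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Iterable, List


def group_by_name(paths: Iterable[str]) -> List[List[str]]:
    """Group paths by file name and drop the key, like .map{ it[1] }."""
    groups = defaultdict(list)
    for path in paths:
        # The key plays the role of tuple(it.name, it) in the workflow.
        groups[PurePosixPath(path).name].append(path)
    return list(groups.values())


if __name__ == "__main__":
    paths = [
        f"/execution/data/{folder}/file{i}.txt"
        for folder in ("one", "three", "two")
        for i in (1, 2, 3)
    ]
    # Each inner list corresponds to the input of one task.
    for task_input in group_by_name(paths):
        print(task_input)
```

Each resulting list holds three files with the same name from three different folders, which is exactly the situation in which a single fixed staging name is not enough.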

With the current Nextflow version, I wouldn't be able to get the following:

[74/c871e1] process > foo (2) [100%] 3 of 3 ✔
.
├── one
│   └── file3.txt -> /execution/data/one/file3.txt
├── three
│   └── file3.txt -> /execution/data/three/file3.txt
└── two
    └── file3.txt -> /execution/data/two/file3.txt

3 directories, 3 files

.
├── one
│   └── file1.txt -> /execution/data/one/file1.txt
├── three
│   └── file1.txt -> /execution/data/three/file1.txt
└── two
    └── file1.txt -> /execution/data/two/file1.txt

3 directories, 3 files

.
├── one
│   └── file2.txt -> /execution/data/one/file2.txt
├── three
│   └── file2.txt -> /execution/data/three/file2.txt
└── two
    └── file2.txt -> /execution/data/two/file2.txt

3 directories, 3 files

But this worked with my adjustment after changing the input to:

input: 
path ("${sourceObj.parent.name}/*")
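
The difference between evaluating the staging name once per task and once per file can be illustrated with a simplified Python model. This is a sketch under assumptions, not Nextflow's staging code: `stage_once` and `stage_per_file` are hypothetical helpers, and `*` is modelled as "keep the original file name".

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def stage_once(sources: Iterable[str], folder: str) -> List[str]:
    """One folder name for the whole task: every file lands in `folder`."""
    return [f"{folder}/{PurePosixPath(s).name}" for s in sources]


def stage_per_file(sources: Iterable[str]) -> List[str]:
    """Folder name derived from each file's own parent directory."""
    return [
        f"{PurePosixPath(s).parent.name}/{PurePosixPath(s).name}"
        for s in sources
    ]


if __name__ == "__main__":
    task_input = [
        "/execution/data/one/file2.txt",
        "/execution/data/three/file2.txt",
        "/execution/data/two/file2.txt",
    ]
    # All three targets are identical, so the files would collide.
    print(stage_once(task_input, "one"))
    # Each file keeps its own parent folder, so the layout is preserved.
    print(stage_per_file(task_input))
```

In the first case, the three files map to the same target path; in the second, the original folder layout is reproduced inside the task directory.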

This way of organizing data is frequently used for data cubes in remote sensing, so supporting it would make Nextflow easier to use for remote sensing workflows based on data cubes.

Lehmann-Fabian avatar Mar 29 '23 12:03 Lehmann-Fabian

Deploy Preview for nextflow-docs-staging ready!

Name Link
Latest commit b526cf9ab849a4fa43b571dde2a6584f0b4801dc
Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/64bfd3e4add64d0008c58f26
Deploy Preview https://deploy-preview-2622--nextflow-docs-staging.netlify.app/process

netlify[bot] avatar Jul 25 '23 13:07 netlify[bot]