splitFastq produces wrong number of outputs

Open qscacheri opened this issue 2 years ago • 0 comments

Bug report

Expected behavior and actual behavior

Given a pair of FASTQ files on S3 containing 171390646 reads, calling fastqChannel.splitFastq(by: params.chunkSize, pe: true, file: true) with chunkSize set to 10000000 does not produce the expected number of output files: only a single chunk is reported.
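For reference, the expected chunk count can be worked out from the two numbers above. This is a minimal sketch, assuming the read count applies to each file of the pair and that the reads are split into consecutive chunks of at most chunkSize reads. The variable names are illustrative only and are not part of the pipeline.

```groovy
// Values taken from the report above
long totalReads = 171390646L
long chunkSize  = 10000000L

// Ceiling division: 17 full chunks plus one partial chunk for the remainder
long expectedChunks = (totalReads + chunkSize - 1).intdiv(chunkSize)

println "Expected ${expectedChunks} chunks per read file"
// prints: Expected 18 chunks per read file
```

Under those assumptions, with pe: true one would expect about 18 emitted items (each carrying a pair of chunk files), not 1.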

Steps to reproduce the problem

#!/usr/bin/env nextflow 

nextflow.enable.dsl=2

process validate {
    input:
        path(fastqFiles)
    output:
        path("*.f*q*"), includeInputs: true, emit: fastqFiles
        stdout emit: logs
    shell:
    '''
    for f in !{fastqFiles}; do
        echo "${f}:"
        du -h $(realpath $f)
    done
    '''
}

workflow {
    fastqsChannel = Channel.fromPath(params.fastqFiles)
    validate(fastqsChannel)
    
    groupedFastqs = validate.out.fastqFiles
    .map { file ->
        // Capture the sample ID that precedes the read suffix (e.g. _R1, _R2, -1, -2)
        def m = file =~ /.*\/([\w\d\-_]+)?[\-_]R?[12]/
        return tuple(m[0][1], file)
    }
    .groupTuple()
    .map { tuple(it[0], it[1][0], it[1][1]) }

    chunksChannel = groupedFastqs.splitFastq(by: params.chunkSize, pe: true, file: true)
    chunksChannel.subscribe { println "Created chunk ${it}"}
    chunksChannel.count().view { "Created ${it} chunks" }

}

Program output

Prints "Created 1 chunks".

Environment

  • Nextflow version: 22.09.3.edge build 5767
  • Java version:
  • Operating system: Linux
  • Bash version: (use the command $SHELL --version)

Additional context

I'm testing on AWS Batch because the files are too big for me to test locally.

qscacheri avatar Sep 16 '22 20:09 qscacheri