
Inputs copied to work directory from scratch directory in GATK4SPARK_APPLYBQSR

Open johnoooh opened this issue 9 months ago • 3 comments

Have you checked the docs?

Description of the bug

We are using GATK4SPARK_APPLYBQSR in one of our pipelines together with the interval function, so many GATK4SPARK_APPLYBQSR tasks run at the same time for one sample. After a run we noticed that the work directory was very large, and discovered that the input BAM was being copied from the scratch directory to the work directory along with the output BAM. We believe the cause is that the glob in the output declaration is not specific enough when the scratch directory is used (process.scratch=true). This was reported in 3995 but was ultimately not fixed.

tuple val(meta), path("*.bam") , emit: bam, optional: true

The input BAM is not included in the output channel, but it is still copied to the task work directory. This is mentioned in the Nextflow docs; the relevant section is copied below.

Although the input files matching a glob output declaration are not included in the resulting output channel, these files may still be transferred from the task scratch directory to the original task work directory. Therefore, to avoid unnecessary file copies, avoid using loose wildcards when defining output files, e.g. path '*'. Instead, use a prefix or a suffix to restrict the set of matching files to only the expected ones, e.g. path 'prefix_*.sorted.bam'.
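To make the point in the quoted docs concrete, here is a small Python sketch of which files a loose pattern selects compared with an exact name. The file names and the `sample1.recal` prefix are hypothetical, and `fnmatchcase` only approximates Nextflow's glob matching (it behaves the same for flat file names like these).

```python
from fnmatch import fnmatchcase

# Hypothetical contents of a task scratch directory after the command finishes:
# a staged input BAM with its index, the recalibrated output BAM, and versions.yml.
scratch_files = [
    "sample1.md.bam",       # staged input (hypothetical name)
    "sample1.md.bam.bai",   # staged input index
    "sample1.recal.bam",    # task output, i.e. "${prefix}.bam" with prefix = sample1.recal
    "versions.yml",
]


def matched(pattern, files):
    """Return the files whose full name matches the glob pattern."""
    return [name for name in files if fnmatchcase(name, pattern)]


loose = matched("*.bam", scratch_files)               # loose wildcard
strict = matched("sample1.recal.bam", scratch_files)  # exact expected output name

print(loose)   # ['sample1.md.bam', 'sample1.recal.bam']
print(strict)  # ['sample1.recal.bam']
```

With the loose pattern, the staged input BAM is selected for transfer alongside the real output; with the exact name, only the output is.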

The output block should be changed to something like this in order to avoid this issue.

output:
    tuple val(meta), path("${prefix}.bam") , emit: bam,  optional: true
    tuple val(meta), path("${prefix}.cram"), emit: cram, optional: true
    path "versions.yml"            , emit: versions

The same issue was brought up for another module here https://github.com/nf-core/modules/issues/3504 and fixed in the same way.

This issue may be present in other modules like GATK4_ADDORREPLACEREADGROUPS. We should probably be more specific in our outputs, and not just glob everything with *.bam or *.cram, as this leads to more disk usage in the work directory in this scenario. I'm going to make a PR for GATK4SPARK_APPLYBQSR for now, but be on the lookout for other modules that are like this.

Also, this issue may be present in sarek as well, causing the work directory to balloon on HPC systems using scratch.

System information

nextflow version 24.10.5.5935
CentOS Linux release 7.9.2009 (Core)
HPC with LSF scheduler
Singularity 3.3.0

johnoooh, Mar 13 '25 20:03

Raised this on Slack as well, as I think this has wider implications than just APPLYBQSR: https://nfcore.slack.com/archives/C043UU89KKQ/p1742808465063059

And will add this to the next maintainers meeting

FriederikeHanssen, Mar 24 '25 09:03

I think this is a limitation of the scratch directive. Normally, an output glob will exclude matching input files unless the includeInputs option is specified, because this part is handled by the Nextflow runtime. But the scratch directive is implemented by the .command.run script, which simply copies the glob from scratch to work without any consideration of input files.

So unless someone can come up with some bash magic to perform the same input file exclusion in the .command.run, we might just need to document this limitation for the scratch directive.
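As a rough model of the behaviour described here (not the actual .command.run logic, which is a bash script), the Python sketch below copies glob matches from a scratch directory to a work directory, first without and then with an exclusion list of staged inputs. The `unstage` helper, its `staged_inputs` parameter, and the file names are all hypothetical.

```python
import shutil
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path


def unstage(scratch, work, pattern, staged_inputs=()):
    """Copy files matching `pattern` from scratch to work, skipping staged inputs."""
    work.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(scratch.iterdir()):
        if fnmatchcase(path.name, pattern) and path.name not in staged_inputs:
            shutil.copy2(path, work / path.name)
            copied.append(path.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp, "scratch")
    scratch.mkdir()
    for name in ("in.bam", "out.bam"):  # hypothetical staged input and task output
        (scratch / name).write_bytes(b"")

    # Behaviour described above: everything matching the glob is copied.
    naive = unstage(scratch, Path(tmp, "work_naive"), "*.bam")
    # Hypothetical fix: the copy step is told which names were staged as inputs.
    filtered = unstage(
        scratch, Path(tmp, "work_filtered"), "*.bam", staged_inputs={"in.bam"}
    )

print(naive)     # ['in.bam', 'out.bam']
print(filtered)  # ['out.bam']
```

The exclusion itself is simple once the list of staged input names is available to the copy step; whether that list can be passed cleanly into the generated script is the open question.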

bentsherman, Mar 24 '25 10:03

I have been working on some of the other GATK modules for this hackathon. MarkDuplicates is in progress right now. This issue isn't a huge deal most of the time; it only really starts to fill up your work directory when you're working with large BAMs or FASTQs. That's why I noticed it with APPLYBQSR. So the issue comes up rarely, but I have had it fill up my work directory with 30TB on a WGS run. Currently I'm focusing on modules where the duplication would be really problematic.

Thanks for the attention on this issue!

johnoooh, Mar 25 '25 20:03