Automatically delete files marked as temp as soon as not needed anymore
To reduce the footprint of larger workflows it would be very useful if temporary files (which are marked as such) could be automatically deleted once they are no longer used. Yes, this breaks reruns, but for easily recomputed files or very large ones (footprint), this makes sense. Using scratch (see #230) is not always possible/wanted (e.g. for very large files and small scratch space). It's also not always possible for the user to delete those files (except at the very end of the workflow), because multiple downstream processes running at different times might require them. This feature is implemented in Snakemake, for example, but maybe it's easily done there because the DAG is computed in advance?
Note, this is different from issue #165 where the goal was to remove non-declared files. The issue contains a useful discussion of the topic nevertheless.
Andreas
I'm adding this for reference. I agree that intermediate file handling needs to be improved but it will require some internal refactoring. Need to investigate. cc @joshua-d-campbell
I've brainstormed a bit more about this issue and actually it should be possible to remove intermediate output files without compromising the resume feature.
First problem, the runtime-generated DAG: the execution graph is only generated at runtime, but it's generally fully resolved immediately after the workflow execution starts. Therefore it would be enough to defer output deletion until after the full resolution of the execution DAG. That's just after the run invocation and before the `terminate`.
The second problem is how to identify tasks eligible for output removal. This could be done by intercepting a (successful) task completion event: infer the upstream tasks in the DAG (easy) and, if ALL dependent tasks have been successfully completed, clean up the upstream task's work directory (note that each task can have more than one downstream task). Finally, any task whose outputs have been removed must be marked with a special flag, e.g. `cached=true` in the trace record.
Third, the resume process needs to be re-implemented to take this logic into consideration. Currently, when the `-resume` flag is specified the pipeline is just re-executed from the beginning, skipping the processes for which the output files already exist; all (dataflow) output channels are then created by binding the output files to those channels.
With the new approach this is not possible any more because the files are deleted, therefore execution has to be skipped up to the first successfully executed task for which the (above) `cached` flag is not true. This means that the output files of the last executed task can be picked up and re-injected into the dataflow network to restart it.
This may require introducing a new `resume` command (#544). It could also be used to implement a kind of dry-run feature as suggested in #844. Finally, this could also solve #828.
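The skip/re-inject decision could look something like this sketch (hypothetical names, not Nextflow internals): a completed task whose outputs were deleted can still be skipped, but only if every consumer of those outputs can be skipped too, and the "frontier" tasks are the ones whose still-existing outputs get re-injected:

```python
# Hypothetical sketch of resume when some outputs have been deleted.
from dataclasses import dataclass, field

@dataclass(eq=False)
class Task:
    name: str
    completed: bool = False
    cleaned: bool = False               # outputs deleted (cached=true in trace)
    downstream: list = field(default_factory=list)

def can_skip(task, memo):
    if task not in memo:
        if not task.completed:
            memo[task] = False          # never ran: must execute
        elif not task.cleaned:
            memo[task] = True           # outputs on disk: safe to skip
        else:                           # outputs gone: consumers decide
            memo[task] = all(can_skip(c, memo) for c in task.downstream)
    return memo[task]

def restart_frontier(tasks):
    """Completed tasks whose outputs still exist and feed a task that must
    run: their outputs are re-injected into the dataflow network."""
    memo = {}
    return [t for t in tasks
            if can_skip(t, memo) and not t.cleaned
            and any(not can_skip(c, memo) for c in t.downstream)]
```

For a linear chain where the first two tasks were cleaned, the third completed with outputs intact, and the fourth never ran, the frontier is the third task.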
My two cents: if you can use a flag for indexing the processes (i.e. the sample name) you can define a terminal process that, once completed, triggers deletion of the folders connected to that ID. I'm imagining a situation like this:

[sampleID][PROCESS 1] = COMPLETED
[sampleID][PROCESS 2] = COMPLETED
[sampleID][PROCESS 3] = COMPLETED
[sampleID][PROCESS 4 / TERMINAL] = COMPLETED
then remove the folders of PROCESS 1 / 2 / 3 / 4 for that [sampleID]
In case you need to resume the pipeline, these samples will be re-run if the data are still in the input folder.
Quick comment: with this feature it will be possible to keep an instance of nextflow running (with watchPath) without having storage problems.
So at a high level, I think I'm missing something. If the state data remains in files, removing old items is a good thing to do, but will this increase filesystem IO contention and locking as we increase the scale of analysis?
Since each Nextflow task has its own work directory, and those directories would be deleted when the data is not needed (read: accessed) any more, I don't see why there should be any IO contention on those files. Am I missing something?
I was thinking that maybe a directive allowing the removal of input files when a process is finished would help reduce the amount of space needed by a workflow. It should allow removing the whole folders containing the input files, so that we reduce the number of folders too.
Of course this will not work if these files will be needed by other processes.
Maybe with the new DSL2, where you have to make the graph explicit, this can be achieved. If the cleaning conflicts with a workflow / process, an error can be triggered.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I like the ideas in this thread. Automatic removal of "intermediate process files" would be great.
This feature would be a game changer. As an example, one of our pipelines has to process ~10 GB of data but produces ~100 GB of temporary data, so the bottleneck is not CPU or memory but disk space. This severely limits the throughput of our lab and results in poor utilization of processing power.
I'm running into this with a pipeline that has similar characteristics to what @olavurmortensen is describing: the temporary files produced by one tool are very large, so while this workflow's output is maybe a couple hundred gigs, it will need something like 7,000+ GB of disk space during execution.
That said, is there any reason that temporary file cleanup isn't the purview of the `process`'s `script`? There are several ways to delete anything that doesn't match a specific pattern in bash, thereby removing all temporary files except the known inputs/outputs.
I agree that this feature would be extremely powerful 👍🏻
Maybe also worth noting that it would be good to keep the ability to not auto-clean the intermediate files. A common(ish) use case for us in @nf_core is to resume a pipeline with different parameters, for example using a different aligner but still reusing the cached steps for the initial preprocessing. But the majority of the time the auto-cleaning would be what most users want I think 🧹🧽
I am in the same situation, running a workflow with thousands of samples that would need dozens of TB. Lots of bam files are needed only by 2 steps and then not any more, and could be deleted. This feature would be very valuable 👍🏻
Just a note that this issue came up again on the nf-core Slack and @spficklin has even implemented it at workflow level in https://github.com/SystemsGenetics/GEMmaker
So there is definitely a need for this..
To add to @ewels comment. For our GEMmaker workflow (https://github.com/SystemsGenetics/GEMmaker), we needed a workflow that could process 26 thousand RNA-Seq samples from NCBI SRA and we just didn't have the storage to deal with all of the intermediate files (and we had a dedicated 600TB storage system). I think this is really a critical issue for Nextflow because it prevents workflows from massively scaling. Once you hit your storage limit you're done, and if you're using the cloud you incur unnecessary storage costs. The only work around is to run the workflow separately for separate groups of samples so as not to overrun storage limits and that's just very cumbersome with massive numbers of samples.
The solution was really two-fold. In order to not overrun storage we had to first batch and then clean. We found that Nextflow tended to run "horizontally" (for lack of a better word) rather than "vertically": it tended to run all samples through step 1 before it would move on to step 2. So even if we did clean intermediate files a few steps later, we would still overrun storage, because we had intermediate files from earlier steps for 26K samples.
To batch samples, we had to implement a folder system where initially we had a metadata file for each sample in a `stage` folder. The workflow moves a subset (equal to the number of cores specified) into a `processing` folder, and only works on samples with a metadata file in the `processing` folder. Once a sample is fully complete its file moves from `processing` to a `done` folder, and a new sample file is added to the `processing` folder, which the workflow sees and starts processing.
To clean intermediate files we had to trick Nextflow into thinking the files were still there by wiping intermediate files and replacing them with "sparse" files. Essentially we replace the file with an empty version but because the file is "sparse" it still reports as the same size (but doesn't actually consume space on the file system). We have a little BASH script to do that (https://github.com/SystemsGenetics/GEMmaker/blob/master/bin/clean_work_files.sh).
I'm not necessarily advocating that Nextflow follow this approach. It adds complexity to the workflow, with a bunch of channels used just for signaling. But I describe it in case others want to borrow the idea until there is an official fix. I also wanted to point out the need to batch before cleaning, because cleaning alone doesn't necessarily solve the storage problem as the workflow runs.
Reminder for my future self so that I don't lose this again: https://github.com/nextflow-io/nextflow/issues/649 - there is an existing option `cleanup = true` that can be used at the base of a config and will wipe all intermediate `work` folders if the workflow exits successfully. This is (currently) undocumented.
Note that it's not the same as the feature requested in this issue, as it runs once the pipeline is complete and not as it goes along. But it may still be relevant for some people ending up here from Google..
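For reference, the option goes at the top level of the configuration file, with exactly the behaviour described above (intermediates are wiped only on a successful exit):

```groovy
// nextflow.config
// Undocumented: remove all intermediate work directories when the
// workflow completes successfully.
cleanup = true
```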
Thanks for sharing that @ewels , this is pretty handy.
In case you have tried this option, does it affect the resume functionality?
> In case you have tried this option, does it affect the resume functionality?
I haven't tried it but @jfy133 did. And yes, it wipes all intermediate files so resume is definitely dead.
Thanks for sharing this!
Note that whilst related, #2135 and the `cleanup = true` config are different to what was requested here. This issue was originally about deleting intermediate files during pipeline execution, after downstream tasks requiring the files are complete.
Indeed
I was thinking about this some time ago. I think one of the problems will always be that Nextflow runs the processes in a horizontal way: if we have 10 steps, it is likely we will run the first one in parallel and only reach the last process, when cleaning is allowed, at the very end. So I would link this problem to the possibility of running a pipeline vertically in batches (i.e. you read XXX samples, move through the other steps, and only when you complete the workflow for an input file do you trigger a new execution).
@lucacozzuto yes! we've hit on that as a problem in our workflow and had to employ a hacky solution just like you suggested to get around it (see my post above). I want to add my agreement with your comment.
A possible trick could be allowing a new directive for making batches: you indicate that process X can consume N items from the input channel and then pause. When the other processes are finished, there is a clean-up and a new start. We could use something similar to `storeDir` to avoid recalculating something useful each time. The only problem would be a process that needs to be triggered when all the batches are processed...
Ehehehe, I'll bring you more coffee because of that
Just for my info, is there any prospect of this being addressed in the near future?
The inability to instantly delete intermediates (as e.g. Snakemake can do) is hitting us hard right now due to some stricter quotas, and our workflows are complex enough without doing https://github.com/nextflow-io/nextflow/issues/452#issuecomment-819733868.
Same here, I must admit. The overhead for some nf-core pipelines is 10X the raw data, which is a lot for our HPC environment.
Same problem here: I currently need to run nf-core/rnaseq on 700 samples but can't efficiently deal with the temporary file space requirements. I'll implement the batch & clean approach suggested by @spficklin (thank you so much for sharing the scripts!)
Hi @ChiBia we have this infrastructure built into GEMmaker (https://gemmaker.readthedocs.io/en/latest/). Our workflow does not have as many options as the nf-core/rnaseq workflow but may save you time from implementing your own rna-seq workflow.