
Automatically delete files marked as temp as soon as not needed anymore

Open andreas-wilm opened this issue 7 years ago • 54 comments

To reduce the footprint of larger workflows it would be very useful if temporary files (which are marked as such) could be automatically deleted once they are not used anymore. Yes, this breaks reruns, but for easily recomputed files or large ones (footprint), this makes sense. Using scratch (see #230) is not always possible/wanted (e.g. for very large files and a small scratch area). It's also not always possible for the user to delete those files (except at the very end of the workflow), because multiple downstream processes running at different times might require them. This feature is, for example, implemented in Snakemake, but maybe it's easier to do there because the DAG is computed in advance?

Note, this is different from issue #165, where the goal was to remove non-declared files. That issue nevertheless contains a useful discussion of the topic.

Andreas

andreas-wilm avatar Sep 15 '17 08:09 andreas-wilm

I'm adding this for reference. I agree that intermediate file handling needs to be improved, but it will require some internal refactoring. Need to investigate. cc @joshua-d-campbell

pditommaso avatar Sep 19 '17 07:09 pditommaso

I've brainstormed a bit more about this issue and actually it should be possible to remove intermediate output files without compromising the resume feature.

First problem, the runtime-generated DAG: though the execution graph is only generated at runtime, it's generally fully resolved immediately after the workflow execution starts. Therefore it would be enough to defer the output deletion until the execution DAG has been fully resolved, i.e. just after the run invocation and before the terminate.

Second problem is how to identify tasks eligible for output removal. This could be done by intercepting a task's (successful) completion event: infer the upstream tasks in the DAG (easy) and, if ALL dependent tasks have been successfully completed, clean up the upstream task's work directory (note that each task can have more than one downstream task). Finally, a task whose outputs have been removed must be marked with a special flag, e.g. cached=true, in the trace record.
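Roughly, that eligibility check could look like the sketch below (names such as dag, upstreamOf, downstreamOf and deleteWorkDir are hypothetical helpers for illustration, not actual Nextflow internals):

```groovy
// Hypothetical sketch; none of these names are real Nextflow internals.
void onTaskComplete(Task task) {
    // look at the tasks whose outputs this task consumed
    dag.upstreamOf(task).each { upstream ->
        // an upstream task is eligible only once every consumer of its outputs is done
        def allConsumersDone = dag.downstreamOf(upstream)
                                  .every { it.status == Status.COMPLETED }
        if (allConsumersDone) {
            deleteWorkDir(upstream)       // remove the task work directory
            upstream.trace.cached = true  // the special flag used by the resume logic
        }
    }
}
```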

Third, the resume process needs to be re-implemented to take this logic into consideration. Currently, when the -resume flag is specified, the pipeline is just re-executed from the beginning, skipping the processes for which the output files already exist. However, all (dataflow) output channels are still created by binding the existing output files to those channels.

Using the new approach this is not possible any more because the files are deleted; therefore the execution has to be skipped up to the first successfully executed task for which the (above) cached flag is not true. This means the output files of the last executed task can be picked up, re-injected into the dataflow network, and used to restart it.

This may require introducing a new resume command (#544). It could also be used to implement a kind of dry-run feature as suggested by #844. Finally, this could also solve #828.

pditommaso avatar Aug 28 '18 13:08 pditommaso

My two cents: if you can use a key for indexing the processes (i.e. the sample name), you can define a terminal process that, once completed, triggers a deletion of the folders connected to that ID. I'm imagining a situation like this:

[sampleID][PROCESS 1] = COMPLETED
[sampleID][PROCESS 2] = COMPLETED
[sampleID][PROCESS 3] = COMPLETED
[sampleID][PROCESS 4 / TERMINAL] = COMPLETED

then remove the folders of PROCESS 1 / 2 / 3 / 4 for that [sampleID].

In case you need to resume the pipeline, these samples will be re-run if the data are still in the input folder.
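A minimal DSL2 sketch of that idea (the process names, the per_sample_results channel and the intermediates/<sampleID> folder layout are all hypothetical):

```groovy
// Hypothetical sketch: assumes per-sample intermediates are grouped under intermediates/<sampleID>
process PROCESS_4_TERMINAL {
    input:
    tuple val(sampleID), path(result)

    output:
    val sampleID, emit: done

    script:
    """
    echo "sample ${sampleID} completed"
    """
}

workflow {
    // ... PROCESS 1-3 producing per_sample_results: tuple(sampleID, files) ...
    PROCESS_4_TERMINAL(per_sample_results)

    // once the terminal process has completed for a sample, delete its intermediate folders
    PROCESS_4_TERMINAL.out.done.subscribe { id ->
        file("intermediates/${id}").deleteDir()
    }
}
```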

lucacozzuto avatar Oct 02 '18 10:10 lucacozzuto

Quick comment: with this feature it would be possible to keep an instance of Nextflow running (with watchPath) without running into storage problems.

lucacozzuto avatar Oct 02 '18 11:10 lucacozzuto

So at a high level, I think I'm missing something. If the state data remains in files, the removal of old items is a good thing to do, but will this increase filesystem IO contention and locking as we increase the scale of analysis?

PeteClapham avatar Jul 23 '19 11:07 PeteClapham

Since each Nextflow task has its own work directory, and those directories would be deleted only when the data is not needed (read: accessed) any more, I don't see why there should be any IO contention on those files. Am I missing something?

pditommaso avatar Jul 26 '19 08:07 pditommaso

I was thinking that maybe a directive that allows the removal of input files when a process is finished would reduce the amount of space needed by a workflow. This should also allow removing the whole folders containing the input files, so that we reduce the number of folders too.

Of course this will not work if those files are needed by other processes.

Maybe with the new DSL2, where you have to define the graph explicitly, this can be achieved. If the cleaning conflicts with a workflow / process, an error can be triggered.

lucacozzuto avatar Sep 20 '19 10:09 lucacozzuto

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 27 '20 03:04 stale[bot]

I like the ideas in this thread. Automatic removal of "intermediate process files" would be great.

fmorency avatar Oct 23 '20 18:10 fmorency

This feature would be a game changer. As an example, one of our pipelines has to process ~10 GB of data but produces ~100 GB of temporary data, and as a result the bottleneck is not CPU or memory but disk space. This severely limits the throughput of our lab and results in poor utilization of processing power.

olavurmortensen avatar Nov 09 '20 11:11 olavurmortensen

I'm running into this with a pipeline that has similar characteristics to what @olavurmortensen is describing: the temporary files produced by one tool are very large, so while this workflow's output is maybe a couple hundred gigs, it will need something like 7,000+ GB of disk space during execution.

That said, is there any reason that temporary file cleanup isn't the purview of the process's script? There are several ways to delete anything that doesn't match a specific pattern in bash, thereby removing all temporary files except the known inputs/outputs.
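For example, something along these lines inside a process (a rough sketch; the tool invocation and filename patterns are placeholders to adapt):

```groovy
process ASSEMBLE {
    input:
    path reads

    output:
    path 'assembly.fasta'

    script:
    """
    some_assembler --reads ${reads} --out assembly.fasta --tmpdir tmp/
    # drop the tool's temporary data before the task ends;
    # staged inputs are symlinks and the .command* files are needed by Nextflow, so leave them alone
    rm -rf tmp/
    find . -maxdepth 1 -type f ! -name 'assembly.fasta' ! -name '.command*' -delete
    """
}
```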

jvivian-atreca avatar Nov 23 '20 22:11 jvivian-atreca

I agree that this feature would be extremely powerful 👍🏻

Maybe also worth noting that it would be good to keep the ability to not auto-clean the intermediate files as well. For example, a common(ish) use case for us in @nf_core is to resume a pipeline with different parameters, e.g. use a different aligner but still use the cached steps for the initial preprocessing. But the majority of the time the auto-cleaning would be what most users want, I think 🧹🧽

ewels avatar Dec 23 '20 09:12 ewels

I am in the same situation, running a workflow with thousands of samples that would need dozens of TB. Lots of BAM files are needed only by 2 steps and then not anymore, and could be deleted. This feature would be very valuable 👍🏻

fredericlemoine avatar Mar 03 '21 15:03 fredericlemoine

Just a note that this issue came up again on the nf-core Slack and @spficklin has even implemented it at workflow level in https://github.com/SystemsGenetics/GEMmaker

So there is definitely a need for this.

ewels avatar Apr 14 '21 15:04 ewels

To add to @ewels' comment: for our GEMmaker workflow (https://github.com/SystemsGenetics/GEMmaker), we needed a workflow that could process 26 thousand RNA-Seq samples from NCBI SRA, and we just didn't have the storage to deal with all of the intermediate files (and we had a dedicated 600TB storage system). I think this is really a critical issue for Nextflow because it prevents workflows from massively scaling. Once you hit your storage limit you're done, and if you're using the cloud you incur unnecessary storage costs. The only workaround is to run the workflow separately for separate groups of samples so as not to overrun storage limits, and that's just very cumbersome with massive numbers of samples.

The solution was really two-fold. In order to not overrun storage we had to first batch and then clean. We found that Nextflow tended to run "horizontally" (for lack of a better word) rather than "vertically": it tended to run all samples through step 1 before it would move on to step 2. So, even if we did clean intermediate files a few steps later, we would still overrun storage because we had intermediate files for 26K samples from earlier steps.

To batch samples, we had to implement a folder system where initially we had a metadata file for each sample in a stage folder. The workflow moves a subset (equal to the number of cores specified) into a processing folder and only works on samples with a metadata file in the processing folder. Once a sample is fully complete, its file moves from processing to a done folder and a new sample file is added to the processing folder, which the workflow sees and starts processing.

To clean intermediate files we had to trick Nextflow into thinking the files were still there, by wiping the intermediate files and replacing them with "sparse" files. Essentially we replace each file with an empty version, but because the file is "sparse" it still reports the same size (while not actually consuming space on the file system). We have a little Bash script to do that (https://github.com/SystemsGenetics/GEMmaker/blob/master/bin/clean_work_files.sh).
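The core idea, very roughly (a simplified sketch of what the linked script does, not a drop-in replacement; the process and variable names are illustrative):

```groovy
// Hypothetical cleanup process; the real logic lives in GEMmaker's clean_work_files.sh
process CLEAN_WORK_FILES {
    input:
    val target   // absolute path of an intermediate file whose consumers have all finished

    script:
    """
    size=\$(stat -c %s "${target}")
    mtime=\$(stat -c %y "${target}")
    rm "${target}"
    truncate -s "\$size" "${target}"   # sparse file: same apparent size, ~no blocks on disk
    touch -d "\$mtime" "${target}"     # restore the timestamp so cache checks still match
    """
}
```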

I'm not necessarily advocating that Nextflow follow this approach. It adds complexity to the workflow, with a bunch of channels used just for signaling. But I describe it in case others want to borrow this idea until there is an official fix. I also wanted to point out the need to batch before cleaning, because cleaning alone doesn't necessarily solve the storage problem as the workflow runs.

spficklin avatar Apr 14 '21 18:04 spficklin

Reminder for my future self so that I don't lose this again: https://github.com/nextflow-io/nextflow/issues/649 - there is an existing option, cleanup = true, that can be set at the base of a config and will wipe all intermediate work folders if the workflow exits successfully. This is (currently) undocumented.
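For reference, it's just a top-level setting in the config (and, as noted below, it only runs once the whole workflow has completed successfully):

```groovy
// nextflow.config
cleanup = true   // delete all task work directories after a successful run
```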

Note that it's not the same as the feature requested in this issue, as it runs once the pipeline is complete and not as it goes along. But still maybe relevant for some people ending up here from Google.

ewels avatar May 27 '21 11:05 ewels

Thanks for sharing that @ewels , this is pretty handy.

In case you have tried this option, does it affect the resume functionality?

abhi18av avatar May 27 '21 11:05 abhi18av

In case you have tried this option, does it affect the resume functionality?

I haven't tried it but @jfy133 did. And yes, it wipes all intermediate files, so resume is definitely dead 😅

ewels avatar May 27 '21 11:05 ewels

Thanks for sharing this!

lucacozzuto avatar May 27 '21 11:05 lucacozzuto

Note that whilst related, #2135 and the cleanup = true config are different from what was requested here. This issue was originally about deleting intermediate files during pipeline execution, after the downstream tasks requiring those files are complete.

ewels avatar Oct 06 '21 04:10 ewels

Indeed

pditommaso avatar Oct 06 '21 07:10 pditommaso

I was thinking about this some time ago. I think one of the problems will always be that Nextflow runs the processes in a horizontal way. If we have 10 steps, it is likely we will run the first one in parallel for all samples, and we will reach the last process, where cleaning is allowed, only at the very end. So I would link this problem to the possibility of running a pipeline vertically in batches (i.e. you read XXX samples, move them through the other steps, and only when the workflow is complete for an input file do you trigger a new execution).

lucacozzuto avatar Oct 06 '21 09:10 lucacozzuto

@lucacozzuto yes! We've hit on that as a problem in our workflow and had to employ a hacky solution, just like you suggested, to get around it (see my post above). I want to add my agreement with your comment.

spficklin avatar Oct 06 '21 21:10 spficklin

A possible trick could be allowing a new directive for making batches. You would indicate that process X can consume N items from the input channel and then pause. When the other processes are finished, there is a cleanup and a new start. We could use something similar to storeDir to avoid recalculating something useful each time. The only problem would be if we have a process that needs to be triggered when all the batches are processed...
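To make the idea concrete, such a directive might look something like this (purely hypothetical syntax, not existing Nextflow; storeDir is the only real directive here):

```groovy
process ALIGN {
    // hypothetical directive: take at most 50 items from the input channel,
    // then pause until downstream processing and cleanup have caught up
    batch 50

    // real directive: keep outputs worth reusing in a permanent location
    storeDir 'cache/align'

    input:
    path reads

    output:
    path '*.bam'

    script:
    """
    aligner ${reads} > ${reads.baseName}.bam
    """
}
```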

lucacozzuto avatar Oct 06 '21 21:10 lucacozzuto

pditommaso avatar Oct 07 '21 07:10 pditommaso

Ehehehe, I'll bring you more coffee because of that

lucacozzuto avatar Oct 07 '21 08:10 lucacozzuto

Just for my info, is there any prospect of this being addressed in the near future?

The inability to instantly delete intermediates (as e.g. Snakemake can do) is hitting us hard right now due to some stricter quotas, and our workflows are complex enough already without workarounds like https://github.com/nextflow-io/nextflow/issues/452#issuecomment-819733868.

pinin4fjords avatar Nov 08 '21 09:11 pinin4fjords

Same here, I must admit. The overhead for some nf-core pipelines is 10x the raw data, which is a lot for our HPC environment.

lescai avatar Nov 08 '21 09:11 lescai

Same problem here: I currently need to run nf-core/rnaseq on 700 samples but can't efficiently deal with the temporary files' space requirements. I'll implement the batch & clean approach suggested by @spficklin (thank you so much for sharing the scripts!)

ChiBia avatar Dec 16 '21 08:12 ChiBia

Hi @ChiBia, we have this infrastructure built into GEMmaker (https://gemmaker.readthedocs.io/en/latest/). Our workflow does not have as many options as the nf-core/rnaseq workflow, but it may save you the time of implementing your own RNA-seq workflow.

spficklin avatar Dec 16 '21 15:12 spficklin