Shuffle files should get deleted immediately after job finishes by default

Open andygrove opened this issue 2 years ago • 4 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I am trying to run automated benchmarks and keep running out of disk space.

Describe the solution you'd like: I would like shuffle files and other temp files to be deleted immediately after a job finishes by default.

Describe alternatives you've considered: None

Additional context: None

andygrove, Jan 14 '23 20:01

There is one exception: the final query results are also a kind of shuffle file, and we cannot delete those after the job finishes. I think we need a way to distinguish the shuffle files generated by the intermediate stages from those generated by the final result stage.

@thinkharderdev @yahoNanJing Do you have any ideas?

mingmwang, Feb 09 '23 02:02

It might also be better to remove files after the next stage finishes instead of waiting for the job to finish. That should help with disk consumption for very large jobs.

Dandandan, Feb 09 '23 07:02

> It might also be better to remove files after the next stage finishes instead of waiting on job to finish? Should help with disk consumption for very large jobs.

I would prefer to keep those files until the whole job finishes. In some cases a stage or task needs to be retried, for example when an executor is lost, and some already finished stages then have to recompute the missing files.

mingmwang, Feb 09 '23 07:02

> I think we need a way to differ the shuffle files generated by the intermediate stages and the final result stage.

I think this should be straightforward in principle, since we know from the final execution graph whether each partition is a final output partition or not.

thinkharderdev, Feb 09 '23 12:02
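
To make the proposal in this thread concrete, here is a minimal Rust sketch of job-level cleanup that keeps the final result stage and deletes only intermediate shuffle output. It is an illustration, not Ballista's actual implementation: `StageInfo`, its fields, `dirs_to_delete_after_job`, and `cleanup_job` are hypothetical names, and the sketch assumes that each stage writes its shuffle files into its own directory and that a stage with no downstream consumers is the final result stage.

```rust
use std::collections::HashMap;
use std::io::ErrorKind;
use std::path::PathBuf;

/// Hypothetical, simplified view of one stage in a job's execution graph.
#[derive(Debug, Clone)]
pub struct StageInfo {
    pub stage_id: usize,
    /// IDs of the stages that read this stage's shuffle output.
    /// Assumption: this list is empty only for the final result stage.
    pub output_links: Vec<usize>,
    /// Directory holding this stage's shuffle files (assumed layout).
    pub shuffle_dir: PathBuf,
}

impl StageInfo {
    /// A stage with no downstream consumers produces the query results.
    pub fn is_final(&self) -> bool {
        self.output_links.is_empty()
    }
}

/// Returns the shuffle directories that can be deleted once the whole job
/// has finished: every stage's directory except the final stage's.
/// The result is sorted so that the order is deterministic.
pub fn dirs_to_delete_after_job(stages: &HashMap<usize, StageInfo>) -> Vec<PathBuf> {
    let mut dirs: Vec<PathBuf> = stages
        .values()
        .filter(|stage| !stage.is_final())
        .map(|stage| stage.shuffle_dir.clone())
        .collect();
    dirs.sort();
    dirs
}

/// Deletes intermediate shuffle directories after the job has finished.
/// Directories that are already gone are ignored.
pub fn cleanup_job(stages: &HashMap<usize, StageInfo>) -> std::io::Result<()> {
    for dir in dirs_to_delete_after_job(stages) {
        match std::fs::remove_dir_all(&dir) {
            Ok(()) => {}
            Err(e) if e.kind() == ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```

Calling `cleanup_job` only after the job completes matches the preference stated above to keep intermediate files available for stage retries. Deleting a stage's files as soon as its consumers finish would free disk space earlier, at the cost of recomputing more work if an executor is lost.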