datafusion-ballista
Shuffle files should get deleted immediately after job finishes by default
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I am trying to run automated benchmarks and keep running out of disk space.
Describe the solution you'd like
I would like shuffle files and other temp files to be deleted immediately after a job finishes by default.
Describe alternatives you've considered
None
Additional context
None
There is one exception: the final query results are also stored as shuffle files, so we cannot delete those after the job finishes. I think we need a way to distinguish the shuffle files generated by the intermediate stages from those of the final result stage.
@thinkharderdev @yahoNanJing Do you have any ideas?
It might also be better to remove a stage's files once the next stage finishes instead of waiting for the whole job to finish. That should help with disk consumption for very large jobs.
I would prefer to keep those files until the whole job finishes. Some stages or tasks may need to be retried, for example when executors are lost, and in that case already-finished stages have to recompute the missing files.
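To make the trade-off between the two proposals concrete, here is a minimal sketch of the deletion rule each one implies. All names (`CleanupPolicy`, `JobState`, `can_delete`) are hypothetical and made up for illustration; none of them are existing Ballista types or APIs.

```rust
/// Hypothetical job states, reduced to what the cleanup decision needs.
#[derive(Clone, Copy, PartialEq, Debug)]
enum JobState {
    Running,
    Finished,
}

/// Hypothetical cleanup policies matching the two proposals in this thread.
#[derive(Clone, Copy, PartialEq, Debug)]
enum CleanupPolicy {
    /// Delete a stage's shuffle files as soon as every consuming stage is done.
    AfterConsumersFinish,
    /// Keep all shuffle files until the whole job has finished.
    AfterJobFinishes,
}

/// Decide whether the shuffle files of one stage may be deleted.
/// Files of the final stage are never deleted here, because they hold
/// the query results that the client still has to fetch.
fn can_delete(
    policy: CleanupPolicy,
    job_state: JobState,
    consumers_done: bool,
    is_final_stage: bool,
) -> bool {
    if is_final_stage {
        return false;
    }
    match policy {
        CleanupPolicy::AfterConsumersFinish => consumers_done,
        CleanupPolicy::AfterJobFinishes => job_state == JobState::Finished,
    }
}
```

With `AfterConsumersFinish`, files can disappear while the job is still running, so a consuming stage that has to be rerun after an executor loss would force its upstream stages to be recomputed as well. `AfterJobFinishes` avoids that at the cost of higher peak disk usage.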
> I think we need a way to distinguish the shuffle files generated by the intermediate stages from those of the final result stage.
I think this should be straightforward in principle, since we know from the final execution graph whether each partition is a final output partition.
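As a rough sketch of that idea, assuming each stage in the finished execution graph records which stages consume its output: a stage with no consumers is the final result stage, and everything else is intermediate. The `Stage` struct and its fields below are a simplified, hypothetical model for illustration, not the actual Ballista execution graph types.

```rust
use std::path::PathBuf;

/// Hypothetical, simplified view of one stage in a finished job's execution graph.
struct Stage {
    /// Identifier of this stage within the job.
    stage_id: usize,
    /// IDs of the stages that read this stage's shuffle output.
    /// Empty for the final result stage.
    output_links: Vec<usize>,
    /// Shuffle files written by this stage.
    shuffle_files: Vec<PathBuf>,
}

/// Collect the shuffle files that are safe to delete once the job has finished:
/// everything written by a stage whose output is consumed by another stage.
/// Files of the final stage are left in place so the client can fetch results.
fn intermediate_shuffle_files(stages: &[Stage]) -> Vec<PathBuf> {
    stages
        .iter()
        .filter(|stage| !stage.output_links.is_empty())
        .flat_map(|stage| stage.shuffle_files.iter().cloned())
        .collect()
}
```

The returned paths could then be handed to whatever cleanup routine the executors use; the files of the final stage would need a separate expiry, for example once the client has fetched the results.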