[EPIC] Improve job data cleanup logic
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
There are several improvement points in the current job-data deletion flow:

- **Duplication**: the executor has `clean_all_shuffle_data` alongside other ad-hoc removal logic. These overlap in functionality, making the code harder to maintain and reason about.
- **Push-based broadcast**: when the scheduler initiates cleanup, it currently notifies all executors. This is inefficient because only a subset of executors actually hold the job’s data.
- **Per-job deletion tasks**: in `clean_up_successful_job`/`clean_up_failed_job`, the scheduler spawns a separate delayed task (sleep) for each job and calls `state.remove_job(job_id)` individually. This results in many small tasks and RPCs, which could be batched more efficiently.
**Describe the solution you'd like**

Unify cleanup behind a single, testable “deletion facility”:

- Deduplicate logic with `clean_all_shuffle_data`; extract/keep a shared async remover (e.g., `remove_job_dir`) with safety checks.
- Targeted push: notify only executors that actually hold the job’s data (no broadcast).
- Batching: we already dispatch periodically; change each tick to send one batched `remove_jobs(Vec<JobId>)` for all pending IDs rather than spawning per-job sleeps and individual removals. (A rough sketch of the shared remover and batched call follows this list.)
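To make this concrete, here is a minimal sketch of what the shared remover and the batched executor-side call could look like. The function names, signatures, and safety checks are assumptions for illustration, not the existing Ballista code:

```rust
// Illustrative sketch only: names and signatures are assumptions, not the current API.
use std::io::{Error, ErrorKind};
use std::path::Path;
use tokio::fs;

/// Shared async remover used by every cleanup path (the logic currently
/// duplicated around clean_all_shuffle_data). Rejects suspicious job IDs so
/// it can never delete anything outside the executor's work directory.
async fn remove_job_dir(work_dir: &Path, job_id: &str) -> std::io::Result<()> {
    if job_id.is_empty()
        || job_id == "."
        || job_id == ".."
        || job_id.contains('/')
        || job_id.contains('\\')
    {
        return Err(Error::new(ErrorKind::InvalidInput, "suspicious job id"));
    }
    let job_dir = work_dir.join(job_id);
    match fs::remove_dir_all(&job_dir).await {
        // Already gone is fine: cleanup should be idempotent.
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(()),
        other => other,
    }
}

/// Executor-side handler for a batched request: one RPC carries all job IDs
/// collected during a scheduler tick, so no per-job delayed task is needed.
async fn remove_jobs(work_dir: &Path, job_ids: &[String]) -> std::io::Result<()> {
    for job_id in job_ids {
        remove_job_dir(work_dir, job_id).await?;
    }
    Ok(())
}
```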
**Describe alternatives you've considered**

**Additional context**
Related: #1219, #1314
I’d be happy to work on this as a follow-up, if this direction sounds good.
I think you could make an epic in this area
There are a few things I would like to propose regarding files:
- Shuffle files are absolute paths; I would like to make them relative to the Flight store directory (improves security).
- I would like to change how files are organised: instead of `job_id/shuffle...`, I would like to prefix files with `session_id/job_id/shuffle...`, `session_id/cache/`, or `session_id/job_id/operator_state...`. I would like to be able to delete any of them, but also delete all files for a given `session_id` when the client disconnects (a hypothetical sketch of this layout follows the list).
- Should we make the Arrow Flight server the owner of files, and expose actions on them to delete files?
- #558
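To make the layout concrete, here is a hypothetical sketch of how such session-prefixed relative paths could be built; the type and function names are invented for illustration and do not exist in the codebase:

```rust
// Hypothetical sketch of the proposed layout, rooted at the Flight store directory.
use std::path::PathBuf;

/// The kinds of per-session artifacts mentioned above.
enum SessionFile<'a> {
    Shuffle { job_id: &'a str, file: &'a str },
    Cache { file: &'a str },
    OperatorState { job_id: &'a str, file: &'a str },
}

/// Build a path relative to the Flight store root, prefixed by session_id, so that
/// removing `<root>/<session_id>` on disconnect drops everything the session produced,
/// while `<root>/<session_id>/<job_id>` drops a single job's files.
fn relative_path(session_id: &str, file: SessionFile<'_>) -> PathBuf {
    let mut path = PathBuf::from(session_id);
    match file {
        SessionFile::Shuffle { job_id, file } => {
            path.push(job_id);
            path.push("shuffle");
            path.push(file);
        }
        SessionFile::Cache { file } => {
            path.push("cache");
            path.push(file);
        }
        SessionFile::OperatorState { job_id, file } => {
            path.push(job_id);
            path.push("operator_state");
            path.push(file);
        }
    }
    path
}
```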
wdyt @KR-bluejay?
Thanks for the great ideas. As I'm still getting familiar with the internals of Ballista, I might be missing some key points, but I wanted to share a few thoughts that came to mind:
- **Relative paths under Flight store**: I like the idea, with one caveat: if we switch to relative paths, we should harden against path traversal (e.g. `../..`). A central `PathResolver` that joins with the Flight store root, canonicalizes, and rejects anything escaping the root (plus explicit checks like `starts_with(root)`) would make this safe (see the sketch after this list).
- **Delete-on-client-disconnect**: it sounds useful, but I think it should be configurable and possibly gated by a grace period (and/or liveness recheck). In practice clients can drop due to transient network issues; auto-deleting immediately might be risky. Spark has similar knobs around cleanup; having a setting like `delete_on_disconnect = off|immediate|grace(N seconds)` with a scope of `session_only | session_and_job_state | everything` could help operators choose the right behavior. Also, for streaming-style clients that “fire-and-forget” and await async responses, immediate deletion might be surprising. (A rough sketch of these knobs also follows the list.)
- **Flight as file owner (Actions)**: the idea is interesting, but I would not adopt it now. Ballista’s strength should come from parallel/distributed performance, and introducing a new centralized owner for deletion at this stage risks adding coordination overhead and bottlenecks that could actually hurt scalability. Given current resources, fully reworking the ownership model also feels too heavy to justify at this point. For now I think we should keep cleanup decentralized and focus on making the scheduler + executor paths straightforward and robust (deduplication, batching, safe relative paths). Once the system is proven solid and we have more clarity, we could revisit whether centralizing deletion control in Flight brings enough value to offset the added complexity.
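A rough sketch of the `PathResolver` idea, assuming a join, canonicalize, and prefix-check flow; nothing here exists in Ballista today and all names are illustrative:

```rust
// Rough, hypothetical sketch of the PathResolver idea; not existing code.
use std::io::{Error, ErrorKind, Result};
use std::path::{Path, PathBuf};

/// Resolves client-supplied relative paths against the Flight store root and
/// rejects anything that escapes it (e.g. via `..` components or symlinks).
struct PathResolver {
    root: PathBuf,
}

impl PathResolver {
    fn new(root: impl Into<PathBuf>) -> Result<Self> {
        // Canonicalize the root once so later prefix checks are meaningful.
        Ok(Self { root: root.into().canonicalize()? })
    }

    fn resolve(&self, relative: &str) -> Result<PathBuf> {
        // Reject absolute paths outright: `join` would replace the root.
        if Path::new(relative).is_absolute() {
            return Err(Error::new(ErrorKind::InvalidInput, "absolute path rejected"));
        }
        let joined = self.root.join(relative);
        // Canonicalize to collapse `..` and resolve symlinks before checking.
        // (Canonicalization requires the path to exist, which is fine for deletion.)
        let canonical = joined.canonicalize()?;
        if !canonical.starts_with(&self.root) {
            return Err(Error::new(ErrorKind::PermissionDenied, "path escapes Flight store root"));
        }
        Ok(canonical)
    }
}
```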
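And, purely to illustrate the shape of the delete-on-disconnect knobs above (all names invented for the example):

```rust
// Invented names, just to show the shape of the configuration described above.
use std::time::Duration;

/// When session files are removed after a client disconnect.
enum DeleteOnDisconnect {
    Off,
    Immediate,
    /// Wait this long (and optionally recheck liveness) before deleting,
    /// so a transient network drop does not destroy shuffle data.
    Grace(Duration),
}

/// What the disconnect cleanup is allowed to touch.
enum DisconnectCleanupScope {
    SessionOnly,
    SessionAndJobState,
    Everything,
}

struct CleanupConfig {
    delete_on_disconnect: DeleteOnDisconnect,
    scope: DisconnectCleanupScope,
}
```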
> **Flight as file owner (Actions)**: The idea is interesting, but I would not adopt it now. Ballista’s strength should come from parallel/distributed performance. However, introducing a new centralized owner for deletion at this stage risks adding coordination overhead and bottlenecks that could actually hurt scalability. Given current resources, fully reworking the ownership model also feels too heavy to justify at this point.
I did not mean to have centralised file storage; what I meant is that the Flight store at each executor is the logical file owner, so we split job-related operations and file-related operations.
Got it, thanks for clarifying! That makes more sense.
Splitting responsibilities so that the scheduler focuses on job operations, while Flight store owns its local files, sounds reasonable to me.
Is there any part of this direction where you’d like help or contributions?
My first priority is #1315, which may touch the Flight store and shuffle reader/writer. Once I manage to merge that one, we can discuss how I can help.
Thanks for sharing, I’ll wait for that then!
Please do start with the tasks you'd like to progress, like file paths and deduplication. I don't think #1315 will have a big merge impact.
You could use this issue as an epic, where you could list the tasks you are working on or plan to work on.
Got it, that makes sense.
Then it might be better to treat this issue as an epic, and split out smaller issues/PRs for each task
(e.g. file path handling, deduplication, batching).
That way progress is clearer and easier to track.