
[EPIC] Improve job data cleanup logic

Open KR-bluejay opened this issue 3 months ago • 11 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There are several improvement points in the current job-data deletion flow:

  1. Duplication
    The executor has clean_all_shuffle_data alongside other ad-hoc removal logic. These overlap in functionality, making the code harder to maintain and reason about.

  2. Push-based broadcast
    When the scheduler initiates cleanup, it currently notifies all executors. This is inefficient because only a subset of executors actually hold the job’s data.

  3. Per-job deletion tasks
    In clean_up_successful_job / clean_up_failed_job, the scheduler spawns a separate delayed task (sleep) for each job and calls state.remove_job(job_id) individually. This results in many small tasks and RPCs, which could be batched more efficiently.

Describe the solution you'd like

Unify cleanup behind a single, testable “deletion facility”:

  1. Deduplicate logic with clean_all_shuffle_data; extract/keep a shared async remover (e.g., remove_job_dir) with safety checks.
  2. Targeted push: notify only executors that actually hold the job’s data (no broadcast).
  3. Batching: we already dispatch periodically; change each tick to send one batched remove_jobs(Vec<JobId>) for all pending IDs rather than spawning per-job sleeps and individual removals.
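The batching in point 3 could be sketched as follows. This is a minimal illustration, not existing Ballista code: `CleanupQueue` and the batched `remove_jobs(Vec<JobId>)` RPC are hypothetical names.

```rust
use std::collections::BTreeSet;

// Illustrative job identifier; Ballista's actual type may differ.
type JobId = String;

#[derive(Default)]
struct CleanupQueue {
    pending: BTreeSet<JobId>,
}

impl CleanupQueue {
    /// Called from clean_up_successful_job / clean_up_failed_job instead of
    /// spawning a per-job delayed task.
    fn enqueue(&mut self, job_id: JobId) {
        self.pending.insert(job_id);
    }

    /// Called once per periodic tick; the result feeds a single batched
    /// remove_jobs(Vec<JobId>) call instead of one RPC per job.
    fn drain_batch(&mut self) -> Vec<JobId> {
        std::mem::take(&mut self.pending).into_iter().collect()
    }
}
```

Since we already run a periodic dispatch loop, each tick only needs one `drain_batch()` followed by one RPC; the set also deduplicates a job that gets queued twice.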

Describe alternatives you've considered

Additional context

Related: #1219, #1314

KR-bluejay avatar Sep 11 '25 10:09 KR-bluejay

I’d be happy to work on this as a follow-up, if this direction sounds good.

KR-bluejay avatar Sep 11 '25 10:09 KR-bluejay

I think you could make an epic in this area

There are a few things I would like to propose regarding files:

  1. Shuffle files are absolute paths; I would like to make them relative to the Flight store directory (improves security)
  2. I would like to change how files are organised: instead of having job_id/shuffle... I would like to prefix files with session_id/job_id/shuffle... or session_id/cache/ or session_id/job_id/operator_state ... I would like to be able to delete any of them, but also delete all files for a given session_id when the client disconnects
  3. Should we make the Arrow Flight server the owner of files, and expose actions on it to delete files?
  4. #558
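The layout proposed in point 2 could be sketched with a pair of prefix helpers (hypothetical names, not existing Ballista APIs): because every per-session artifact shares the session_id prefix, deleting everything for a session becomes a single prefix removal.

```rust
use std::path::PathBuf;

/// session_id/job_id/shuffle — shuffle data for one job of one session.
/// Illustrative helper, not an existing Ballista function.
fn shuffle_prefix(session_id: &str, job_id: &str) -> PathBuf {
    PathBuf::from(session_id).join(job_id).join("shuffle")
}

/// session_id/ — removing this prefix drops every file for the session
/// (shuffle data, cache, operator state) when the client disconnects.
fn session_prefix(session_id: &str) -> PathBuf {
    PathBuf::from(session_id)
}
```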

wdyt @KR-bluejay ?

milenkovicm avatar Sep 11 '25 10:09 milenkovicm

Thanks for the great ideas. As I'm still getting familiar with the internals of Ballista, I might be missing some key points, but I wanted to share a few thoughts that came to mind.

  1. Relative paths under Flight store
    I like the idea, with one caveat: if we switch to relative paths, we should harden against path traversal (e.g. ../..). A central PathResolver that joins with the Flight store root, canonicalizes, and rejects anything escaping the root (plus explicit checks like starts_with(root)) would make this safe.

  2. Delete-on-client-disconnect
    It sounds useful, but I think it should be configurable and possibly gated by a grace period (and/or liveness recheck). In practice clients can drop due to transient network issues; auto-deleting immediately might be risky. Spark has similar knobs around cleanup; settings like:

    • delete_on_disconnect = off|immediate|grace(N seconds)
    • scope: session_only | session_and_job_state | everything

    could help operators choose the right behavior. Also, for streaming-style clients that “fire-and-forget” and await async responses, immediate deletion might be surprising.
  3. Flight as file owner (Actions)
    The idea is interesting, but I would not adopt it now. Ballista’s strength should come from parallel/distributed performance. However, introducing a new centralized owner for deletion at this stage risks adding coordination overhead and bottlenecks that could actually hurt scalability.
    Given current resources, fully reworking the ownership model also feels too heavy to justify at this point.
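The traversal guard described in point 1 could be sketched as below. This is a minimal sketch, not Ballista code: `resolve_under_root` is a hypothetical name, and it uses lexical normalization rather than filesystem canonicalization so it also works for paths that don't exist yet.

```rust
use std::path::{Component, Path, PathBuf};

/// Resolve a relative path against the Flight store root, rejecting
/// anything that could escape it (illustrative, not an existing API).
fn resolve_under_root(root: &Path, relative: &str) -> Option<PathBuf> {
    let mut resolved = root.to_path_buf();
    for component in Path::new(relative).components() {
        match component {
            // Plain path segments are appended under the root.
            Component::Normal(seg) => resolved.push(seg),
            // `.` is a no-op.
            Component::CurDir => {}
            // `..`, a leading `/`, or a Windows prefix could escape the
            // store directory, so reject the whole path.
            _ => return None,
        }
    }
    // Belt-and-braces check mirroring the explicit starts_with(root) idea.
    resolved.starts_with(root).then_some(resolved)
}
```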
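The knobs in point 2 might be modelled as the following types. These are purely illustrative; nothing like this exists in Ballista today.

```rust
use std::time::Duration;

/// When to delete a client's files after it disconnects.
#[derive(Debug, Clone, PartialEq)]
enum DeleteOnDisconnect {
    /// Never delete automatically.
    Off,
    /// Delete as soon as the client drops.
    Immediate,
    /// Wait for a grace period (and a liveness recheck) before deleting,
    /// so transient network drops don't destroy useful state.
    Grace(Duration),
}

/// How much state the cleanup covers.
#[derive(Debug, Clone, PartialEq)]
enum CleanupScope {
    SessionOnly,
    SessionAndJobState,
    Everything,
}
```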

KR-bluejay avatar Sep 11 '25 12:09 KR-bluejay

For now I think we should keep cleanup decentralized and focus on making scheduler + executor paths straightforward and robust (deduplication, batching, safe relative paths). Once the system is proven solid and we have more clarity, we could revisit whether centralizing deletion control in Flight brings enough value to offset the added complexity.

KR-bluejay avatar Sep 11 '25 12:09 KR-bluejay

  1. Flight as file owner (Actions) The idea is interesting, but I would not adopt it now. Ballista’s strength should come from parallel/distributed performance. However, introducing a new centralized owner for deletion at this stage risks adding coordination overhead and bottlenecks that could actually hurt scalability. Given current resources, fully reworking the ownership model also feels too heavy to justify at this point.

I did not mean to have centralised file storage. What I meant is that the Flight store at each executor is the logical file owner, so we split job-related operations from file-related operations.

milenkovicm avatar Sep 11 '25 12:09 milenkovicm

Got it, thanks for clarifying! — that makes more sense.
Splitting responsibilities so that the scheduler focuses on job operations, while Flight store owns its local files, sounds reasonable to me.

Is there any part of this direction where you’d like help or contributions?

KR-bluejay avatar Sep 11 '25 12:09 KR-bluejay

My first priority is #1315, which may touch the Flight store and shuffle reader/writer. Once I manage to merge that one, we can discuss how I can help.

milenkovicm avatar Sep 11 '25 12:09 milenkovicm

Thanks for sharing, I’ll wait for that then!

KR-bluejay avatar Sep 11 '25 12:09 KR-bluejay

Please do start with the tasks you'd like to progress, like file paths and deduplication. I don't think #1315 will have a big merge impact.

milenkovicm avatar Sep 11 '25 12:09 milenkovicm

You could use this issue as an epic, where you list the tasks you are working on or plan to work on.

milenkovicm avatar Sep 11 '25 12:09 milenkovicm

Got it, that makes sense.
Then it might be better to treat this issue as an epic, and split out smaller issues/PRs for each task (e.g. file path handling, deduplication, batching).

That way progress is clearer and easier to track.

KR-bluejay avatar Sep 11 '25 12:09 KR-bluejay