job data cleanup does not work if `pull-staged` strategy selected
**Describe the bug**
If the `pull-staged` strategy is selected, job data won't be cleared after the job finishes, leaving data hanging on the executors. This can lead to old shuffle files piling up on the executors. One way to mitigate this is to configure executors to clean up data more aggressively.
**To Reproduce**
Just run the default Ballista cluster setup.
**Expected behavior**
Shuffle files should be removed when the job finishes or when they are no longer needed.
**Additional context**
- The `push-based` strategy works as expected.
- It might be related to #1175.
- With the `pull-staged` strategy, the executor does not expose a gRPC service, so the scheduler cannot connect to the executor to perform data removal.
- We need to find a good approach to handle this, apart from the executor TTL.
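For reference, the executor-TTL workaround boils down to the executor periodically sweeping its local work directory and deleting anything older than a threshold, whether or not the data is still needed. A minimal sketch of such a sweep, assuming job data lives in per-job directories under a `work_dir` (the layout and TTL value here are illustrative, not Ballista's actual configuration):

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Remove job directories under `work_dir` whose last modification is older
/// than `ttl`. A TTL sweep never knows whether a job is finished; it only
/// knows the data is "old enough".
fn sweep_old_job_data(work_dir: &Path, ttl: Duration) -> std::io::Result<()> {
    let now = SystemTime::now();
    for entry in fs::read_dir(work_dir)? {
        let entry = entry?;
        let path = entry.path();
        if !path.is_dir() {
            continue;
        }
        let modified = entry.metadata()?.modified()?;
        if now.duration_since(modified).unwrap_or_default() > ttl {
            // Shuffle files for this job are past the TTL; drop the whole dir.
            fs::remove_dir_all(&path)?;
        }
    }
    Ok(())
}
```

The drawback is that the TTL has to be tuned: too short and long-running jobs can lose shuffle data they still need, too long and files pile up anyway, which is why a scheduler-driven cleanup would be preferable.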
Hi @milenkovicm, I would like to take this issue. In `PullStaged` mode, the scheduler can't call executors directly to clean job data.
Proposal: extend `PollWorkResult` with `CleanJobDataParams` so that executors receive the job IDs to clean in the next `poll_work` response.
```proto
message PollWorkResult {
  repeated TaskDefinition tasks = 1;
  repeated CleanJobDataParams cleanups = 2; // new field
}
```
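On the executor side, the idea is to walk the `cleanups` returned by each `poll_work` call and delete the corresponding job data locally. A rough sketch of that handling, with the generated protobuf types mirrored as plain structs and a hypothetical `<work_dir>/<job_id>` layout (the helper name and layout are assumptions, not the existing executor API):

```rust
use std::fs;
use std::path::Path;

// Hypothetical mirrors of the generated protobuf types shown above.
struct CleanJobDataParams {
    job_id: String,
}
struct PollWorkResult {
    // tasks: Vec<TaskDefinition>, omitted here
    cleanups: Vec<CleanJobDataParams>,
}

/// Delete the local shuffle data for every job id the scheduler asked us to clean.
fn handle_cleanups(work_dir: &Path, result: &PollWorkResult) {
    for cleanup in &result.cleanups {
        // Assumes job data lives under `<work_dir>/<job_id>`.
        let job_dir = work_dir.join(&cleanup.job_id);
        if job_dir.exists() {
            if let Err(e) = fs::remove_dir_all(&job_dir) {
                eprintln!("failed to remove data for job {}: {e}", cleanup.job_id);
            }
        }
    }
}
```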
If this sounds good, I’ll take the issue.
Hi @KR-bluejay, It does make sense
Thanks! I'll take it.
Hi @milenkovicm, I have implemented the pull-based cleanup, but I'm not sure about two things (currently the PR is still in draft: #1314):
- **Tests**
  The scheduler keeps the cleanup job list, and the executors fetch it via `poll_work`.
  I'm not sure how to properly test this flow.
  Do you have any recommendations or existing test patterns I should follow? (A rough sketch of the queue behaviour I have in mind is at the end of this comment.)
- **User-facing changes**
  As far as I can see, this only adds values during `poll_work` from the scheduler to the executor.
  I don't think there are any user-facing changes.
  Could you confirm if that's correct?
Thanks in advance for your advice!
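On the test side, what I have in mind is roughly to unit-test the scheduler-side bookkeeping in isolation: once a job is marked finished, its id should show up in (and be drained from) the next `poll_work` payload for each executor that ran it. A self-contained sketch of that idea (the `CleanupQueue` type and its methods are hypothetical, not the actual scheduler state API):

```rust
use std::collections::HashMap;

/// Hypothetical scheduler-side bookkeeping: job ids that each executor
/// still has to clean up, drained on the executor's next poll_work.
#[derive(Default)]
struct CleanupQueue {
    pending: HashMap<String, Vec<String>>, // executor_id -> job_ids
}

impl CleanupQueue {
    /// Called when a job finishes: remember it for every executor that ran its tasks.
    fn job_finished(&mut self, job_id: &str, executor_ids: &[String]) {
        for executor_id in executor_ids {
            self.pending
                .entry(executor_id.clone())
                .or_default()
                .push(job_id.to_string());
        }
    }

    /// Called while building a PollWorkResult: hand over and forget the pending ids.
    fn drain_for(&mut self, executor_id: &str) -> Vec<String> {
        self.pending.remove(executor_id).unwrap_or_default()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn finished_job_is_cleaned_exactly_once() {
        let mut queue = CleanupQueue::default();
        queue.job_finished("job-1", &["exec-a".to_string(), "exec-b".to_string()]);

        // The first poll after the job finished carries the cleanup request...
        assert_eq!(queue.drain_for("exec-a"), vec!["job-1".to_string()]);
        // ...and subsequent polls do not repeat it.
        assert!(queue.drain_for("exec-a").is_empty());
        // Executors that have not polled yet still have it pending.
        assert_eq!(queue.drain_for("exec-b"), vec!["job-1".to_string()]);
    }
}
```

This only exercises the scheduler side, so it wouldn't catch problems in how the executor handles the response.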
thanks for the pr @KR-bluejay
- i'm not sure, will have a look
- i don't think this is a user-facing change, does not matter much
will have a look at the pr in next few days, we can discuss then
Got it, thank you for the update! I'll wait for your feedback.
I believe this issue is related to #602