datafusion-ballista

job data cleanup does not work if `pull-staged` strategy selected

Open milenkovicm opened this issue 9 months ago • 1 comments

Describe the bug

If the pull-staged strategy is selected, job data is not cleared after the job finishes, leaving data behind on the executors. This can lead to old shuffle files piling up on the executors. One way to mitigate this is to configure executors to clean up data more aggressively.

To Reproduce

Just run the default Ballista cluster setup.

Expected behavior

Shuffle files should be removed when the job finishes or when they are no longer needed.

Additional context

  • push-based strategy works as expected
  • It might be related to #1175

milenkovicm avatar Apr 01 '25 08:04 milenkovicm

With the pull-staged strategy, the executor does not expose a gRPC service, so the scheduler cannot connect to the executor to perform data removal.

We need to find a good approach to handle this apart from the executor TTL.

milenkovicm avatar May 21 '25 20:05 milenkovicm

Hi @milenkovicm, I would like to take this issue. In PullStaged, the scheduler can’t call executors directly to clean job data.

Proposal: extend PollWorkResult with CleanJobDataParams so that executors receive job IDs to clean in the next poll_work response.

message PollWorkResult {
  repeated TaskDefinition tasks = 1;
  repeated CleanJobDataParams cleanups = 2; // new field
}
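To illustrate the scheduler side of this proposal, here is a minimal Rust sketch of queuing cleanup requests per executor and handing them over on the next poll. Everything in it is hypothetical: `PendingCleanups`, `enqueue`, and the simplified `PollWorkResult` and `CleanJobDataParams` structs are stand-ins invented for illustration, not the actual Ballista types or generated protobuf code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the protobuf `CleanJobDataParams` message.
#[derive(Debug, Clone, PartialEq)]
pub struct CleanJobDataParams {
    pub job_id: String,
}

/// Hypothetical stand-in for the extended `PollWorkResult`; tasks are omitted.
#[derive(Debug, Default)]
pub struct PollWorkResult {
    pub cleanups: Vec<CleanJobDataParams>,
}

/// Cleanup requests waiting for each executor's next poll (illustrative only).
#[derive(Debug, Default)]
pub struct PendingCleanups {
    by_executor: HashMap<String, Vec<CleanJobDataParams>>,
}

impl PendingCleanups {
    /// Record that `executor_id` should delete its data for `job_id`.
    pub fn enqueue(&mut self, executor_id: &str, job_id: &str) {
        self.by_executor
            .entry(executor_id.to_string())
            .or_default()
            .push(CleanJobDataParams {
                job_id: job_id.to_string(),
            });
    }

    /// Build the poll response, handing over and clearing that executor's queue.
    pub fn poll_work(&mut self, executor_id: &str) -> PollWorkResult {
        PollWorkResult {
            cleanups: self.by_executor.remove(executor_id).unwrap_or_default(),
        }
    }
}
```

Draining the queue on poll means each request is handed out at most once, so if a response is lost the cleanup is lost with it. In this sketch the executor TTL would remain the fallback for that case.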

If this sounds good, I’ll take the issue.

KR-bluejay avatar Sep 02 '25 07:09 KR-bluejay

Hi @KR-bluejay, it does make sense.

milenkovicm avatar Sep 02 '25 09:09 milenkovicm

Thanks! I'll take it.

KR-bluejay avatar Sep 02 '25 09:09 KR-bluejay

Hi @milenkovicm, I have implemented the pull-based cleanup (the PR, #1314, is still in draft mode), but I’m not sure about two things:

  1. Tests
    The scheduler keeps the cleanup job list, and the executors fetch it via poll_work.
    I’m not sure how to properly test this flow.
    Do you have any recommendations or existing test patterns I should follow?

  2. User-facing changes
    As far as I can see, this only adds values during poll_work from scheduler to executor.
    I don’t think there are any user-facing changes.
    Could you confirm if that’s correct?
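To make the testing question more concrete, one option might be to keep the executor-side handling as a plain function over a work directory, so it can be exercised against a temporary directory without a running cluster. The sketch below is illustrative only: `clean_job_data` is a hypothetical helper, and the `<work_dir>/<job_id>` layout is an assumption made for this example rather than something confirmed here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Remove `<work_dir>/<job_id>` for every job ID received from the scheduler.
/// The directory layout is an assumption made for this sketch.
/// Returns the number of job directories that were actually removed.
pub fn clean_job_data(work_dir: &Path, job_ids: &[String]) -> io::Result<usize> {
    let mut removed = 0;
    for job_id in job_ids {
        // Skip IDs that could resolve outside the work directory.
        let unsafe_id = job_id.is_empty()
            || job_id == "."
            || job_id == ".."
            || job_id.contains('/')
            || job_id.contains('\\');
        if unsafe_id {
            continue;
        }
        let job_dir = work_dir.join(job_id);
        // Unknown or already-cleaned jobs are ignored, not treated as errors.
        if job_dir.is_dir() {
            fs::remove_dir_all(&job_dir)?;
            removed += 1;
        }
    }
    Ok(removed)
}
```

A test could then create a couple of job directories under a temporary work directory, call the function with the IDs taken from a simulated poll response, and assert that only the listed directories are gone.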

Thanks in advance for your advice!

KR-bluejay avatar Sep 08 '25 06:09 KR-bluejay

thanks for the pr @KR-bluejay

  1. I'm not sure, will have a look
  2. I don't think this is a user-facing change, so it does not matter much

will have a look at the pr in next few days, we can discuss then

milenkovicm avatar Sep 08 '25 08:09 milenkovicm

Got it, thank you for the update! I'll wait for your feedback.

KR-bluejay avatar Sep 08 '25 08:09 KR-bluejay

I believe this issue is related to #602

milenkovicm avatar Sep 11 '25 09:09 milenkovicm