
[CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported

Open buska88 opened this issue 10 months ago • 11 comments


What changes were proposed in this pull request?

For non-barrier shuffle read stages, LifecycleManager#handleGetShuffleIdForApp always returns the appShuffleId, regardless of whether the fetch status is successful.

Why are the changes needed?

As described in the Jira ticket, if LifecycleManager only returns appShuffleIds whose fetch status is successful, the task fails directly with "there is no finished map stage associated with", even though the previously reported fetch-failure event may not be fatal. So give the task a chance to proceed.
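
To make the described behavior concrete, here is a rough Scala sketch of the idea; the class and method names are hypothetical and do not correspond to the actual LifecycleManager code or to this patch.

```scala
// Illustrative sketch only; names are hypothetical, not the real LifecycleManager.
import scala.collection.concurrent.TrieMap

case class ShuffleStatus(appShuffleId: Int, fetchFailed: Boolean)

class ShuffleIdResolver {
  // Spark shuffle id -> latest known status of the corresponding app shuffle id
  private val statuses = TrieMap.empty[Int, ShuffleStatus]

  def register(sparkShuffleId: Int, appShuffleId: Int): Unit =
    statuses.put(sparkShuffleId, ShuffleStatus(appShuffleId, fetchFailed = false))

  def reportFetchFailure(sparkShuffleId: Int): Unit =
    statuses.get(sparkShuffleId).foreach { s =>
      statuses.put(sparkShuffleId, s.copy(fetchFailed = true))
    }

  // Before this change: a shuffle with a reported fetch failure was not returned,
  // so readers failed immediately with "there is no finished map stage associated with".
  // After: for non-barrier read stages, return the appShuffleId even when a fetch
  // failure has already been reported, giving the reader a chance to proceed.
  def getShuffleIdForReader(sparkShuffleId: Int, isBarrierStage: Boolean): Option[Int] =
    statuses.get(sparkShuffleId) match {
      case Some(s) if !s.fetchFailed || !isBarrierStage => Some(s.appShuffleId)
      case _ => None
    }
}
```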

Does this PR introduce any user-facing change?

How was this patch tested?

buska88 avatar Feb 07 '25 06:02 buska88

According to your Jira ticket, "that shuffle fetch fails does not lead to stage fail because task speculation and another attempts succeed", I think the quoted scenario should not happen if you have PR #3080 and #2921.

Do you have these two PRs for your Celeborn client?

FMX avatar Feb 11 '25 08:02 FMX

> According to your Jira ticket, "that shuffle fetch fails does not lead to stage fail because task speculation and another attempts succeed", I think the quoted scenario should not happen if you have PR #3080 and #2921.
>
> Do you have these two PRs for your Celeborn client?

No. We mainly use branch-0.5. Would you please consider merging these two PRs into branch-0.5-rc? They are important features, and many users on branch-0.5 may need them.

buska88 avatar Feb 12 '25 12:02 buska88

> According to your Jira ticket, "that shuffle fetch fails does not lead to stage fail because task speculation and another attempts succeed", I think the quoted scenario should not happen if you have PR #3080 and #2921. Do you have these two PRs for your Celeborn client?

> No. We mainly use branch-0.5. Would you please consider merging these two PRs into branch-0.5-rc? They are important features, and many users on branch-0.5 may need them.

As for this PR, it occurs to me that when a task of a stage throws a fetch-failure exception, in the small window between the fetch failure being reported and the stage being aborted, subsequent read tasks can still get the appShuffleId and finish shuffle reading. Then, when the stage retries, those tasks may not need to be recomputed, which saves resources. If the subsequent tasks fail because they cannot get the appShuffleId, they will inevitably be recomputed in the next stage retry. I think this PR has only a tiny influence once we have the two PRs you mentioned, so it seems unnecessary.

buska88 avatar Feb 12 '25 12:02 buska88

> According to your Jira ticket, "that shuffle fetch fails does not lead to stage fail because task speculation and another attempts succeed", I think the quoted scenario should not happen if you have PR #3080 and #2921. Do you have these two PRs for your Celeborn client?

> No. We mainly use branch-0.5. Would you please consider merging these two PRs into branch-0.5-rc? They are important features, and many users on branch-0.5 may need them.

This PR can be helpful in branch-0.5 because branch-0.5 doesn't have PR #2921. Drafting a new RC for branch-0.5 can be a good idea.

FMX avatar Feb 13 '25 08:02 FMX

> According to your Jira ticket, "that shuffle fetch fails does not lead to stage fail because task speculation and another attempts succeed", I think the quoted scenario should not happen if you have PR #3080 and #2921. Do you have these two PRs for your Celeborn client?

> No. We mainly use branch-0.5. Would you please consider merging these two PRs into branch-0.5-rc? They are important features, and many users on branch-0.5 may need them.

> This PR can be helpful in branch-0.5 because branch-0.5 doesn't have PR #2921. Drafting a new RC for branch-0.5 can be a good idea.

Does PR #2921 also need to be backported to branch-0.5? @FMX

gaoyajun02 avatar Feb 13 '25 11:02 gaoyajun02

> According to your Jira ticket, "that shuffle fetch fails does not lead to stage fail because task speculation and another attempts succeed", I think the quoted scenario should not happen if you have PR #3080 and #2921. Do you have these two PRs for your Celeborn client?

> No. We mainly use branch-0.5. Would you please consider merging these two PRs into branch-0.5-rc? They are important features, and many users on branch-0.5 may need them.

> This PR can be helpful in branch-0.5 because branch-0.5 doesn't have PR #2921. Drafting a new RC for branch-0.5 can be a good idea.

> Does PR #2921 also need to be backported to branch-0.5? @FMX

There are too many conflicts between PR #2921 and branch-0.5, so that PR has not been backported.

FMX avatar Feb 17 '25 03:02 FMX

@buska88 It would be better to check the validity of a shuffle ID after you get it, instead of using a shuffle ID that is marked as invalid.

You can add the check here https://github.com/apache/celeborn/blob/6a836f9523b7eb03d3eff4bf1c23349bf8270064/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java#L612-L630

FMX avatar Feb 17 '25 03:02 FMX
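
For readers following along, here is a minimal sketch of the kind of validity check being suggested. It is written in Scala for consistency with the other examples in this thread even though ShuffleClientImpl itself is Java, and all names below (ShuffleIdLookup, invalidShuffleIds, resolveShuffleId) are hypothetical rather than actual ShuffleClientImpl fields or methods.

```scala
// Hypothetical sketch of "check validity after resolving the shuffle id".
import java.io.IOException
import scala.collection.mutable

class ShuffleIdLookup {
  // App shuffle ids that have been marked invalid (e.g. after a reported fetch failure)
  private val invalidShuffleIds = mutable.Set.empty[Int]

  def markInvalid(appShuffleId: Int): Unit = invalidShuffleIds += appShuffleId

  // Resolve the shuffle id from the LifecycleManager, then validate it before use.
  def resolveShuffleId(fetchFromLifecycleManager: () => Int): Int = {
    val appShuffleId = fetchFromLifecycleManager()
    if (invalidShuffleIds.contains(appShuffleId)) {
      // Surface the problem instead of silently reading a shuffle that is known
      // to be invalid; the caller decides how to react (see the discussion below).
      throw new IOException(s"Shuffle id $appShuffleId is marked as invalid")
    }
    appShuffleId
  }
}
```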

> @buska88 It would be better to check the validity of a shuffle ID after you get it, instead of using a shuffle ID that is marked as invalid.
>
> You can add the check here:
> https://github.com/apache/celeborn/blob/6a836f9523b7eb03d3eff4bf1c23349bf8270064/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java#L612-L630

Sorry, I don't understand the purpose of checking the shuffle ID here. If the shuffle ID is invalid, what should we do next?

buska88 avatar Feb 20 '25 10:02 buska88

> @buska88 It would be better to check the validity of a shuffle ID after you get it, instead of using a shuffle ID that is marked as invalid. You can add the check here: https://github.com/apache/celeborn/blob/6a836f9523b7eb03d3eff4bf1c23349bf8270064/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java#L612-L630

> Sorry, I don't understand the purpose of checking the shuffle ID here. If the shuffle ID is invalid, what should we do next?

Maybe we can also throw a FetchFailedException and wait for the task retry.

RexXiong avatar Feb 28 '25 02:02 RexXiong

> @buska88 It would be better to check the validity of a shuffle ID after you get it, instead of using a shuffle ID that is marked as invalid. You can add the check here: https://github.com/apache/celeborn/blob/6a836f9523b7eb03d3eff4bf1c23349bf8270064/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java#L612-L630

> Sorry, I don't understand the purpose of checking the shuffle ID here. If the shuffle ID is invalid, what should we do next?

> Maybe we can also throw a FetchFailedException and wait for the task retry.

Okay. I chose to throw FetchFailedException in CelebornShuffleReader (ShuffleClientImpl doesn't have a Spark dependency). Please take a look when you have free time. cc @FMX @RexXiong

buska88 avatar Mar 04 '25 10:03 buska88
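
As a reference for how such a reader-side failure might be surfaced, here is a hypothetical Scala sketch, assuming the Spark 3.x FetchFailedException constructor (BlockManagerId, shuffleId, mapId, mapIndex, reduceId, message, cause); the object and parameter names are illustrative, not the actual CelebornShuffleReader code. FetchFailedException is Spark-internal (private[spark]), which is why code like this needs to live under an org.apache.spark.shuffle.* package, as Celeborn's Spark integration does.

```scala
// Hypothetical sketch, not the actual CelebornShuffleReader implementation.
// Placed under org.apache.spark.shuffle.* so the Spark-internal
// FetchFailedException (private[spark]) is accessible.
package org.apache.spark.shuffle.celeborn

import org.apache.spark.shuffle.FetchFailedException

object ReaderFetchFailure {

  /** Resolve the app shuffle id for a read; if none can be resolved (or it is
   *  marked invalid), throw FetchFailedException so Spark fails this task with
   *  a fetch failure and reruns the parent stage, rather than failing the job. */
  def resolveOrFail(
      sparkShuffleId: Int,
      startPartition: Int,
      resolve: Int => Option[Int]): Int = {
    resolve(sparkShuffleId) match {
      case Some(appShuffleId) => appShuffleId
      case None =>
        // bmAddress is null because the failure is not tied to one block
        // manager; mapId/mapIndex of -1 indicate the whole map output
        // (Spark 3.x constructor assumed).
        throw new FetchFailedException(
          null, sparkShuffleId, -1L, -1, startPartition,
          s"Cannot resolve a valid Celeborn shuffle id for shuffle $sparkShuffleId",
          null)
    }
  }
}
```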

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Mar 25 '25 08:03 github-actions[bot]

25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 3928.1 in stage 23.0 (TID 29935, zw06-data-hdp-dn29550.mt, executor 5432, partition 3928, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:07 WARN task-result-getter-3 TaskSetManager: Lost task 1434.0 in stage 23.0 (TID 26608, zw06-data-hdp-dn29550.mt, executor 5432): java.lang.OutOfMemoryError: GC overhead limit exceeded

25/04/29 21:42:07 INFO task-result-getter-3 TaskSetManager: Handle failed task, add task to pendingTasks, task 1434.0 in stage 23.0 (TID 26608, zw06-data-hdp-dn29550.mt, executor 5432)
25/04/29 21:42:07 INFO Reporter ApplicationMaster: AppMaster: targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400. 
25/04/29 21:42:07 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Registering block manager zw06-data-hdp-dn27371.mt:26633 with 2004.6 MiB RAM, BlockManagerId(6004, zw06-data-hdp-dn27371.mt, 26633, None)
25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 1434.1 in stage 23.0 (TID 29936, zw06-data-hdp-dn27371.mt, executor 6004, partition 1434, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 3929.1 in stage 23.0 (TID 29937, zw06-data-hdp-dn27371.mt, executor 6004, partition 3929, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 3927.1 in stage 23.0 (TID 29938, zw06-data-hdp-dn27371.mt, executor 6004, partition 3927, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 5432.
25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400. 
25/04/29 21:42:08 INFO dag-scheduler-event-loop DAGScheduler: Executor lost: 5432 (epoch 6)
25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Trying to remove executor 5432 from BlockManagerMaster.
25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5432, zw06-data-hdp-dn29550.mt, 17839, None)
25/04/29 21:42:08 INFO dag-scheduler-event-loop BlockManagerMaster: Removed 5432 successfully in removeExecutor
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.119.69.11:58434) with ID 5996
25/04/29 21:42:08 INFO spark-listener-group-executorManagement ExecutorMonitor: New executor 5996 has registered (new total is 1367)
25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400. 
25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Registering block manager zw06-data-hdp-dn29769.mt:22001 with 2004.6 MiB RAM, BlockManagerId(5996, zw06-data-hdp-dn29769.mt, 22001, None)
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 135.1 in stage 23.0 (TID 29939, zw06-data-hdp-dn29769.mt, executor 5996, partition 135, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 136.1 in stage 23.0 (TID 29940, zw06-data-hdp-dn29769.mt, executor 5996, partition 136, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 137.1 in stage 23.0 (TID 29941, zw06-data-hdp-dn29769.mt, executor 5996, partition 137, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 4786.
25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400. 
25/04/29 21:42:08 INFO dag-scheduler-event-loop DAGScheduler: Executor lost: 4786 (epoch 6)
25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Trying to remove executor 4786 from BlockManagerMaster.
25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Removing block manager BlockManagerId(4786, zw06-data-hdp-dn33950.mt, 35009, None)
25/04/29 21:42:08 INFO dag-scheduler-event-loop BlockManagerMaster: Removed 4786 successfully in removeExecutor
25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400. 
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.119.150.13:51114) with ID 5991
25/04/29 21:42:08 INFO spark-listener-group-executorManagement ExecutorMonitor: New executor 5991 has registered (new total is 1368)
25/04/29 21:42:08 ERROR celeborn-dispatcher-110 SparkUtils: Can not get TaskSetManager for taskId: 29935
25/04/29 21:42:08 INFO celeborn-dispatcher-110 LifecycleManager: handle fetch failure for appShuffleId 5 shuffleId 5

I found a new case for this PR. Task 29935 was launched on executor 5432. When executor 5432 was lost, SparkUtils could not get the TaskSetManager by task ID, because taskSet.removeRunningTask(tid) had already been called. So this PR seems useful for this case: the task can throw a FetchFailedException and the stage can be rerun. cc @RexXiong @cxzl25

buska88 avatar May 06 '25 10:05 buska88

Thanks @buska88. Merged to main (v0.6.0) and also backported to branch-0.5 (v0.5.5).

RexXiong avatar May 13 '25 08:05 RexXiong