[FEATURE] Store the metadata of end-to-end integrity validation on the shuffle-server side
By leveraging PR #2653, we can ensure end-to-end data consistency. However, the partition stats are stored on the Spark driver side. For normal Spark stages this design works well, but with 100000 tasks and 10000 partitions it overloads the Spark driver. In clusters running such huge jobs, the job can hang while getting the blockManagerIds, costing almost 20 minutes for a single reader task, which is unacceptable.
So we should introduce an extra mechanism to store this metadata on the shuffle-server side.
The second step is to implement the server-side logic to accept and return the partition stats. If anyone is interested in this, feel free to take it.
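As a rough illustration of what the server-side piece could look like, here is a minimal Java sketch of an in-memory registry keyed by shuffle and partition id that accumulates writer-reported stats and hands them back to readers for validation. All class and method names here (`PartitionStatsRegistry`, `report`, `get`, the fields of `PartitionStats`) are hypothetical, not part of any existing API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a shuffle-server-side registry that stores per-partition
// stats (block count, total bytes) reported by writers, so readers can fetch
// the expected values and validate data integrity without involving the driver.
public class PartitionStatsRegistry {
    public static final class PartitionStats {
        public long blockCount;
        public long totalBytes;
    }

    private final Map<String, PartitionStats> stats = new ConcurrentHashMap<>();

    private static String key(int shuffleId, int partitionId) {
        return shuffleId + "-" + partitionId;
    }

    // Called when a writer reports stats for a partition; accumulates across
    // multiple reports for the same partition.
    public void report(int shuffleId, int partitionId, long blocks, long bytes) {
        stats.compute(key(shuffleId, partitionId), (k, s) -> {
            if (s == null) s = new PartitionStats();
            s.blockCount += blocks;
            s.totalBytes += bytes;
            return s;
        });
    }

    // Called by a reader to fetch the expected stats for validation;
    // returns null if nothing has been reported for that partition.
    public PartitionStats get(int shuffleId, int partitionId) {
        return stats.get(key(shuffleId, partitionId));
    }
}
```

A real implementation would of course plug into the server's RPC layer and handle cleanup on shuffle unregister, but the core idea is just a concurrent map owned by the shuffle server instead of the driver.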