Junfan Zhang comments

Results: 434 comments of Junfan Zhang

[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

After rethinking this, I think `reassignAllShuffleServersForWholeStage` could be invoked by the retry writer rather than by the previously failed writer. That would ensure that no older data reaches the server after re-registration.

[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

> It's dangerous to delete the failed data of the stage when we retry. It's hard to reach the condition to delete the data. Could you describe more?

[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

> > > It's dangerous to delete the failed data of the stage when we retry. It's hard to reach the condition to delete the data. > > > >...

[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

Could you help review this, @EnricoMi @jerqi? The spark2 change will be finished once this PR is OK for you.

[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

> 1. How to reject the legacy requests? Using the latest attempt id on the server side to check whether the send request is valid with the older version, this will...
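
To make the attempt-id check concrete, here is a minimal sketch of how a server could remember the latest stage attempt per shuffle and reject requests from older attempts. `StageAttemptGate`, `register`, and `accepts` are hypothetical names used only for illustration; they are not taken from the Uniffle code base, and the real check may look quite different.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper (not Uniffle API): tracks the newest stage attempt
// registered per shuffle and rejects requests tagged with any other attempt.
class StageAttemptGate {
    private final ConcurrentMap<Integer, Integer> latestAttempt = new ConcurrentHashMap<>();

    // Called on (re-)registration; the stored attempt id never moves backwards.
    void register(int shuffleId, int stageAttemptId) {
        latestAttempt.merge(shuffleId, stageAttemptId, Integer::max);
    }

    // A send request is accepted only if it carries the latest registered attempt id.
    boolean accepts(int shuffleId, int stageAttemptId) {
        Integer latest = latestAttempt.get(shuffleId);
        return latest != null && latest == stageAttemptId;
    }
}
```

With this shape, a writer from attempt 0 that is still sending after attempt 1 has registered would simply be refused, without the server needing to know why the retry happened.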

[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

> Can we register a shuffle as the tuple `(shuffle_id, stage_attempt_id)`? This way, we do not need to wait for `(shuffle_id, 0)` to be deleted synchronously, and can go...
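
A rough sketch of the tuple idea from the quoted suggestion: if registrations are keyed by `(shuffle_id, stage_attempt_id)`, a retry can register under a fresh key while the previous attempt's entry is removed later. `ShuffleAttemptKey` and `ShuffleAttemptRegistry` are hypothetical names for illustration only, not existing server classes.

```java
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical key: identifies shuffle data by (shuffle_id, stage_attempt_id).
final class ShuffleAttemptKey {
    final int shuffleId;
    final int stageAttemptId;

    ShuffleAttemptKey(int shuffleId, int stageAttemptId) {
        this.shuffleId = shuffleId;
        this.stageAttemptId = stageAttemptId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ShuffleAttemptKey)) return false;
        ShuffleAttemptKey other = (ShuffleAttemptKey) o;
        return shuffleId == other.shuffleId && stageAttemptId == other.stageAttemptId;
    }

    @Override
    public int hashCode() {
        return Objects.hash(shuffleId, stageAttemptId);
    }
}

// Hypothetical registry: each stage attempt gets its own entry, so registering
// attempt 1 does not depend on attempt 0 having been deleted first.
class ShuffleAttemptRegistry {
    private final Set<ShuffleAttemptKey> registered = ConcurrentHashMap.newKeySet();

    // Returns true if this (shuffle, attempt) pair was newly registered.
    boolean register(int shuffleId, int stageAttemptId) {
        return registered.add(new ShuffleAttemptKey(shuffleId, stageAttemptId));
    }

    boolean isRegistered(int shuffleId, int stageAttemptId) {
        return registered.contains(new ShuffleAttemptKey(shuffleId, stageAttemptId));
    }

    // Can be called later, e.g. from a background cleanup task.
    boolean remove(int shuffleId, int stageAttemptId) {
        return registered.remove(new ShuffleAttemptKey(shuffleId, stageAttemptId));
    }
}
```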

[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

> Spark client can easily come up with a per-stage-attempt shuffle id and feed that to the shuffle server. That should not require any server-side refactoring. Thanks for your review....
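
One way the client-side variant in the quote could look, as a sketch only: pack the stage attempt number into the low bits of the integer shuffle id that is sent to the shuffle server. The class `AttemptShuffleId` and the 4-bit limit on attempts are assumptions made for this example; the discussion does not specify an encoding.

```java
// Hypothetical client-side encoding (not Uniffle API): derives a distinct
// shuffle id per stage attempt from the Spark shuffle id.
final class AttemptShuffleId {
    // Assumption for this sketch: at most 16 stage attempts per shuffle.
    static final int ATTEMPT_BITS = 4;
    static final int MAX_ATTEMPTS = 1 << ATTEMPT_BITS;

    private AttemptShuffleId() {}

    // Packs (shuffleId, stageAttemptId) into a single non-negative int.
    static int encode(int shuffleId, int stageAttemptId) {
        if (stageAttemptId < 0 || stageAttemptId >= MAX_ATTEMPTS) {
            throw new IllegalArgumentException("stage attempt out of range: " + stageAttemptId);
        }
        if (shuffleId < 0 || shuffleId > (Integer.MAX_VALUE >> ATTEMPT_BITS)) {
            throw new IllegalArgumentException("shuffle id out of range: " + shuffleId);
        }
        return (shuffleId << ATTEMPT_BITS) | stageAttemptId;
    }

    // Recovers the original Spark shuffle id.
    static int shuffleId(int encoded) {
        return encoded >>> ATTEMPT_BITS;
    }

    // Recovers the stage attempt number.
    static int stageAttemptId(int encoded) {
        return encoded & (MAX_ATTEMPTS - 1);
    }
}
```

The trade-off of this kind of packing is that it shrinks the usable shuffle id range and caps the number of retries, which is why the limit is stated explicitly as an assumption.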

[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

> > > > Spark client can easily come up with a per-stage-attempt shuffle id and feed that to the shuffle server. That should not require any server-side refactoring. >...

[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

> > > If we make the unique shuffleIdWithAttemptNo generated or converted in server side > > > > > > I presume the server side does not know about...

[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

@yl09099 will take over this and update PR here.