Ethan Feng comments

Results: 39 comments of Ethan Feng

[CELEBORN-1549] Fix networkLocation persistence into Ratis

If the rack configs are changed and the masters are restarted but the workers are not, this PR will get the wrong network locations until the workers are lost and register again.

[CELEBORN-1549] Fix networkLocation persistence into Ratis

In stress testing, did the master fail to resolve network locations?

[CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned

Merged into main(v0.6.0).

[CELEBORN-1599] Container Info REST API

> ping @FMX @SteNicholas @waitinfuture @turboFei PTAL thanks. I am not sure regarding the dependency check issues failing above, I ran `./dev/dependencies.sh --replace`, but it is still failing. Is...

[CELEBORN-1599] Container Info REST API

Thanks. Merged into main(v0.6.0).

[CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported

According to your Jira ticket, "that shuffle fetch fails does not lead to stage fail because task speculation and another attempts succeed", I think the quoted scenario should not happen...

[CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported

> > According to your Jira ticket, "that shuffle fetch fails does not lead to stage fail because task speculation and another attempts succeed", I think the quoted scenario should...

[CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported

> > > > According to your Jira ticket, "that shuffle fetch fails does not lead to stage fail because task speculation and another attempts succeed", I think the quoted...

[CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported

@buska88 It would be better to check the validity of a shuffle ID after you get it instead of using a shuffle ID that is marked as invalid. You can add...

Merge Resource.proto into TransportMessages.proto

I've updated the protobuf names, and there are more changes needed in this PR. You should change `HAHelper.convertByteStringToRequest` and `HAHelper.convertRequestToByteString` to use the meta request structures. As we have discussed...