celeborn icon indicating copy to clipboard operation
celeborn copied to clipboard

[CELEBORN-1549] Fix networkLocation persistence into Ratis

Open akpatnam25 opened this issue 1 year ago • 9 comments

What changes were proposed in this pull request?

Fixing a bug where the networkLocation is not persisted in Ratis, and the master defaults to DEFAULT_RACK when it loads the snapshot. This was missed in https://github.com/apache/celeborn/pull/2367 unfortunately, and it came up during our stress testing internally.

Why are the changes needed?

Needed for custom network aware replication, so that networkLocation state is kept in snapshot file.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated unit test to ensure serde is correct.

akpatnam25 avatar Aug 07 '24 06:08 akpatnam25

cc @mridulm @waitinfuture @FMX @SteNicholas @AngersZhuuuu @pan3793

akpatnam25 avatar Aug 07 '24 08:08 akpatnam25

@akpatnam25, no any document update?

SteNicholas avatar Aug 07 '24 09:08 SteNicholas

@akpatnam25, no any document update?

No I don't think there is any document to update @SteNicholas

akpatnam25 avatar Aug 07 '24 10:08 akpatnam25

can we merge this PR if it is good to go? :) @SteNicholas

akpatnam25 avatar Aug 08 '24 17:08 akpatnam25

Just hold on a second. The field is skipped for a reason. Let me find out why the field is not included in the proto.

FMX avatar Aug 09 '24 03:08 FMX

截屏2024-08-09 11 42 24 org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager#restoreMetaFromFile

Abstract meta manager will try to resolve workerinfos who are "DEFAULT_RACK".

FMX avatar Aug 09 '24 03:08 FMX

截屏2024-08-09 11 42 24 org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager#restoreMetaFromFile Abstract meta manager will try to resolve workerinfos who are "DEFAULT_RACK".

@FMX yes that is correct. However, this can be something different based on what is passed in to the workerInfo from the workers (see this PR). If that resolution on the worker side does not take place, master can try to resolve when it restores the snapshot.

akpatnam25 avatar Aug 09 '24 04:08 akpatnam25

@FMX essentially we need this whenever the networkLocation != DEFAULT_RACK. In this case, master should restore the snapshot with that custom network location. hope this makes sense

akpatnam25 avatar Aug 09 '24 04:08 akpatnam25

This is essentially something we missed in this PR: https://github.com/apache/celeborn/pull/2367

Otherwise network location is never persisted into ratis.

akpatnam25 avatar Aug 09 '24 06:08 akpatnam25

If the rack configs are changed and the masters are restarted but workers didn't. This pr will get the wrong network locations until the workers are lost and register again.

FMX avatar Aug 13 '24 03:08 FMX

In stress testing, did the master fail to resolve network locations?

FMX avatar Aug 13 '24 03:08 FMX

Thanks @FMX!! Yes, I understand the concern. This is basically only since our rack configs will never change. I described more below :slightly_smiling_face: In our environment, there are 2 ways for the master to know the networkLocation:

  • Query external REST API for each worker to get the networkLocation -- this is an expensive call for 400+ workers
  • Worker sends its networkLocation as a part of the registration process

Given this, we chose the 2nd approach above. Essentially, when a worker sends its networkLocation, that networkLocation will never change for us. The current logic in the code is that if the networkLocation upon reading the snapshot is default, then it will again try to query the external API at the Master again once more (this is not a desired code path, but is basically the backup in case the networkLocation is null in the snapshot). This means that networkLocation must be persisted into Ratis which is basically the cause of this PR. cc @mridulm

akpatnam25 avatar Aug 13 '24 03:08 akpatnam25

Even if rack configs changed, all masters would never be restarted all at the same time anyways in a production setting, so this would cause split brain anyways for a period of time, which would not be good for the cluster. Let me know what you think @FMX

akpatnam25 avatar Aug 13 '24 03:08 akpatnam25

@FMX I updated this PR to have a config to persist the networkLocation as we spoke offline. I think the failing test is unrelated to this PR, let me know if you think otherwise.

akpatnam25 avatar Aug 13 '24 20:08 akpatnam25

thanks @FMX and everyone else for the reviews!! :)

akpatnam25 avatar Aug 14 '24 05:08 akpatnam25