HDDS-6743. Specify leader node for OM failover

Open symious opened this issue 3 years ago • 3 comments

What changes were proposed in this pull request?

Currently, if a client first connects to a follower OM, the response indicates that the OM is not the leader but does not specify the actual leader node.

This ticket makes the reply include the leader OM so that clients can connect to the leader node more conveniently.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6743

How was this patch tested?

unit test

symious avatar May 13 '22 07:05 symious

@adoroszlai @ChenSammi Could you help to review this PR?

symious avatar May 13 '22 07:05 symious

Thanks @symious for the work! I had an earlier patch for this issue: #2765

@hanishakoneru left a comment to explain why this should not be done. https://github.com/apache/ozone/pull/2765#issuecomment-952091699

I suggest we reach agreement on this issue first, and then go ahead.

JacksonYao287 avatar May 13 '22 10:05 JacksonYao287

@JacksonYao287 Sure, thanks for the review.

I think the concern in https://github.com/apache/ozone/pull/2765#issuecomment-952091699 is that a client-side misconfiguration might trigger an endless retry loop, so adding an address was preferred over returning only the OMNodeId.

In the latest commit of this PR, the OMNotLeaderException includes the following information:

  1. raftPeerId
  2. raftLeaderId
  3. raftLeaderAddress

An example of this exception message would be: org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:omNode-3 is not the leader. Suggested leader is OM:omNode-1/127.0.0.1. When a client receives this exception, it should try the address first; only if the address is empty should it fall back to the raftLeaderId we suggested.
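To illustrate the intended client behavior, here is a minimal sketch that parses a message in the format quoted above and picks a failover target. The class LeaderHint, its parse and failoverTarget methods, and the regular expression are hypothetical and written only for this example; they are not taken from the PR or from the Ozone code base, and the real client may obtain these fields in a different way.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical client-side helper for illustration; not part of Ozone. */
final class LeaderHint {
  // Assumed message shape, based only on the example in this comment:
  // "OM:<raftPeerId> is not the leader. Suggested leader is OM:<raftLeaderId>/<raftLeaderAddress>"
  private static final Pattern HINT = Pattern.compile(
      "OM:(\\S+) is not the leader\\. Suggested leader is OM:([^/\\s]*)(?:/(\\S*))?");

  final String raftPeerId;
  final String raftLeaderId;
  final String raftLeaderAddress;

  private LeaderHint(String peerId, String leaderId, String leaderAddress) {
    this.raftPeerId = peerId;
    this.raftLeaderId = leaderId;
    this.raftLeaderAddress = leaderAddress;
  }

  /** Returns the parsed hint, or empty if the message does not match the assumed shape. */
  static Optional<LeaderHint> parse(String message) {
    Matcher m = HINT.matcher(message);
    if (!m.find()) {
      return Optional.empty();
    }
    // A missing "/<address>" part is treated the same as an empty address.
    String address = m.group(3) == null ? "" : m.group(3);
    return Optional.of(new LeaderHint(m.group(1), m.group(2), address));
  }

  /** True when the suggested address is present and should be tried first. */
  boolean hasAddress() {
    return !raftLeaderAddress.isEmpty();
  }

  /** Prefers the address; falls back to raftLeaderId only when the address is empty. */
  String failoverTarget() {
    return hasAddress() ? raftLeaderAddress : raftLeaderId;
  }
}
```

With the example message, failoverTarget() yields the address 127.0.0.1; if the address part were empty, it would yield omNode-1, which the client would then have to resolve through its own configuration.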

symious avatar May 13 '22 11:05 symious

@symious is this PR still active? If not we can close it.

kerneltime avatar Oct 31 '22 16:10 kerneltime

Just saw this PR. Recently I've also been looking into an issue related to the out-of-sync mapping between client and server. Marking myself here so I can follow the latest changes to this PR. Thanks all!

DaveTeng0 avatar Nov 01 '22 05:11 DaveTeng0

@kerneltime Still active I think. Could you help review the PR? I will resolve the conflicts later.

symious avatar Nov 01 '22 06:11 symious

thanks @symious will get this reviewed

kerneltime avatar Nov 04 '22 08:11 kerneltime

cc @duongkame @aswinshakil @tanvipenumudy

kerneltime avatar Nov 04 '22 08:11 kerneltime