
HDDS-9279. OM HA: support read from followers.

Open szetszwo opened this issue 2 years ago • 43 comments

What changes were proposed in this pull request?

Ratis has a new Linearizable Read (RATIS-1557) feature, including reading from the followers. In this JIRA, we will change OM to serve read requests from any OM servers, including the follower OMs.

What is the link to the Apache JIRA

HDDS-9279

How was this patch tested?

Will add new tests.

szetszwo avatar Sep 13 '23 20:09 szetszwo

My immediate thought on this, without fully understanding the details, is that this is a major change in behavior and should probably be off by default until it sees more testing.

Is it possible for some followers to fall behind - if so, how is this handled?

Or are the followers always totally up to date? E.g., is a committed key on the leader absolutely available on all followers when the client call to commit the key returns, so that if a client immediately issues a read and it hits a different OM, it will see the data?

Is there a performance implication for writes by setting this on? E.g., in the old way, the followers are slightly behind as they apply the Ratis log, while with this on it's more like a 3-way commit for each write, so a delay (e.g. a GC pause) on one OM would affect the write time on the leader?

sodonnel avatar Sep 13 '23 21:09 sodonnel

cc @tanvipenumudy @muskan1012

kerneltime avatar Sep 14 '23 05:09 kerneltime

Thanks a lot @szetszwo for working on this.

Will add new tests.

Can we mark as "draft" until then?

adoroszlai avatar Sep 14 '23 08:09 adoroszlai

... should probably be off by default until it sees more testing.

@sodonnel , thanks for taking a look! Sure, let's disable it by default.

Is it possible for some followers to fall behind - if so, how is this handled?

Or are the followers always totally up to date? E.g., is a committed key on the leader absolutely available on all followers when the client call to commit the key returns, so that if a client immediately issues a read and it hits a different OM, it will see the data?

In short, the read-index algorithm handles it. For more details, see the design doc in RATIS-1557 and also the Raft thesis, Section 6.4.

Is there a performance implication for writes by setting this on? E.g., in the old way, the followers are slightly behind as they apply the Ratis log, while with this on it's more like a 3-way commit for each write, so a delay (e.g. a GC pause) on one OM would affect the write time on the leader?

The writes are the same. This feature only changes the behavior for reads.

szetszwo avatar Sep 14 '23 20:09 szetszwo

Can we mark as "draft" until then?

@adoroszlai , Done.

szetszwo avatar Sep 14 '23 20:09 szetszwo

I don't have any real understanding of Ratis or how it is applied to OM HA, so it's hard for me to understand how this would work.

For an OM write - the write updates the leader OM data (cache / RocksDB) and then writes the transaction to Ratis before the call returns to the client. For this Ratis write to succeed, must it make it onto the other 2 OM nodes and into their Ratis log or just into a majority of the Ratis logs?

When the Ratis transaction is received by the follower OM, what happens before that call returns to the leader who called it? Is the transaction written to the follower's Ratis log AND applied to the follower's memory state too before the Ozone client returns? Or is the Ratis log applied to the follower's memory asynchronously by a separate thread, meaning the original client call returns BEFORE the memory state is updated in all OMs?

If the original client call doesn't return until all 3 OMs have been updated, does this mean the 3 OMs have a strictly consistent memory state, rather than eventually consistent?

sodonnel avatar Sep 14 '23 21:09 sodonnel

... For this Ratis write to succeed, must it make it onto the other 2 OM nodes and into their Ratis log or just into a majority of the Ratis logs?

Only one follower is needed, i.e. 2 OMs (including the Leader) out of 3.

When the Ratis transaction is received by the follower OM, what happens before that call returns to the leader who called it? Is the transaction written to the follower's Ratis log AND applied to the follower's memory state too before the Ozone client returns? Or is the Ratis log applied to the follower's memory asynchronously by a separate thread, meaning the original client call returns BEFORE the memory state is updated in all OMs?

When the Leader replies to the client, the transaction must be committed (i.e. replicated to at least one follower) and applied at the Leader. The followers may not have applied it yet.

The read index algorithm has the following steps (a rough sketch in Java follows the list):

  1. a client sends a read request to a follower;
  2. the follower asks the Leader for its current commit index, say 10 -- this is the read index;
  3. the follower may only have applied the log up to a smaller index, say 8;
  4. the follower simply waits until its applied index advances to 10;
  5. the follower replies to the read request.
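
A simplified sketch of this follower-side flow in Java; all types and method names here are hypothetical placeholders, the real logic lives inside Ratis as part of RATIS-1557:

    // Sketch of the follower-side read-index flow. Hypothetical stub types,
    // not the actual Ratis API; see RATIS-1557 for the real implementation.
    class FollowerReadSketch {
      interface LeaderStub { long getCommitIndex(); }        // proxy to the Leader
      interface StateMachineStub {
        long getAppliedIndex();                              // local applied index
        byte[] query(byte[] request);                        // read from local state
      }

      private final LeaderStub leader;
      private final StateMachineStub stateMachine;

      FollowerReadSketch(LeaderStub leader, StateMachineStub stateMachine) {
        this.leader = leader;
        this.stateMachine = stateMachine;
      }

      byte[] read(byte[] request) throws InterruptedException {
        // Step 2: ask the Leader for its current commit index -- the read index (say 10).
        final long readIndex = leader.getCommitIndex();
        // Steps 3-4: the local applied index may be smaller (say 8); wait until it catches up.
        while (stateMachine.getAppliedIndex() < readIndex) {
          Thread.sleep(1);                                   // simplified wait/poll
        }
        // Step 5: the local state now covers everything committed when the read started.
        return stateMachine.query(request);
      }
    }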

szetszwo avatar Sep 16 '23 21:09 szetszwo

Maybe we can rerun some subset of the tests after changing the config to LINEARIZABLE.
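
For reference, a minimal sketch of turning this on in a test configuration; it assumes the ozone.om.ratis.server.read.option key quoted later in this thread is the property name this PR ends up using:

    import org.apache.hadoop.hdds.conf.OzoneConfiguration;

    // Sketch only: enable linearizable (follower) reads for a test run.
    // The property name is taken from a later comment in this thread and may
    // change before the PR is merged.
    public class LinearizableReadConfSketch {
      public static OzoneConfiguration newConf() {
        OzoneConfiguration conf = new OzoneConfiguration();
        conf.set("ozone.om.ratis.server.read.option", "LINEARIZABLE");
        return conf;
      }
    }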

kerneltime avatar Oct 02 '23 16:10 kerneltime

I think this is a good change to commit, as it gets us reads from standby almost for free. But I do have some concerns - e.g. if a follower is struggling and requests hit it, they will always be slow because it is always behind, etc. I know a lot of thought went into this sort of thing with HDFS, but my memory of it is too old now to remember any of the details.

sodonnel avatar Oct 02 '23 16:10 sodonnel

Hi, I would like to ask a question. I did a simple test by running hadoop fs -ls ofs://follower/xxx and found that it can be accessed via the follower. But when I attempt to access it via 'serverId' using ofs://serverId/xxx, it consistently hits the leader (per the audit log). So under what conditions will a request be forwarded to a follower? Or is it random?

whbing avatar Feb 01 '24 12:02 whbing

... I did a simple test by running ...

@whbing, do you mean that you had applied this change and then ran the test?

But I attempt to access it via 'serverId' using ofs://serverId/xxx, ...

I guess the serverId was translated to the pipeline somewhere at the client side. Then, the client uses the pipeline to contact the leader.

szetszwo avatar Feb 01 '24 16:02 szetszwo

... I did a simple test by running ...

@whbing, do you mean that you had applied this change and then ran the test?

But I attempt to access it via 'serverId' using ofs://serverId/xxx, ...

I guess the serverId was translated to the pipeline somewhere at the client side. Then, the client uses the pipeline to contact the leader.

I applied the PR and tested it in my cluster. I'm sorry that the above description is not accurate. It does not hit the leader every time, but rather the first OM in ozone.om.nodes.xxx in the client-side conf. For example, om2 will always be selected for reads if configured as <name>ozone.om.nodes.myozone</name><value>om2,om0,om1</value>

whbing avatar Feb 02 '24 04:02 whbing

... the first one in ozone.om.nodes.xxx in the client-side conf. ...

@whbing , it looks like the current code always uses the first OM in the list. We should choose the closest OM or randomize the order.

szetszwo avatar Feb 06 '24 21:02 szetszwo

@whbing , pushed a change to shuffle omNodeIDList.
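
Conceptually the change amounts to shuffling the configured OM node IDs before the client picks one; a simplified sketch (not the actual PR code, which lives in the OM failover proxy provider):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Sketch: randomize the order of the configured OM node IDs so that reads
    // spread across all OMs instead of always hitting the first entry of
    // ozone.om.nodes.<serviceId>.
    public class OmNodeShuffleSketch {
      public static List<String> shuffledNodeIds(List<String> configuredNodeIds) {
        List<String> omNodeIDList = new ArrayList<>(configuredNodeIds);
        Collections.shuffle(omNodeIDList);
        return omNodeIDList;
      }
    }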

szetszwo avatar Feb 07 '24 18:02 szetszwo

@whbing , pushed a change to shuffle omNodeIDList.

Thanks for the update; I have tested it and it works OK.

whbing avatar Feb 22 '24 07:02 whbing

I have another question. I noticed that the follower's 9862 RPC port restarts while running. Related ticket: HDDS-10177. Not sure what the impact is on follower reads during the RPC port restart.

whbing avatar Mar 20 '24 07:03 whbing

... Not sure what the impact is on follower reads during the RPC port restart.

The client will fail and retry. It seems okay if the client can fail over to the other OMs. We should test it.

szetszwo avatar Mar 25 '24 17:03 szetszwo

@whbing , @ivandika3 , I am currently not able to continue this work. Are you interested in working on this? Please feel free to let me know.

szetszwo avatar Mar 25 '24 18:03 szetszwo

@szetszwo We tested this feature with a revised version of this PR, https://github.com/symious/ozone/tree/HDDS-9279, but the performance improvement is not very satisfactory.

symious avatar Apr 03 '24 03:04 symious

@whbing , @ivandika3 , I am currently not able to continue this work. Are you interested in working on this? Please feel free to let me know.

@szetszwo Thanks! I'm not an expert on Ratis. As far as I know it is almost complete; I can help add some tests to this PR later.

whbing avatar Apr 03 '24 04:04 whbing

@szetszwo We tested this feature with a revised version of this PR, https://github.com/symious/ozone/tree/HDDS-9279, but the performance improvement is not very satisfactory.

Hi @symious, what about the performance of your tests? I also ran some performance tests; see https://docs.google.com/document/d/1xVkaQYDXJmztETJVZQHkkij_j8j6MGQ4XB8ehathhG8/edit#heading=h.o61uifuxltgn

whbing avatar Apr 03 '24 05:04 whbing

What about the performance of your tests?

About 20% performance improvement.

symious avatar Apr 03 '24 06:04 symious

..., We tested this feature with a revised version of this PR, https://github.com/symious/ozone/tree/HDDS-9279, ...

@symious , I checked the PR; the code looks good.

... but the performance improvement is not very satisfactory.

How did you test it? Could you share the results?

Suppose the test only has read requests but no write requests (i.e. the commit index won't change). Then the ops should see a ~3x improvement, since each OM should serve 1/3 of the requests.

The latency may see much smaller or no improvement since

  1. the readIndex algorithm has additional overhead, and
  2. without enabling LINEARIZABLE, the read is NOT linearizable even when it reads from the Leader.

szetszwo avatar Apr 03 '24 18:04 szetszwo

... I also ran some performance tests, ...

@whbing , thanks for testing the performance and sharing the results. I commented on the doc. My major comment is that a single freon command may not be able to test the performance correctly since the limitation may be at the client side. In other words, using a single client machine to test the performance of multiple server machines does not sound right.

We should try running the commands in multiple client machines.

szetszwo avatar Apr 03 '24 18:04 szetszwo

BTW, updating Ratis may help. I deployed a 3.1.0-c73a3eb-SNAPSHOT release, https://repository.apache.org/content/groups/snapshots/org/apache/ratis/ratis-common/3.1.0-c73a3eb-SNAPSHOT/ ; see if you could try it.

szetszwo avatar Apr 03 '24 18:04 szetszwo

Another observation is that a benchmark with writes will eventually converge on the leader, which makes most traffic eventually hit the leader for a persistent client (e.g. the Ozone client in S3G).

Example case:

1: READ -> follower 1
2: WRITE -> follower 1 will throw NotLeaderException, redirect to leader
3: subsequent READs -> leader (until the leader fails)

One possible way to ensure READs are sent to the followers is to modify the client (OmFailoverProxyProviderBase) to keep track of the current leader and followers of the OM service. All read requests would be sent to a follower, while write requests would be sent to the leader.

We might adapt some of the logic from HDFS's ObserverReadProxyProvider.
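
A rough sketch of that routing idea, with hypothetical names; a real implementation would extend OmFailoverProxyProviderBase and handle retries, stale leader information, security, etc.:

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    // Sketch: route reads to followers and writes to the leader, in the spirit
    // of HDFS's ObserverReadProxyProvider. All names here are hypothetical.
    public class ReadWriteRoutingSketch {
      private volatile String leaderNodeId;
      private volatile List<String> followerNodeIds;

      public ReadWriteRoutingSketch(String leaderNodeId, List<String> followerNodeIds) {
        this.leaderNodeId = leaderNodeId;
        this.followerNodeIds = followerNodeIds;
      }

      /** Reads go to a random follower so they do not converge on the leader. */
      public String pickNodeForRead() {
        final List<String> followers = followerNodeIds;
        if (followers.isEmpty()) {
          return leaderNodeId;  // fall back if no follower is known
        }
        return followers.get(ThreadLocalRandom.current().nextInt(followers.size()));
      }

      /** Writes always go to the current leader. */
      public String pickNodeForWrite() {
        return leaderNodeId;
      }

      /** Called after a NotLeaderException reveals the new leader; only the
       *  write path fails over, reads keep using the followers. */
      public void updateMembership(String newLeaderNodeId, List<String> newFollowerNodeIds) {
        this.leaderNodeId = newLeaderNodeId;
        this.followerNodeIds = newFollowerNodeIds;
      }
    }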

ivandika3 avatar Apr 04 '24 01:04 ivandika3

FYI: I created a comment about possible considerations around OM follower read in the ticket (https://issues.apache.org/jira/browse/HDDS-9279). Hopefully this will highlight some ideas around follower read.

ivandika3 avatar Apr 04 '24 02:04 ivandika3

... with a revised version of this PR, https://github.com/symious/ozone/tree/HDDS-9279 ...

@symious Did you encounter any errors with this code? I hit the following error many times when running ozone freon ockrw -r 1000 -t 3 --linear --contiguous --duration 5s --size 4096 -v s3v -b $bucket -p $prefix with ozone.om.ratis.server.read.option=LINEARIZABLE

Total execution time (sec): 7
Failures: 0
Successful executions: 0
java.lang.NullPointerException
        at org.apache.hadoop.ozone.om.helpers.OMRatisHelper.getOMResponseFromRaftClientReply(OMRatisHelper.java:69)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponseImpl(OzoneManagerRatisServer.java:532)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$2(OzoneManagerRatisServer.java:289)
        at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponse(OzoneManagerRatisServer.java:287)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:267)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:260)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:267)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:201)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:161)
        at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:152)
        at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)

reply.getMessage() is null.
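
If the root cause is the follower returning a RaftClientReply without a message, a defensive check along these lines would at least surface the underlying failure instead of an NPE (placement and fix are assumptions; the real fix may be on the Ratis side):

    import java.io.IOException;

    import org.apache.ratis.protocol.Message;
    import org.apache.ratis.protocol.RaftClientReply;

    // Sketch only: guard against a message-less reply before decoding it into
    // an OMResponse, instead of letting a NullPointerException escape.
    final class RaftReplyCheckSketch {
      private RaftReplyCheckSketch() { }

      static Message requireMessage(RaftClientReply reply) throws IOException {
        final Message message = reply.getMessage();
        if (message == null) {
          throw new IOException("Raft reply has no message: success="
              + reply.isSuccess() + ", exception=" + reply.getException(),
              reply.getException());
        }
        return message;
      }
    }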

whbing avatar Apr 09 '24 13:04 whbing

BTW, updating Ratis may help. I deployed a 3.1.0-c73a3eb-SNAPSHOT release, https://repository.apache.org/content/groups/snapshots/org/apache/ratis/ratis-common/3.1.0-c73a3eb-SNAPSHOT/ ; see if you could try it.

OM is not yet compatible with Ratis 3.1.0, e.g. RATIS-2011. (OM will also fail to start with the above jar.) But this PR doesn't look like it needs 3.1.0, so that can be skipped for now.

whbing avatar Apr 09 '24 13:04 whbing

reply.getMessage() is null.

Can test on 1.4.0 first.

symious avatar Apr 09 '24 14:04 symious