ozone icon indicating copy to clipboard operation
ozone copied to clipboard

HDDS-7309. Enable by default GRPC between S3G and OM

Open xBis7 opened this issue 2 years ago • 8 comments

What changes were proposed in this pull request?

After discussions and testing, submitting this patch to enable GRPC server by default between S3G and OM, to improve performance.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7309

How was this patch tested?

This patch was tested manually.

xBis7 avatar Oct 11 '22 13:10 xBis7

I would like to see OM peak performance numbers with Hadoop RPC vs grpc. I understand that per client, the performance of grpc is better, but we need to check the server side of the performance as well. Should be a quick test to run.

kerneltime avatar Oct 12 '22 06:10 kerneltime

cc @duongkame

kerneltime avatar Oct 12 '22 06:10 kerneltime

@kerneltime As we said at the apacheCon, we will look at having this number when we have a bit more time, however this should not be a prerequisite to enable that feature by default. We already demonstrate that it increase the number of ops the S3G can handle and the user can still disable it if they want. Enabling it by default will allow to have more people to try it and it to be sure it doesnt break in the future with new pr

michelsumbul avatar Oct 12 '22 15:10 michelsumbul

e by default. We already demonstrate that it increase the number of ops the S3G can handle and the user can still disable it if they want. Enabling it by default will allow to have more people to try it and it to be sure it doesnt break in the future with new pr

+1. I also think that the full gRpc scalability test is not required to enable the protocol on the server-side by default.

That may be needed when we indeed start using gRpc as the default protocol on the client sides, like S3G or OFS.

duongkame avatar Oct 13 '22 20:10 duongkame

e by default. We already demonstrate that it increase the number of ops the S3G can handle and the user can still disable it if they want. Enabling it by default will allow to have more people to try it and it to be sure it doesnt break in the future with new pr

+1. I also think that the full gRpc scalability test is not required to enable the protocol on the server-side by default.

That may be needed when we indeed start using gRpc as the default protocol on the client sides, like S3G or OFS.

Makes sense. I think we should rename the Jira to indicate that this just starts the GRPC server on the OM and does not switch the client in S3G to start using it.

kerneltime avatar Oct 14 '22 00:10 kerneltime

this just starts the GRPC server on the OM and does not switch the client in S3G to start using it.

Maybe I'm missing something, but it seems to me this patch does switch S3 Gateway to use gRPC for talking to OM by default.

OmTransportFactory#createFactory creates the actual factory as an instance of the (first) class specified in the service descriptor, which is GrpcOmTransportFactory for S3 Gateway:

https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/s3gateway/src/main/resources/META-INF/services/org.apache.hadoop.ozone.om.protocolPB.OmTransportFactory#L15

if the value configured for ozone.om.transport.class does not match the value of OMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT, which was and still is:

https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/OMConfigKeys.java#L341-L343

Prior to this patch default value of ozone.om.transport.class and OMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT did match, so S3 Gateway used the default transport despite its service descriptor. Since the value in ozone-default.xml is being changed, this condition will now be true by default:

https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/OmTransportFactory.java#L48-L54

So GrpcOmTransportFactory will be created by default for S3 Gateway, i.e. it will switch to the gRPC client.

From ozone compose environment with default configs:

s3g_1       | 2022-10-14 19:32:35,747 [qtp661119548-86] INFO protocolPB.OmTransportFactory: ZZZ created custom OmTransportFactory: org.apache.hadoop.ozone.om.protocolPB.GrpcOmTransportFactory@77d1c0db
s3g_1       | 2022-10-14 19:32:35,920 [qtp661119548-86] INFO protocolPB.GrpcOmTransport: GrpcOmTransport: started

The first log message is one I added in OmTransportFactory.

Also, enabling it only in OM, but not in S3 Gateway, would not have the following benefit:

Enabling it by default will allow to have more people to try it

adoroszlai avatar Oct 14 '22 19:10 adoroszlai

@adoroszlai is right. We are enabling gRPC for S3G here.

xBis7 avatar Oct 17 '22 15:10 xBis7

this just starts the GRPC server on the OM and does not switch the client in S3G to start using it.

Maybe I'm missing something, but it seems to me this patch does switch S3 Gateway to use gRPC for talking to OM by default.

OmTransportFactory#createFactory creates the actual factory as an instance of the (first) class specified in the service descriptor, which is GrpcOmTransportFactory for S3 Gateway:

https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/s3gateway/src/main/resources/META-INF/services/org.apache.hadoop.ozone.om.protocolPB.OmTransportFactory#L15

if the value configured for ozone.om.transport.class does not match the value of OMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT, which was and still is:

https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/OMConfigKeys.java#L341-L343

Prior to this patch default value of ozone.om.transport.class and OMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT did match, so S3 Gateway used the default transport despite its service descriptor. Since the value in ozone-default.xml is being changed, this condition will now be true by default:

https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/OmTransportFactory.java#L48-L54

So GrpcOmTransportFactory will be created by default for S3 Gateway, i.e. it will switch to the gRPC client.

From ozone compose environment with default configs:

s3g_1       | 2022-10-14 19:32:35,747 [qtp661119548-86] INFO protocolPB.OmTransportFactory: ZZZ created custom OmTransportFactory: org.apache.hadoop.ozone.om.protocolPB.GrpcOmTransportFactory@77d1c0db
s3g_1       | 2022-10-14 19:32:35,920 [qtp661119548-86] INFO protocolPB.GrpcOmTransport: GrpcOmTransport: started

The first log message is one I added in OmTransportFactory.

Also, enabling it only in OM, but not in S3 Gateway, would not have the following benefit:

Enabling it by default will allow to have more people to try it

Thank you @adoroszlai.

@xBis7 would be possible to get peak OM performance numbers, I would like to validate that the peak OM performance stays the same or is better with GRPC. I understand that each client is seeing better performance but it would be prudent to check the peak performance on OM as well.

kerneltime avatar Oct 17 '22 16:10 kerneltime

Hi @adoroszlai @smengcl , In dev email communication, this PR is required by Ozone 1.3. Currently the release of Ozone-1.3 is blocked on this PR. Could you help continue to review this PR? cc @neils-dev

captainzmc avatar Nov 09 '22 02:11 captainzmc

@captainzmc, there is no need for further review of this PR, it is good for master, can be merged (based on the assumption that @smengcl's only concern was integration test coverage). Backporting it to 1.3 is a different question.

I don't see dev communication that this is blocking the release. I see a request to include it in the release if possible.

I'm not in favor of making such a change so late in the 1.3 release cycle. Enabling OM gRPC is a matter of simple config change, anyone can do it manually. I think it makes sense to first ship it disabled by default, let users enable on an as-needed basis and provide feedback, then enable in the next release based on that.

adoroszlai avatar Nov 09 '22 06:11 adoroszlai

I'm not in favor of making such a change so late in the 1.3 release cycle. Enabling OM gRPC is a matter of simple config change, anyone can do it manually.

@adoroszlai Agree with you. In fact, users can change the configuration to decide whether to use grpc. I will remove this issue from the block list in 1.3. Then we can start preparing for 1.3.0-rc0.

captainzmc avatar Nov 09 '22 08:11 captainzmc

Thanks @captainzmc for following up for this PR on the 1.3.0-rc branch and @adoroszlai for comments on that. I did request that this PR be included in the 1.3.0-rc but the comment from @adoroszlai is great, with it merged into the master as the default s3g config and allowing the 1.3.0-rc preparations without it,

@captainzmc, there is no need for further review of this PR, it is good for master, can be merged (based on the assumption that @smengcl's only concern was integration test coverage). ...remove this issue from the block list in 1.3. Then we can start preparing for 1.3.0-rc0.

neils-dev avatar Nov 09 '22 14:11 neils-dev

Hello guys~ looks like this PR is ready to be merged. Maybe we could go ahead to merge?

DaveTeng0 avatar Jan 30 '23 06:01 DaveTeng0

@DaveTeng0 There is one bit of information we are waiting for, we need to measure the peak OM performance for Hadoop RPC vs GRPC and then this can be merged if there is no regression.

kerneltime avatar Jan 30 '23 06:01 kerneltime

@xBis7 we did some initial tests and OM peak performance is about the same between the 2. Can you rebase this change?

kerneltime avatar Mar 27 '23 16:03 kerneltime

@xBis7 this test error seems to be related, failed in both fork and PR runs:

Error:  Tests run: 17, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 166.658 s <<< FAILURE! - in org.apache.hadoop.ozone.TestSecureOzoneCluster
Error:  org.apache.hadoop.ozone.TestSecureOzoneCluster.testOMGrpcServerCertificateRenew  Time elapsed: 15.596 s  <<< ERROR!
java.io.IOException: Couldn't set up IO streams: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: scm/fv-az210-921.jgpbknbp5ljuxny3rrcz4t2bbd.bx.internal.cloudapp.net@EXAMPLE.COM, expecting: om/fv-az210-921.jgpbknbp5ljuxny3rrcz4t2bbd.bx.internal.cloudapp.net@EXAMPLE.COM

https://github.com/xBis7/ozone/actions/runs/4534441887/jobs/7988627717#step:5:3707 https://github.com/apache/ozone/actions/runs/4534442441/jobs/7988632110#step:5:3707

adoroszlai avatar Mar 27 '23 18:03 adoroszlai

@adoroszlai Thanks for updating the PR and pointing out the test failure. I'll take a look.

xBis7 avatar Mar 27 '23 18:03 xBis7

We added this file /hadoop-ozone/integration-test/src/test/resources/META-INF/services/org.apache.hadoop.ozone.om.protocolPB.OmTransportFactory under integration-test which was forcing tests to use Hadoop RPC despite their configurations. The test that was failing, needed to use gRPC and had been configured to do so but this file was overriding the setting.

I've removed it and run a workflow on a dummy branch successfully. https://github.com/xBis7/ozone/actions/runs/4544882681

xBis7 avatar Mar 28 '23 18:03 xBis7

Thanks @xBis7 for the patch, @duongkame, @kerneltime, @smengcl for the review.

adoroszlai avatar Apr 19 '23 09:04 adoroszlai