ozone
ozone copied to clipboard
HDDS-7309. Enable by default GRPC between S3G and OM
What changes were proposed in this pull request?
After discussions and testing, submitting this patch to enable GRPC server by default between S3G and OM, to improve performance.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7309
How was this patch tested?
This patch was tested manually.
I would like to see OM peak performance numbers with Hadoop RPC vs grpc. I understand that per client, the performance of grpc is better, but we need to check the server side of the performance as well. Should be a quick test to run.
cc @duongkame
@kerneltime As we said at the apacheCon, we will look at having this number when we have a bit more time, however this should not be a prerequisite to enable that feature by default. We already demonstrate that it increase the number of ops the S3G can handle and the user can still disable it if they want. Enabling it by default will allow to have more people to try it and it to be sure it doesnt break in the future with new pr
e by default. We already demonstrate that it increase the number of ops the S3G can handle and the user can still disable it if they want. Enabling it by default will allow to have more people to try it and it to be sure it doesnt break in the future with new pr
+1. I also think that the full gRpc scalability test is not required to enable the protocol on the server-side by default.
That may be needed when we indeed start using gRpc as the default protocol on the client sides, like S3G or OFS.
e by default. We already demonstrate that it increase the number of ops the S3G can handle and the user can still disable it if they want. Enabling it by default will allow to have more people to try it and it to be sure it doesnt break in the future with new pr
+1. I also think that the full gRpc scalability test is not required to enable the protocol on the server-side by default.
That may be needed when we indeed start using gRpc as the default protocol on the client sides, like S3G or OFS.
Makes sense. I think we should rename the Jira to indicate that this just starts the GRPC server on the OM and does not switch the client in S3G to start using it.
this just starts the GRPC server on the OM and does not switch the client in S3G to start using it.
Maybe I'm missing something, but it seems to me this patch does switch S3 Gateway to use gRPC for talking to OM by default.
OmTransportFactory#createFactory
creates the actual factory as an instance of the (first) class specified in the service descriptor, which is GrpcOmTransportFactory
for S3 Gateway:
https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/s3gateway/src/main/resources/META-INF/services/org.apache.hadoop.ozone.om.protocolPB.OmTransportFactory#L15
if the value configured for ozone.om.transport.class
does not match the value of OMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT
, which was and still is:
https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/OMConfigKeys.java#L341-L343
Prior to this patch default value of ozone.om.transport.class
and OMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT
did match, so S3 Gateway used the default transport despite its service descriptor. Since the value in ozone-default.xml
is being changed, this condition will now be true by default:
https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/OmTransportFactory.java#L48-L54
So GrpcOmTransportFactory
will be created by default for S3 Gateway, i.e. it will switch to the gRPC client.
From ozone
compose environment with default configs:
s3g_1 | 2022-10-14 19:32:35,747 [qtp661119548-86] INFO protocolPB.OmTransportFactory: ZZZ created custom OmTransportFactory: org.apache.hadoop.ozone.om.protocolPB.GrpcOmTransportFactory@77d1c0db
s3g_1 | 2022-10-14 19:32:35,920 [qtp661119548-86] INFO protocolPB.GrpcOmTransport: GrpcOmTransport: started
The first log message is one I added in OmTransportFactory
.
Also, enabling it only in OM, but not in S3 Gateway, would not have the following benefit:
Enabling it by default will allow to have more people to try it
@adoroszlai is right. We are enabling gRPC for S3G here.
this just starts the GRPC server on the OM and does not switch the client in S3G to start using it.
Maybe I'm missing something, but it seems to me this patch does switch S3 Gateway to use gRPC for talking to OM by default.
OmTransportFactory#createFactory
creates the actual factory as an instance of the (first) class specified in the service descriptor, which isGrpcOmTransportFactory
for S3 Gateway:https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/s3gateway/src/main/resources/META-INF/services/org.apache.hadoop.ozone.om.protocolPB.OmTransportFactory#L15
if the value configured for
ozone.om.transport.class
does not match the value ofOMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT
, which was and still is:https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/OMConfigKeys.java#L341-L343
Prior to this patch default value of
ozone.om.transport.class
andOMConfigKeys#OZONE_OM_TRANSPORT_CLASS_DEFAULT
did match, so S3 Gateway used the default transport despite its service descriptor. Since the value inozone-default.xml
is being changed, this condition will now be true by default:https://github.com/apache/ozone/blob/46e58a6350e86f7573fe45678e24c5dab96df45a/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/OmTransportFactory.java#L48-L54
So
GrpcOmTransportFactory
will be created by default for S3 Gateway, i.e. it will switch to the gRPC client.From
ozone
compose environment with default configs:s3g_1 | 2022-10-14 19:32:35,747 [qtp661119548-86] INFO protocolPB.OmTransportFactory: ZZZ created custom OmTransportFactory: org.apache.hadoop.ozone.om.protocolPB.GrpcOmTransportFactory@77d1c0db s3g_1 | 2022-10-14 19:32:35,920 [qtp661119548-86] INFO protocolPB.GrpcOmTransport: GrpcOmTransport: started
The first log message is one I added in
OmTransportFactory
.Also, enabling it only in OM, but not in S3 Gateway, would not have the following benefit:
Enabling it by default will allow to have more people to try it
Thank you @adoroszlai.
@xBis7 would be possible to get peak OM performance numbers, I would like to validate that the peak OM performance stays the same or is better with GRPC. I understand that each client is seeing better performance but it would be prudent to check the peak performance on OM as well.
Hi @adoroszlai @smengcl , In dev email communication, this PR is required by Ozone 1.3. Currently the release of Ozone-1.3 is blocked on this PR. Could you help continue to review this PR? cc @neils-dev
@captainzmc, there is no need for further review of this PR, it is good for master
, can be merged (based on the assumption that @smengcl's only concern was integration test coverage). Backporting it to 1.3 is a different question.
I don't see dev communication that this is blocking the release. I see a request to include it in the release if possible.
I'm not in favor of making such a change so late in the 1.3 release cycle. Enabling OM gRPC is a matter of simple config change, anyone can do it manually. I think it makes sense to first ship it disabled by default, let users enable on an as-needed basis and provide feedback, then enable in the next release based on that.
I'm not in favor of making such a change so late in the 1.3 release cycle. Enabling OM gRPC is a matter of simple config change, anyone can do it manually.
@adoroszlai Agree with you. In fact, users can change the configuration to decide whether to use grpc. I will remove this issue from the block list in 1.3. Then we can start preparing for 1.3.0-rc0.
Thanks @captainzmc for following up for this PR on the 1.3.0-rc branch and @adoroszlai for comments on that. I did request that this PR be included in the 1.3.0-rc but the comment from @adoroszlai is great, with it merged into the master as the default s3g config and allowing the 1.3.0-rc preparations without it,
@captainzmc, there is no need for further review of this PR, it is good for master, can be merged (based on the assumption that @smengcl's only concern was integration test coverage). ...remove this issue from the block list in 1.3. Then we can start preparing for 1.3.0-rc0.
Hello guys~ looks like this PR is ready to be merged. Maybe we could go ahead to merge?
@DaveTeng0 There is one bit of information we are waiting for, we need to measure the peak OM performance for Hadoop RPC vs GRPC and then this can be merged if there is no regression.
@xBis7 we did some initial tests and OM peak performance is about the same between the 2. Can you rebase this change?
@xBis7 this test error seems to be related, failed in both fork and PR runs:
Error: Tests run: 17, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 166.658 s <<< FAILURE! - in org.apache.hadoop.ozone.TestSecureOzoneCluster
Error: org.apache.hadoop.ozone.TestSecureOzoneCluster.testOMGrpcServerCertificateRenew Time elapsed: 15.596 s <<< ERROR!
java.io.IOException: Couldn't set up IO streams: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: scm/fv-az210-921.jgpbknbp5ljuxny3rrcz4t2bbd.bx.internal.cloudapp.net@EXAMPLE.COM, expecting: om/fv-az210-921.jgpbknbp5ljuxny3rrcz4t2bbd.bx.internal.cloudapp.net@EXAMPLE.COM
https://github.com/xBis7/ozone/actions/runs/4534441887/jobs/7988627717#step:5:3707 https://github.com/apache/ozone/actions/runs/4534442441/jobs/7988632110#step:5:3707
@adoroszlai Thanks for updating the PR and pointing out the test failure. I'll take a look.
We added this file /hadoop-ozone/integration-test/src/test/resources/META-INF/services/org.apache.hadoop.ozone.om.protocolPB.OmTransportFactory
under integration-test
which was forcing tests to use Hadoop RPC despite their configurations. The test that was failing, needed to use gRPC and had been configured to do so but this file was overriding the setting.
I've removed it and run a workflow on a dummy branch successfully. https://github.com/xBis7/ozone/actions/runs/4544882681
Thanks @xBis7 for the patch, @duongkame, @kerneltime, @smengcl for the review.