ozone
HDDS-6312. Use KeyPrefixContainer table to accelerate the process of DELETE/UPDATE events
What changes were proposed in this pull request?
Recon stores the ContainerKeyPrefix mapping in its local RocksDB. When Recon applies DELETE or UPDATE events from OM, it scans the whole table for each record to be deleted.
In a big cluster this table can hold a very large number of records, so the full scan for each record is very slow. In our cluster the table has 90 million records and each scan takes over 70 seconds; if one batch of OM delta updates contains 100 DELETE or UPDATE events, applying it takes about two hours (100 × 70 s ≈ 7,000 s).
This ticket is to accelerate the process with the help of a new local table.
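To illustrate the idea, here is a minimal sketch of a reverse-index table, not the code in this patch. It assumes sorted in-memory maps as a stand-in for RocksDB, drops the key version from the mapping, and uses hypothetical names (`KeyPrefixContainerSketch`, `containersBySeek`, `containersByFullScan`). The forward table is ordered by container ID, so finding a key prefix means visiting every record. The reverse table is ordered by key prefix, so the same lookup becomes a short range seek.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: TreeMap stands in for RocksDB's sorted key space.
final class KeyPrefixContainerSketch {
  // Separator assumed never to appear inside a key prefix.
  private static final char SEP = '\u0000';

  // Forward table: containerId + SEP + keyPrefix -> count.
  private final NavigableMap<String, Integer> containerKeyTable = new TreeMap<>();
  // Reverse table: keyPrefix + SEP + containerId -> count.
  private final NavigableMap<String, Integer> keyContainerTable = new TreeMap<>();

  // Zero-pad the ID so that string order matches numeric order.
  private static String pad(long containerId) {
    return String.format(Locale.ROOT, "%019d", containerId);
  }

  private static String forwardKey(long containerId, String keyPrefix) {
    return pad(containerId) + SEP + keyPrefix;
  }

  private static String reverseKey(String keyPrefix, long containerId) {
    return keyPrefix + SEP + pad(containerId);
  }

  /** Writes both tables so they stay consistent. */
  void put(long containerId, String keyPrefix, int count) {
    containerKeyTable.put(forwardKey(containerId, keyPrefix), count);
    keyContainerTable.put(reverseKey(keyPrefix, containerId), count);
  }

  /** Slow path: visits every forward record to find one key prefix. */
  List<Long> containersByFullScan(String keyPrefix) {
    List<Long> result = new ArrayList<>();
    String suffix = SEP + keyPrefix;
    for (String k : containerKeyTable.keySet()) {
      if (k.endsWith(suffix)) {
        result.add(Long.parseLong(k.substring(0, 19)));
      }
    }
    return result;
  }

  /** Fast path: range seek on the reverse table, touching only matching records. */
  List<Long> containersBySeek(String keyPrefix) {
    List<Long> result = new ArrayList<>();
    String from = keyPrefix + SEP;
    String to = keyPrefix + (char) (SEP + 1); // exclusive upper bound
    for (String k : keyContainerTable.subMap(from, true, to, false).keySet()) {
      result.add(Long.parseLong(k.substring(from.length())));
    }
    return result;
  }

  /** Applies a DELETE event: removes the key from both tables, returns records removed. */
  int deleteKey(String keyPrefix) {
    List<Long> containers = containersBySeek(keyPrefix);
    for (long id : containers) {
      containerKeyTable.remove(forwardKey(id, keyPrefix));
      keyContainerTable.remove(reverseKey(keyPrefix, id));
    }
    return containers.size();
  }

  public static void main(String[] args) {
    KeyPrefixContainerSketch db = new KeyPrefixContainerSketch();
    db.put(1L, "/vol/bucket/keyA", 1);
    db.put(2L, "/vol/bucket/keyA", 1);
    db.put(2L, "/vol/bucket/keyAB", 1);
    System.out.println(db.containersBySeek("/vol/bucket/keyA"));
    System.out.println(db.deleteKey("/vol/bucket/keyA"));
  }
}
```

The trade-off in this sketch is one extra write per record in exchange for a delete cost that depends on the number of containers holding the key, not on the size of the whole table.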
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-6312
How was this patch tested?
Unit tests.
@adoroszlai @avijayanhwx @ferhui Could you help to review this PR?
@JacksonYao287 Could you please help review?
@avijayanhwx Could you help to check this issue? This issue can be a problem if we want to retrieve updated information from OM.
In a small cluster, Recon's connections to OM look normal, arriving at 10-minute intervals. Please ignore the content of the warnings in the block below; what matters is that each line is a request from Recon to OM for delta updates.
2022-04-06 09:41:57,273 [Socket Reader #1 for port 9862] WARN org.apache.hadoop.ipc.Server: Connection Authentication from Recon:56662 for protocol org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol is failed for user ozone (auth:SIMPLE).
2022-04-06 09:51:57,357 [Socket Reader #1 for port 9862] WARN org.apache.hadoop.ipc.Server: Connection Authentication from Recon:60932 for protocol org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol is failed for user ozone (auth:SIMPLE).
2022-04-06 10:01:57,436 [Socket Reader #1 for port 9862] WARN org.apache.hadoop.ipc.Server: Connection Authentication from Recon:63884 for protocol org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol is failed for user ozone (auth:SIMPLE).
2022-04-06 10:11:57,544 [Socket Reader #1 for port 9862] WARN org.apache.hadoop.ipc.Server: Connection Authentication from Recon:24028 for protocol org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol is failed for user ozone (auth:SIMPLE).
2022-04-06 10:22:04,916 [Socket Reader #1 for port 9862] WARN org.apache.hadoop.ipc.Server: Connection Authentication from Recon:26968 for protocol org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol is failed for user ozone (auth:SIMPLE).
But in a big cluster, because the process is so slow, the gap between connections can be far longer than the 10-minute interval.
Thanks @symious for the patch and sorry for the long delay in review. Can you please resolve merge conflicts?
@adoroszlai Thanks for the review. Updated the patch, please have a check.
Thanks @symious for the patch. There's another patch causing minor conflicts. Would you merge latest master to the PR branch again?
I have resolved the conflict on my branch (I tried to push to your PR branch but got no permission), you can take this as a reference: https://github.com/smengcl/hadoop-ozone/blob/HDDS-6312/hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerKeyMapperTask.java#L249-L275
@smengcl Updated the PR, please have a look.
@smengcl Updated the patch, please have a look.
@smengcl Thank you for the review.
@smengcl Do you have any suggestions on the failed unit test?
The failure looks related at first glance: https://github.com/apache/ozone/runs/8008158105
Error: org.apache.hadoop.ozone.recon.recovery.TestReconOmMetadataManagerImpl.testUpdateOmDB Time elapsed: 0.249 s <<< FAILURE!
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:87)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertTrue(Assert.java:53)
at org.apache.hadoop.ozone.recon.recovery.TestReconOmMetadataManagerImpl.testUpdateOmDB(TestReconOmMetadataManagerImpl.java:139)
But Line 139 in TestReconOmMetadataManagerImpl.java doesn't really match any meaningful code on your branch, hmm.
I ran the same test locally and it passed.
Retriggering the CI.
Thanks @symious for the patch. Thanks @adoroszlai for the review.