HDDS-6312. Use KeyPrefixContainer table to accelerate the process of DELETE/UPDATE events

Open symious opened this issue 3 years ago • 5 comments

What changes were proposed in this pull request?

Recon stores the ContainerKeyPrefix mapping in a local RocksDB. When Recon applies DELETE or UPDATE events from OM, it searches the whole table for each to-be-deleted record.

In a big cluster, this table can hold a very large number of records, and scanning it once per record is very slow. In our cluster there are 90 million records and each scan takes over 70 seconds, so if one batch of OM delta updates contains 100 DELETE or UPDATE events, applying it takes about two hours (100 × 70 s ≈ 7,000 s).

This ticket is to accelerate the process with the help of a new local table.
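
A minimal sketch of the idea, not the code in this patch: a second table keyed by key prefix first lets a DELETE find its containers with a range seek instead of a full scan. Here `TreeMap` stands in for a sorted RocksDB table, and the class name, method names, and composite-key layout are all hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class KeyPrefixIndexSketch {
    // Separator for composite keys; assumes key prefixes never contain '\u0000'.
    private static final char SEP = '\u0000';

    // Forward table, ordered by container ID first: "containerId SEP keyPrefix" -> count.
    private final TreeMap<String, Integer> containerKey = new TreeMap<>();
    // Reverse table (the new one), ordered by key prefix first: "keyPrefix SEP containerId" -> containerId.
    private final TreeMap<String, Long> keyContainer = new TreeMap<>();

    public void put(long containerId, String keyPrefix) {
        containerKey.put(containerId + "" + SEP + keyPrefix, 1);
        keyContainer.put(keyPrefix + SEP + containerId, containerId);
    }

    // Slow path: the forward table is not sorted by key prefix, so every row is visited.
    public List<Long> containersByFullScan(String keyPrefix) {
        List<Long> result = new ArrayList<>();
        for (String row : containerKey.keySet()) {
            int sep = row.indexOf(SEP);
            if (row.substring(sep + 1).equals(keyPrefix)) {
                result.add(Long.parseLong(row.substring(0, sep)));
            }
        }
        return result;
    }

    // Fast path: all rows for one key prefix are adjacent in the reverse table.
    public List<Long> containersByIndex(String keyPrefix) {
        String from = keyPrefix + SEP;               // inclusive lower bound
        String to = keyPrefix + (char) (SEP + 1);    // exclusive upper bound
        return new ArrayList<>(keyContainer.subMap(from, to).values());
    }

    // Apply a DELETE event: look up the containers via the index, then remove from both tables.
    public void delete(String keyPrefix) {
        for (long id : containersByIndex(keyPrefix)) {
            containerKey.remove(id + "" + SEP + keyPrefix);
            keyContainer.remove(keyPrefix + SEP + id);
        }
    }

    public static void main(String[] args) {
        KeyPrefixIndexSketch tables = new KeyPrefixIndexSketch();
        tables.put(1L, "/vol/bucket/a");
        tables.put(2L, "/vol/bucket/a");
        tables.put(3L, "/vol/bucket/ab");
        System.out.println(tables.containersByFullScan("/vol/bucket/a"));
        System.out.println(tables.containersByIndex("/vol/bucket/a"));
    }
}
```

The trade-off in this sketch is that every write touches two tables, in exchange for a per-event lookup cost that no longer grows with the total number of rows.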

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6312

How was this patch tested?

unit test.

symious avatar Feb 14 '22 10:02 symious

@adoroszlai @avijayanhwx @ferhui Could you help to review this PR?

symious avatar Feb 15 '22 14:02 symious

@JacksonYao287 Could you please help review?

ferhui avatar Feb 18 '22 10:02 ferhui

@avijayanhwx Could you help check this issue? It becomes a problem when we want to retrieve up-to-date information from OM.

In a small cluster, Recon connects to OM at the expected 10-minute interval. The content of the log lines below is not relevant here; what matters is that each line corresponds to one request from Recon to OM for delta updates.

2022-04-06 09:41:57,273 [Socket Reader #1 for port 9862] WARN org.apache.hadoop.ipc.Server: Connection Authentication from Recon:56662 for protocol org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol is failed for user ozone (auth:SIMPLE). 
2022-04-06 09:51:57,357 [Socket Reader #1 for port 9862] WARN org.apache.hadoop.ipc.Server: Connection Authentication from Recon:60932 for protocol org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol is failed for user ozone (auth:SIMPLE). 
2022-04-06 10:01:57,436 [Socket Reader #1 for port 9862] WARN org.apache.hadoop.ipc.Server: Connection Authentication from Recon:63884 for protocol org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol is failed for user ozone (auth:SIMPLE). 
2022-04-06 10:11:57,544 [Socket Reader #1 for port 9862] WARN org.apache.hadoop.ipc.Server: Connection Authentication from Recon:24028 for protocol org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol is failed for user ozone (auth:SIMPLE). 
2022-04-06 10:22:04,916 [Socket Reader #1 for port 9862] WARN org.apache.hadoop.ipc.Server: Connection Authentication from Recon:26968 for protocol org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol is failed for user ozone (auth:SIMPLE). 

But in a big cluster, because applying the updates is so slow, the gap between connections can be far longer than the 10-minute interval.

symious avatar Apr 06 '22 03:04 symious

Thanks @symious for the patch and sorry for the long delay in review. Can you please resolve merge conflicts?

adoroszlai avatar Jul 24 '22 19:07 adoroszlai

@adoroszlai Thanks for the review. Updated the patch, please have a check.

symious avatar Jul 25 '22 03:07 symious

Thanks @symious for the patch. Another patch is now causing minor conflicts. Would you merge the latest master into the PR branch again?

I have resolved the conflict on my branch (I tried to push to your PR branch but got no permission), you can take this as a reference: https://github.com/smengcl/hadoop-ozone/blob/HDDS-6312/hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerKeyMapperTask.java#L249-L275

smengcl avatar Aug 23 '22 21:08 smengcl

@smengcl Updated the PR, please have a look.

symious avatar Aug 24 '22 03:08 symious

@smengcl Updated the patch, please have a look.

symious avatar Aug 24 '22 13:08 symious

@smengcl Thank you for the review.

symious avatar Aug 25 '22 03:08 symious

@smengcl Do you have any suggestions on the failed unit test?

symious avatar Aug 25 '22 06:08 symious

> @smengcl Do you have any suggestions on the failed unit test?

The failure looks related at first glance: https://github.com/apache/ozone/runs/8008158105

Error:  org.apache.hadoop.ozone.recon.recovery.TestReconOmMetadataManagerImpl.testUpdateOmDB  Time elapsed: 0.249 s  <<< FAILURE!
java.lang.AssertionError
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at org.apache.hadoop.ozone.recon.recovery.TestReconOmMetadataManagerImpl.testUpdateOmDB(TestReconOmMetadataManagerImpl.java:139)

But Line 139 in TestReconOmMetadataManagerImpl.java doesn't really match any meaningful code on your branch, hmm.

I ran the same test locally and it passed.

Retriggering the CI.

smengcl avatar Aug 25 '22 07:08 smengcl

Thanks @symious for the patch. Thanks @adoroszlai for the review.

smengcl avatar Aug 30 '22 07:08 smengcl