ozone HDDS-7156. Clean container's inconsistent pendingDeletingBlockCount

What changes were proposed in this pull request?

The "pendingDeleteBlockCount" metadata of a container can become inconsistent in some cases. This ticket cleans up that inconsistency.
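As a rough illustration of the kind of cleanup described here, the sketch below resets a stored counter to the number of blocks that are actually pending deletion. It is a minimal sketch under stated assumptions, not the code in this patch: `ContainerMeta`, `PendingDeleteReconciler`, and `reconcile` are hypothetical names, and how the actual count is obtained is left to the caller.

```java
// Hypothetical stand-in for a container's metadata; not Ozone's real class.
final class ContainerMeta {
    // Counter stored in metadata; this is the value that may drift.
    long pendingDeleteBlockCount;

    ContainerMeta(long pendingDeleteBlockCount) {
        this.pendingDeleteBlockCount = pendingDeleteBlockCount;
    }
}

// Hypothetical helper illustrating a "reset to the real value" cleanup.
final class PendingDeleteReconciler {
    /**
     * Resets the stored counter to the number of blocks actually queued
     * for deletion. Returns true if the stored value had to be corrected.
     */
    static boolean reconcile(ContainerMeta meta, long actualPendingBlocks) {
        if (actualPendingBlocks < 0) {
            throw new IllegalArgumentException("actualPendingBlocks must be >= 0");
        }
        if (meta.pendingDeleteBlockCount == actualPendingBlocks) {
            return false; // already consistent, nothing to clean
        }
        meta.pendingDeleteBlockCount = actualPendingBlocks;
        return true;
    }
}
```

A reset like this only repairs the symptom. It does not explain why the stored counter drifted in the first place.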

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7156

How was this patch tested?

This is not a normal case; the patch works fine in our prod cluster.

Aug 22 '22 08:08 symious

@adoroszlai Could you help to review?

Aug 22 '22 08:08 symious

Hi @symious, could you please explain how the pending delete block count diverges from the correct value? It seems like we should identify the cause of the divergence before deciding the best way to fix it.

Aug 26 '22 23:08 errose28

/pending please explain how the pending delete block count diverges from the correct value

Jan 20 '23 14:01 adoroszlai

hey @symious ~ would you mind providing some cases when you encountered this inconsistent number? That'd be great! Thanks!

Jan 30 '23 07:01 DaveTeng0

Sorry for the late reply.

It happened quite a long time ago in our cluster. I remember there were some related error logs indicating that the metadata was inconsistent.

Since we applied this patch internally, it has been hard to reproduce the error log.

Jan 30 '23 08:01 symious

+1 We need to root cause the divergence. This is a good hack to reset and fix the inconsistency but there might be a larger problem at hand.

Feb 06 '23 06:02 kerneltime

This does need unit tests. We hit this issue locally and @errose28 has a version of the patch with tests - we forgot this review was pending 😞

We should update this patch with Ethan's addendum changes before we commit it.

Feb 24 '23 22:02 arp7

It turns out the pending delete block count can remain high enough on empty containers that it causes starvation in the top N container choosing policy and freezes all pending deletes, making this bug much more serious than I initially thought. I am +1 to merge this type of reset fix with unit tests added, as long as we open a follow-up jira to actually figure out why the pending delete block count number is too high.

Feb 24 '23 22:02 errose28
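The starvation described in the comment above can be sketched with a toy model. This is a hedged illustration only: `SketchContainer`, `TopNSketch`, `chooseTopN`, and `deletableInPass` are hypothetical names, and the selection logic is a simplified guess at a "top N by pending count" policy, not Ozone's actual implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical container record: the counter stored in metadata and the
// number of blocks that can really be deleted may disagree.
final class SketchContainer {
    final long id;
    final long storedPendingCount; // possibly stale metadata value
    final long deletableBlocks;    // blocks that actually exist to delete

    SketchContainer(long id, long storedPendingCount, long deletableBlocks) {
        this.id = id;
        this.storedPendingCount = storedPendingCount;
        this.deletableBlocks = deletableBlocks;
    }
}

// Simplified "top N" choosing policy (an assumption, not Ozone's code).
final class TopNSketch {
    /** Picks up to n containers with the largest stored pending count. */
    static List<SketchContainer> chooseTopN(List<SketchContainer> all, int n) {
        List<SketchContainer> candidates = new ArrayList<>();
        for (SketchContainer c : all) {
            if (c.storedPendingCount > 0) {
                candidates.add(c);
            }
        }
        // Sort by the stored counter, largest first.
        candidates.sort(Comparator
            .comparingLong((SketchContainer c) -> c.storedPendingCount)
            .reversed());
        int limit = Math.max(0, Math.min(n, candidates.size()));
        return new ArrayList<>(candidates.subList(0, limit));
    }

    /** Blocks that one pass over the chosen containers can really delete. */
    static long deletableInPass(List<SketchContainer> chosen) {
        long total = 0;
        for (SketchContainer c : chosen) {
            total += c.deletableBlocks;
        }
        return total;
    }
}
```

In this model, take three containers: two empty ones with stale stored counts of 500 and 400, and one with 10 real pending blocks. With N = 2, every pass selects the two empty containers, deletes nothing, leaves their counters unchanged, and so never reaches the third container.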

We should make sure this fix works well with HDDS-7259 and HDDS-7302 which are newer than this patch, but still don't seem to completely address the problem.

Feb 24 '23 22:02 errose28

@symious please take a look at https://github.com/apache/ozone/pull/4324

Mar 02 '23 05:03 kerneltime

/ready

Apr 17 '23 16:04 adoroszlai

This issue was fixed in #4324 and the jira is resolved. I think we can close this PR. Please reopen if this is not correct.

Apr 17 '23 16:04 errose28
