
HDDS-7156. Clean container's inconsistent pendingDeletingBlockCount

Open symious opened this issue 2 years ago • 2 comments

What changes were proposed in this pull request?

The "pendingDeleteBlockCount" metadata can become inconsistent in some cases. This ticket cleans up that inconsistency.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-7156

How was this patch tested?

This is not a normal case; the patch works fine in our production cluster.

symious avatar Aug 22 '22 08:08 symious

@adoroszlai Could you help to review?

symious avatar Aug 22 '22 08:08 symious

Hi @symious, could you please explain how the pending delete block count diverges from the correct value? It seems like we should identify the cause of the divergence before deciding the best way to fix it.

errose28 avatar Aug 26 '22 23:08 errose28

/pending please explain how the pending delete block count diverges from the correct value

adoroszlai avatar Jan 20 '23 14:01 adoroszlai

hey @symious ~ would you mind providing some cases when you encountered this inconsistent number? That'd be great! Thanks!

DaveTeng0 avatar Jan 30 '23 07:01 DaveTeng0

Sorry for the late reply.

It happened quite a long time ago in our cluster. I remember there were some related error logs indicating that the metadata was inconsistent.

Since we applied this patch internally, it seems hard to reproduce the error log again.

symious avatar Jan 30 '23 08:01 symious

+1 We need to root cause the divergence. This is a good hack to reset and fix the inconsistency but there might be a larger problem at hand.
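
For readers following along, a "reset" of this kind can be pictured roughly as below. This is only a sketch under assumptions, not the code from this patch: `PendingCountResetSketch`, `ContainerMetadata`, `pendingDeleteBlockIds` and `reconcile` are hypothetical names, and the idea that the counter can be recomputed from a list of blocks actually awaiting deletion is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; none of these names come from Ozone.
class PendingCountResetSketch {

  /** Simplified stand-in for a container's metadata. */
  static final class ContainerMetadata {
    // Stored counter, which may have drifted from reality.
    long pendingDeleteBlockCount;
    // Assumed source of truth: the blocks that are actually awaiting deletion.
    final List<Long> pendingDeleteBlockIds = new ArrayList<>();
  }

  /**
   * Overwrites the stored counter with the real number of pending blocks.
   * Returns true if the counter had diverged and was reset.
   */
  static boolean reconcile(ContainerMetadata meta) {
    long actual = meta.pendingDeleteBlockIds.size();
    if (meta.pendingDeleteBlockCount == actual) {
      return false; // already consistent, nothing to do
    }
    meta.pendingDeleteBlockCount = actual;
    return true;
  }
}
```

A reset like this repairs the symptom only. It does not explain why the counter drifted in the first place, which is the root-cause question raised above.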

kerneltime avatar Feb 06 '23 06:02 kerneltime

This does need unit tests. We hit this issue locally and @errose28 has a version of the patch with tests - we forgot this review was pending 😞

We should update this patch with Ethan's addendum changes before we commit it.

arp7 avatar Feb 24 '23 22:02 arp7

It turns out the pending delete block count can remain high enough on empty containers that it causes starvation in the top N container choosing policy and freezes all pending deletes, making this bug much more serious than I initially thought. I am +1 to merge this type of reset fix with unit tests added, as long as we open a follow-up jira to actually figure out why the pending delete block count is too high.
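
The starvation effect can be illustrated with a toy model. This is a minimal sketch, not Ozone's actual choosing policy: `TopNStarvationSketch`, `Container`, `chooseTopN` and `deletableBlocks` are hypothetical names, and the policy is assumed to simply pick the N containers with the largest stored counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model for illustration only; not Ozone's real classes.
class TopNStarvationSketch {

  /** Simplified container: a stored counter plus the blocks really left to delete. */
  static final class Container {
    final String name;
    final long pendingDeleteBlockCount; // metadata counter, possibly stale
    final int actualPendingBlocks;      // blocks that could really be deleted

    Container(String name, long pendingDeleteBlockCount, int actualPendingBlocks) {
      this.name = name;
      this.pendingDeleteBlockCount = pendingDeleteBlockCount;
      this.actualPendingBlocks = actualPendingBlocks;
    }
  }

  /** Assumed policy: choose the n containers with the largest stored counter. */
  static List<Container> chooseTopN(List<Container> containers, int n) {
    List<Container> sorted = new ArrayList<>(containers);
    sorted.sort(Comparator
        .comparingLong((Container c) -> c.pendingDeleteBlockCount)
        .reversed());
    return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
  }

  /** Number of blocks one deletion pass could actually remove from the chosen containers. */
  static int deletableBlocks(List<Container> chosen) {
    int total = 0;
    for (Container c : chosen) {
      total += c.actualPendingBlocks;
    }
    return total;
  }

  public static void main(String[] args) {
    List<Container> containers = Arrays.asList(
        new Container("empty-1", 500, 0), // empty, but counter stuck high
        new Container("empty-2", 400, 0), // empty, but counter stuck high
        new Container("real-1", 30, 30),
        new Container("real-2", 10, 10));

    List<Container> chosen = chooseTopN(containers, 2);
    // Both slots go to the empty containers, so this pass deletes nothing.
    System.out.println("deletable blocks this pass: " + deletableBlocks(chosen));
  }
}
```

Because nothing is deleted from the chosen containers, their counters never drop, so the same empty containers win every pass and the containers with real pending work are never selected.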

errose28 avatar Feb 24 '23 22:02 errose28

We should make sure this fix works well with HDDS-7259 and HDDS-7302 which are newer than this patch, but still don't seem to completely address the problem.

errose28 avatar Feb 24 '23 22:02 errose28

@symious please take a look at https://github.com/apache/ozone/pull/4324

kerneltime avatar Mar 02 '23 05:03 kerneltime

/ready

adoroszlai avatar Apr 17 '23 16:04 adoroszlai

This issue was fixed in #4324 and the jira is resolved. I think we can close this PR. Please reopen if this is not correct.

errose28 avatar Apr 17 '23 16:04 errose28