ozone
HDDS-7156. Clean container's inconsistent pendingDeletingBlockCount
What changes were proposed in this pull request?
The container metadata value "pendingDeleteBlockCount" can become inconsistent in some cases; this ticket cleans up that inconsistency.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7156
How was this patch tested?
This is not a normal case, so it is hard to test directly. The patch works fine in our prod cluster.
@adoroszlai Could you help review this?
Hi @symious, could you please explain how the pending delete block count diverges from the correct value? It seems like we should identify the cause of the divergence before deciding the best way to fix it.
/pending please explain how the pending delete block count diverges from the correct value
hey @symious ~ would you mind providing some cases when you encountered this inconsistent number? That'd be great! Thanks!
Sorry for the late reply.
It happened quite a long time ago in our cluster. I remember there were some related error logs indicating that the metadata was inconsistent.
Since we applied this patch internally, it has been hard to reproduce the error log.
+1 We need to root cause the divergence. This is a good hack to reset and fix the inconsistency but there might be a larger problem at hand.
This does need unit tests. We hit this issue locally and @errose28 has a version of the patch with tests - we forgot this review was pending 😞
We should update this patch with Ethan's addendum changes before we commit it.
It turns out the pending delete block count can remain high enough on empty containers that it causes starvation in the top N container choosing policy and freezes all pending deletes, making this bug much more serious than I initially thought. I am +1 to merge this type of reset fix with unit tests added, as long as we open a follow-up jira to actually figure out why the pending delete block count is too high.
We should make sure this fix works well with HDDS-7259 and HDDS-7302 which are newer than this patch, but still don't seem to completely address the problem.
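To make the starvation scenario above concrete, here is a minimal, self-contained sketch of it. Everything in it is hypothetical and written only for illustration: `ContainerInfo`, `chooseTopN`, `runDeletionPass`, and `resetCounters` are not Ozone's real classes or methods, and the "reset" shown (setting the recorded counter to the true number of pending blocks) is just one plausible form such a fix could take, not necessarily what this patch or #4324 does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: these types are hypothetical, not Ozone's real API.
final class TopNStarvationSketch {

  /** Hypothetical container view: recorded counter vs. blocks really awaiting deletion. */
  static final class ContainerInfo {
    final String name;
    long recordedPendingCount; // metadata counter, may be stale
    long actualPendingBlocks;  // ground truth in this toy model

    ContainerInfo(String name, long recordedPendingCount, long actualPendingBlocks) {
      this.name = name;
      this.recordedPendingCount = recordedPendingCount;
      this.actualPendingBlocks = actualPendingBlocks;
    }
  }

  /** Choose the n containers with the largest recorded counter. */
  static List<ContainerInfo> chooseTopN(List<ContainerInfo> containers, int n) {
    List<ContainerInfo> sorted = new ArrayList<>(containers);
    sorted.sort(Comparator
        .comparingLong((ContainerInfo c) -> c.recordedPendingCount)
        .reversed());
    return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
  }

  /** One deletion pass over the chosen containers; returns blocks actually deleted. */
  static long runDeletionPass(List<ContainerInfo> containers, int n) {
    long deleted = 0;
    for (ContainerInfo c : chooseTopN(containers, n)) {
      long blocks = c.actualPendingBlocks;
      c.actualPendingBlocks = 0;
      // Only real deletions lower the counter, so a stale surplus never shrinks.
      c.recordedPendingCount -= blocks;
      deleted += blocks;
    }
    return deleted;
  }

  /** Assumed form of a "reset" fix: overwrite the counter with the true value. */
  static void resetCounters(List<ContainerInfo> containers) {
    for (ContainerInfo c : containers) {
      c.recordedPendingCount = c.actualPendingBlocks;
    }
  }

  public static void main(String[] args) {
    List<ContainerInfo> containers = new ArrayList<>();
    containers.add(new ContainerInfo("empty-1", 100, 0)); // stale counter
    containers.add(new ContainerInfo("empty-2", 100, 0)); // stale counter
    containers.add(new ContainerInfo("healthy", 10, 10)); // real work to do

    // The two empty containers always win the top-2 selection, so nothing is deleted.
    System.out.println("deleted before reset: " + runDeletionPass(containers, 2)); // 0
    resetCounters(containers);
    // With counters corrected, the healthy container is chosen and drained.
    System.out.println("deleted after reset: " + runDeletionPass(containers, 2)); // 10
  }
}
```

In this toy model the first pass deletes nothing no matter how many times it is repeated, because the stale counters never decrease. That matches the "freezes all pending deletes" behavior described above, and it is why a reset unblocks deletion even before the root cause of the inflated count is known.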
@symious please take a look at https://github.com/apache/ozone/pull/4324
/ready
This issue was fixed in #4324 and the jira is resolved. I think we can close this PR. Please reopen if this is not correct.