
Improve the robustness of old ZooKeeper log removal

Open • trustin opened this issue 1 year ago • 1 comment

Motivation:

OldLogRemover in ZooKeeperCommandExecutor currently catches a Throwable when deleting an old log or its log blocks. However, this approach has two issues:

  • It doesn't handle an exception that's raised when reading the metadata of the old log.
  • Throwable is far too broad an exception to catch. Catching a KeeperException whose code is NONODE is enough.
    • Note that any other failure will only transfer the leadership to another replica, rather than stopping the whole replication process.

Modifications:

  • OldLogRemover now catches only a KeeperException whose code is NONODE.
  • An attempt to read a missing log node's metadata is now handled properly.
  • Added more detail to the log messages about missing nodes.
    • Split deleteLog() into deleteLog() and deleteLogBlock()
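
The modifications above can be illustrated with a minimal, self-contained sketch. It is not the actual Central Dogma code: FakeStore and NoNodeException are hypothetical stand-ins for the ZooKeeper client and for a KeeperException whose code is NONODE, and the node paths and the one-byte metadata layout are assumptions made only for illustration. Only the split into deleteLog() and deleteLogBlock() and the narrow catch mirror what this issue describes.

```java
import java.util.HashMap;
import java.util.Map;

class OldLogRemoverSketch {

    /** Hypothetical stand-in for a KeeperException whose code is NONODE. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("missing node: " + path);
        }
    }

    /** Hypothetical in-memory stand-in for the ZooKeeper client. */
    static class FakeStore {
        final Map<String, byte[]> nodes = new HashMap<>();

        byte[] getData(String path) throws NoNodeException {
            byte[] data = nodes.get(path);
            if (data == null) {
                throw new NoNodeException(path);
            }
            return data;
        }

        void delete(String path) throws NoNodeException {
            if (nodes.remove(path) == null) {
                throw new NoNodeException(path);
            }
        }
    }

    /**
     * Deletes a log and its blocks. Returns false if the log node was already
     * missing. Any exception other than NoNodeException propagates to the caller.
     */
    static boolean deleteLog(FakeStore store, String logPath) {
        byte[] meta;
        try {
            // Assumed layout: the first metadata byte holds the block count.
            meta = store.getData(logPath);
        } catch (NoNodeException e) {
            System.out.println("Log metadata already removed: " + logPath);
            return false;
        }
        for (int i = 0; i < meta[0]; i++) {
            deleteLogBlock(store, logPath + "/b" + i);
        }
        try {
            store.delete(logPath);
            return true;
        } catch (NoNodeException e) {
            System.out.println("Log already removed: " + logPath);
            return false;
        }
    }

    /** Deletes one log block, tolerating only a missing node. */
    static boolean deleteLogBlock(FakeStore store, String blockPath) {
        try {
            store.delete(blockPath);
            return true;
        } catch (NoNodeException e) {
            System.out.println("Log block already removed: " + blockPath);
            return false;
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.nodes.put("/logs/1", new byte[] {2});
        store.nodes.put("/logs/1/b0", new byte[0]); // block b1 is deliberately missing
        System.out.println("first attempt deleted: " + deleteLog(store, "/logs/1"));
        System.out.println("second attempt deleted: " + deleteLog(store, "/logs/1"));
    }
}
```

In this sketch a missing node is logged and skipped, while any other exception (for example a RuntimeException thrown by the store) is not caught, so the caller can still treat it as a real failure.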

Result:

  • The leadership is no longer transferred when OldLogRemover attempts to retrieve a missing log node's metadata, which is not a critical failure.
    • Instead, the leadership is transferred only when an exception occurs for a reason other than a missing node.

trustin, Sep 27 '24 16:09

Throwable is way too wide exception to catch. Catching a KeeperException whose code is NONODE will be enough. Note that the failure will only transfer the leadership to other replica, rather than stopping the whole replication process.

Question: Is there any chance that the replica, which receives the leadership, raises the same exception?

minwoox, Oct 08 '24 10:10