venice icon indicating copy to clipboard operation
venice copied to clipboard

[dvc][server][doc] Add exponential backoff for zk update retries + doc cleanup

Open gabrieldrouin opened this issue 8 months ago • 7 comments

Problem Statement

  • The current compareAndUpdate implementation in ZkBaseDataAccessor retries immediately when an update fails. As shown in issue #656, ZK replica status updates can be interrupted by temporary network issues or disconnects, causing partitions to move to error state
  • The Venice Workspace Setup documentation examples for running tests have inconsistent formatting with unnecessary $ prefixes

Solution

  • Added exponential backoff between retry attempts in the compareAndUpdate method, which will sleep for 2^retry seconds (1s, 2s, 4s) between each attempt, improving resiliency against temporary ZK connection issues
  • Removed unnecessary $ prompts from command examples for clarity and consistency in documentation

Code changes

  • [ ] Added new code behind a config. If so list the config names and their default values in the PR description.
  • [] Introduced new log lines.
    • [] Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • [] Code has no race conditions or thread safety issues.
  • [] Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • [] No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • [] Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • [] Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • [] Modified or extended existing tests.
  • [] Verified backward compatibility (if applicable).
    • Functionality remains the same, only the timing between retries has changed
    • Documentation changes are non-functional and do not affect behavior

Does this PR introduce any user-facing or breaking changes?

  • [x] No. You can skip the rest of this section.
  • [ ] Yes. Clearly explain the behavior change and its impact.

gabrieldrouin avatar Apr 25 '25 18:04 gabrieldrouin

NB: I believe this change could lead to race conditions due to Utils.sleep, although there are other methods in HelixUtils which are using Utils.sleep as well, and all methods using Utils.sleep in HelixUtils are covered with integration tests. Therefore, I believe my change can be tested by the CI before I need to add or change any current tests.

Also, as of this writing, there are no unit tests for compareAndUpdate, nor for any of the other methods using Utils.sleep, which is why I didn't add any either.

Please let me know if my reasoning is correct. Thank you!

gabrieldrouin avatar Apr 25 '25 18:04 gabrieldrouin

In your title, can you please add [server] as a tag

kvargha avatar Apr 29 '25 04:04 kvargha

NB: I believe this change could lead to race conditions due to Utils.sleep, although there are other methods in HelixUtils which are using Utils.sleep as well, and all methods using Utils.sleep in HelixUtils are covered with integration tests. Therefore, I believe my change can be tested by the CI before I need to add or change any current tests.

Also, as of this writing, there are no unit tests for compareAndUpdate, nor for any of the other methods using Utils.sleep, which is why I didn't add any either.

Please let me know if my reasoning is correct. Thank you!

I'm not sure I follow how this can lead to a race condition. Can you please explain a bit more?

There may not currently be any unit tests for compareAndUpdate; however, now is a great time to add one! But for this change, I don't think an integration test is necessary since many tests would cover this anyway.

kvargha avatar Apr 29 '25 04:04 kvargha

NB: I believe this change could lead to race conditions due to Utils.sleep, although there are other methods in HelixUtils which are using Utils.sleep as well, and all methods using Utils.sleep in HelixUtils are covered with integration tests. Therefore, I believe my change can be tested by the CI before I need to add or change any current tests. Also, as of this writing, there are no unit tests for compareAndUpdate, nor for any of the other methods using Utils.sleep, which is why I didn't add any either. Please let me know if my reasoning is correct. Thank you!

I'm not sure I follow how this can lead to a race condition. Can you please explain a bit more?

I'm actually not sure either. I may have had an incorrect assumption at some point while working on the problem. You can ignore this comment

gabrieldrouin avatar Apr 29 '25 10:04 gabrieldrouin

NB: I believe this change could lead to race conditions due to Utils.sleep, although there are other methods in HelixUtils which are using Utils.sleep as well, and all methods using Utils.sleep in HelixUtils are covered with integration tests. Therefore, I believe my change can be tested by the CI before I need to add or change any current tests. Also, as of this writing, there are no unit tests for compareAndUpdate, nor for any of the other methods using Utils.sleep, which is why I didn't add any either. Please let me know if my reasoning is correct. Thank you!

I'm not sure I follow how this can lead to a race condition. Can you please explain a bit more?

There may not currently be any unit tests for compareAndUpdate; however, now is a great time to add one! But for this change, I don't think an integration test is necessary since many tests would cover this anyway.

I have added 2 unit tests in 020d73d:

  1. Success case
  2. Fail case after 3 retries

gabrieldrouin avatar Apr 29 '25 15:04 gabrieldrouin

Thank you for the review @kvargha. I have commited my changes based on your suggestions.

Please let me know of any possible further improvements.

gabrieldrouin avatar Apr 29 '25 15:04 gabrieldrouin

Thank you @kvargha. Changes have been applied in 99b5fa7.

Please see my comment here regarding implementation details for ZkDataAccessException.

gabrieldrouin avatar Apr 30 '25 01:04 gabrieldrouin

Quick note here that we discussed implementing the exponential backoff for other methods in HelixUtils and corresponding tests in TestHelixUtils, which will be done in another PR once this current one is completed.

gabrieldrouin avatar May 06 '25 02:05 gabrieldrouin

Some minor nits, but great job! This PR is almost ready to go.

kvargha avatar May 06 '25 23:05 kvargha

Code looks good to me, but it looks like SpotBugs is failing.

Try running this locally and look for the HTML file path it outputs detailing the errors. ./gradlew :internal:venice-common:spotbugsTest

kvargha avatar May 08 '25 16:05 kvargha

@kvargha

Code looks good to me, but it looks like SpotBugs is failing.

Try running this locally and look for the HTML file path it outputs detailing the errors. ./gradlew :internal:venice-common:spotbugsTest

Learned something new today 💡

Regarding these vars:

private final String TEST_PATH = "/test/path";
private final String TEST_DATA = "testData";

Spotbugs says:

This class contains an instance final field that is initialized to a compile-time static value. Consider making the field static.

This is raised as a performance issue because static fields are shared among all instances of a class rather than being duplicated for each test class instance.

Note that these vars aren't flaged, because compile-time constants are limited to primitive types and string literals (so, excluding arrays and collections):

private final List<String> TEST_PATH_LIST = Arrays.asList("/test/path/child1", "/test/path/child2");
private final List<String> TEST_DATA_LIST = Arrays.asList("data1", "data2");
private final boolean[] SUCCESS_RESULTS = new boolean[] { true, true };
private final boolean[] FAILED_RESULTS = new boolean[] { true, false };

Nonetheless, our intention is that these are constants, thus have marked them as static (which does lead to a performance optimization as well).

See d07c59b.

gabrieldrouin avatar May 08 '25 20:05 gabrieldrouin