[dvc][server][doc] Add exponential backoff for zk update retries + doc cleanup
Problem Statement
- The current
compareAndUpdateimplementation in ZkBaseDataAccessor retries immediately when an update fails. As shown in issue #656, ZK replica status updates can be interrupted by temporary network issues or disconnects, causing partitions to move to error state - The Venice Workspace Setup documentation examples for running tests have inconsistent formatting with unnecessary
$prefixes
Solution
- Added exponential backoff between retry attempts in the
compareAndUpdatemethod, which will sleep for2^retryseconds (1s, 2s, 4s) between each attempt, improving resiliency against temporary ZK connection issues - Removed unnecessary
$prompts from command examples for clarity and consistency in documentation
Code changes
- [ ] Added new code behind a config. If so list the config names and their default values in the PR description.
- [] Introduced new log lines.
- [] Confirmed if logs need to be rate limited to avoid excessive logging.
Concurrency-Specific Checks
Both reviewer and PR author to verify
- [] Code has no race conditions or thread safety issues.
- [] Proper synchronization mechanisms (e.g.,
synchronized,RWLock) are used where needed. - [] No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
- [] Verified thread-safe collections are used (e.g.,
ConcurrentHashMap,CopyOnWriteArrayList). - [] Validated proper exception handling in multi-threaded code to avoid silent thread termination.
How was this PR tested?
- [] Modified or extended existing tests.
- [] Verified backward compatibility (if applicable).
- Functionality remains the same, only the timing between retries has changed
- Documentation changes are non-functional and do not affect behavior
Does this PR introduce any user-facing or breaking changes?
- [x] No. You can skip the rest of this section.
- [ ] Yes. Clearly explain the behavior change and its impact.
NB: I believe this change could lead to race conditions due to Utils.sleep, although there are other methods in HelixUtils which are using Utils.sleep as well, and all methods using Utils.sleep in HelixUtils are covered with integration tests. Therefore, I believe my change can be tested by the CI before I need to add or change any current tests.
Also, as of this writing, there are no unit tests for compareAndUpdate, nor for any of the other methods using Utils.sleep, which is why I didn't add any either.
Please let me know if my reasoning is correct. Thank you!
In your title, can you please add [server] as a tag
NB: I believe this change could lead to race conditions due to Utils.sleep, although there are other methods in HelixUtils which are using Utils.sleep as well, and all methods using Utils.sleep in HelixUtils are covered with integration tests. Therefore, I believe my change can be tested by the CI before I need to add or change any current tests.
Also, as of this writing, there are no unit tests for compareAndUpdate, nor for any of the other methods using Utils.sleep, which is why I didn't add any either.
Please let me know if my reasoning is correct. Thank you!
I'm not sure I follow how this can lead to a race condition. Can you please explain a bit more?
There may not currently be any unit tests for compareAndUpdate; however, now is a great time to add one! But for this change, I don't think an integration test is necessary since many tests would cover this anyway.
NB: I believe this change could lead to race conditions due to Utils.sleep, although there are other methods in HelixUtils which are using Utils.sleep as well, and all methods using Utils.sleep in HelixUtils are covered with integration tests. Therefore, I believe my change can be tested by the CI before I need to add or change any current tests. Also, as of this writing, there are no unit tests for compareAndUpdate, nor for any of the other methods using Utils.sleep, which is why I didn't add any either. Please let me know if my reasoning is correct. Thank you!
I'm not sure I follow how this can lead to a race condition. Can you please explain a bit more?
I'm actually not sure either. I may have had an incorrect assumption at some point while working on the problem. You can ignore this comment
NB: I believe this change could lead to race conditions due to Utils.sleep, although there are other methods in HelixUtils which are using Utils.sleep as well, and all methods using Utils.sleep in HelixUtils are covered with integration tests. Therefore, I believe my change can be tested by the CI before I need to add or change any current tests. Also, as of this writing, there are no unit tests for compareAndUpdate, nor for any of the other methods using Utils.sleep, which is why I didn't add any either. Please let me know if my reasoning is correct. Thank you!
I'm not sure I follow how this can lead to a race condition. Can you please explain a bit more?
There may not currently be any unit tests for
compareAndUpdate; however, now is a great time to add one! But for this change, I don't think an integration test is necessary since many tests would cover this anyway.
I have added 2 unit tests in 020d73d:
- Success case
- Fail case after 3 retries
Thank you for the review @kvargha. I have commited my changes based on your suggestions.
Please let me know of any possible further improvements.
Thank you @kvargha. Changes have been applied in 99b5fa7.
Please see my comment here regarding implementation details for ZkDataAccessException.
Quick note here that we discussed implementing the exponential backoff for other methods in HelixUtils and corresponding tests in TestHelixUtils, which will be done in another PR once this current one is completed.
Some minor nits, but great job! This PR is almost ready to go.
Code looks good to me, but it looks like SpotBugs is failing.
Try running this locally and look for the HTML file path it outputs detailing the errors.
./gradlew :internal:venice-common:spotbugsTest
@kvargha
Code looks good to me, but it looks like SpotBugs is failing.
Try running this locally and look for the HTML file path it outputs detailing the errors.
./gradlew :internal:venice-common:spotbugsTest
Learned something new today 💡
Regarding these vars:
private final String TEST_PATH = "/test/path";
private final String TEST_DATA = "testData";
Spotbugs says:
This class contains an instance final field that is initialized to a compile-time static value. Consider making the field static.
This is raised as a performance issue because static fields are shared among all instances of a class rather than being duplicated for each test class instance.
Note that these vars aren't flaged, because compile-time constants are limited to primitive types and string literals (so, excluding arrays and collections):
private final List<String> TEST_PATH_LIST = Arrays.asList("/test/path/child1", "/test/path/child2");
private final List<String> TEST_DATA_LIST = Arrays.asList("data1", "data2");
private final boolean[] SUCCESS_RESULTS = new boolean[] { true, true };
private final boolean[] FAILED_RESULTS = new boolean[] { true, false };
Nonetheless, our intention is that these are constants, thus have marked them as static (which does lead to a performance optimization as well).
See d07c59b.