RANGER-5322: PolicyRefresher can struck with no download thread if exception thrown in startRefresher
What changes were proposed in this pull request?
Handled the exception thrown in loadRoles() and loadPolicy()
How was this patch tested?
run mvn clean install and all tests are clean.
I updated the PR after further debugging and found that both loadRoles() and loadPolicy() handle all exceptions internally without throwing any. While the code isn't very clean or readable, it is correct. In this PR, I fixed a couple of minor issues and added tests to cover the scenarios.
@vyommani I see the usages of timeout/sleep in the tests so asking, how much time latency does the new tests add to the current tests? Would it be a good idea to keep constants like TEST_TIMEOUT_SECONDS in a constants class whereby all timeouts across different tests can be maintained in a single file, it might be helpful to segment these variables for (short, mod, long) running tests.
CC: @mneethiraj
@vyommani I see the usages of
timeout/sleepin the tests so asking, how much time latency does the new tests add to the current tests? Would it be a good idea to keep constants likeTEST_TIMEOUT_SECONDSin a constants class whereby all timeouts across different tests can be maintained in a single file, it might be helpful to segment these variables for (short, mod, long) running tests.CC: @mneethiraj
@kumaab, It will be really good if we create a separate file where we put all the wait timeout etc. For the newly added test below are the approximate worst and average case times.
Worst-case latency: 10-15 seconds (due to factors like slow I/O, thread scheduling, or retries) Actual latency:
1->Latch countdown: 100-500 ms 2->File polling: under 1 second 3->Real average per test: ~1-3 seconds
The latency is deemed acceptable to ensure reliable testing of the PolicyRefresher background thread, which involves: Asynchronous policy loading
1->Polling Ranger Admin 2->Writing cached policies to disk 3->Handling retries and edge cases
Without these waits, tests might be flaky, passing locally but failing in CI due to timing issues.