accumulo icon indicating copy to clipboard operation
accumulo copied to clipboard

Add version check to ZooPropStore

Open EdColeman opened this issue 2 years ago • 5 comments

This is the second half of splitting #2769 (first part is #2773)

EdColeman avatar Jun 14 '22 23:06 EdColeman

Currently seeking comments on approach - solution is complete if there are no objections - still should add tests (tested manually using uno)

EdColeman avatar Jun 14 '22 23:06 EdColeman

I think I would benefit from a basic explanation of what this PR is trying to do

The cache implementation provides multiple ways to refresh and expire entries.

  • refreshAfterWrite() - when an entry exceeds this value, the current value is returned and then an async task runs to determine if the entry is still valid. This was implemented as a light weight check using the ZooKeeper stat, similar to what is being moved to the scheduled thread in prop store.

  • expireAfterAccess() - allows the cache to age-off entries that are not accessed after the specified time period. This woul allow the cache to remove entries for tablets that are not longer hosted on a tserver.

The reasoning behind removing the refreshAfterWrite() responsibilities from the cache to the prop store is that with the property snapshots and the use of update count, the number of cache accesses for any particular entry are greatly reduced and it seems likely that the refresh would not be triggered in a timely manner. Using the scheduled thread in the prop store puts a bound on the consistency of the cache entries - if normal ZooKeeper events were missed for any reason, the scheduled task would serve to keep the values in sync within the bound of the scheduled task timing (roughly every 15 minutes)

EdColeman avatar Jun 16 '22 18:06 EdColeman

My understanding from what you've told me is that having this in ServerConfigurationFactory allows us to check that the snapshots are up-to-date without updating the access time of the cache. I think that's a good reason to implement it there, based on the current configs held by that factory. Thanks for the explanation.

ctubbsii avatar Jun 16 '22 19:06 ctubbsii

I have addressed @ctubbsii comments (one's left unresolved are to allow him to concur) - but this is ready for general review.

EdColeman avatar Jul 21 '22 16:07 EdColeman

This addresses all current PR comments. Because this is current with main, the full IT tests are likely to fail (see #2837) I think those failures are outside of this PR, but should be resolved before this is merged. I think there are two separate issues with the ITs:

  • The fate print and fate summary commands sometimes see FATES that are completed with SUCCESS but not clean-up.
  • The waitForFateOperations are failing with null pointer / runtime exception.

Both should be outside the scope of this PR

EdColeman avatar Aug 02 '22 00:08 EdColeman

I reviewed this as well and it also LGTM.

This may not be a big deal but my only comment is it would be nice to see a couple tests written for ConfigRefreshRunnner to verify the verifySnapshotVersions() method does what it is supposed to do after running, either running directly or through an integration test. (Check the cache entries were removed and configs inside ServerConfigurationFactory were cleared, etc). It's possible there are some existing tests that will exercise the change that already exist but I didn't see it with a quick search.

Running it directly would be a little annoying as you would either need to change the scope to package/protected to access it in tests so you could instantiate a new runner and execute verifySnapshotVersions() or use reflection.

For an ITest I suppose you could make the refresh interval configurable and set it something to be pretty short (a few seconds maybe) to verify it ran properly and evicted entries.

cshannon avatar Aug 26 '22 12:08 cshannon

@EdColeman - 2 approvals, can we merge this?

dlmarion avatar Sep 13 '22 11:09 dlmarion