closes #1377 - ensure all tables are checked when removing references to candidates that are still in use.
This is done by getting table ids from ZooKeeper at the start of the reference scan and again at the end of the reference scan. During the reference scan, the list of table ids seen in the metadata is tracked. After all candidates that are still in use have been removed, there is one more consistency check to ensure we looked at all tables.
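For illustration only, here is a minimal sketch of what that consistency check amounts to. This is not the actual patch; the class and method names (`GcTableCheckSketch`, `verifyAllTablesSeen`) and the three id sets are hypothetical stand-ins for the real GC code paths.

```java
import java.util.HashSet;
import java.util.Set;

public class GcTableCheckSketch {

  /**
   * Verify that every table id present in ZooKeeper both before and after the
   * reference scan was also seen while scanning the metadata table. Ids that
   * were created or deleted concurrently with the scan appear in only one of
   * the ZK snapshots, so they are excluded from the check.
   */
  static void verifyAllTablesSeen(Set<String> zkIdsBeforeScan, Set<String> zkIdsAfterScan,
      Set<String> idsSeenInMetadataScan) {
    // Ids that were stable in ZK across the entire reference scan.
    Set<String> mustBeSeen = new HashSet<>(zkIdsBeforeScan);
    mustBeSeen.retainAll(zkIdsAfterScan);

    // Any stable id missing from the metadata scan means the scan may have
    // skipped data, so fail the GC cycle rather than risk deleting files.
    mustBeSeen.removeAll(idsSeenInMetadataScan);
    if (!mustBeSeen.isEmpty()) {
      throw new IllegalStateException(
          "Table ids seen in ZK were not seen in the metadata scan: " + mustBeSeen);
    }
  }
}
```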
This is a cherry-pick of @mjwall 's GC fix on the 1.10 branch, applied to 2.1.
I think all the logic is sound for this improvement, but there is some minor cleanup that could be done, most likely due to cherry-picking and/or multiple contributors.
@keith-turner - I implemented your changes. I kind of expected a test failure, but that didn't happen.
Update: I found that I had to make a similar change in GarbageCollectionTest, now it fails....
@dlmarion I made some updates to the test in this commit keith-turner/accumulo@78cf05d9af90bb7ba218a17cda6755df80127702 to use DataLevel and to not have to add the metadata refs.
@keith-turner - I like it. Can you create a PR against my branch and I will merge it in?
Nevermind, I grabbed the diff and applied manually
So, I'm not convinced this is the right approach.
What's the implication of this statement? Are you saying that we need to go back to the drawing board? If I implement your suggested changes, is this good to go?
Unless the edge cases for table creation/deletion are addressed (like paying attention to table states, like Keith mentioned), then this should definitely not be merged as is. And, I wouldn't recommend its inclusion in any forked version of 1.10 either, because of the potential harmful side-effects I mentioned.
If the table creation/deletion edge cases are addressed, I would have to re-review. I've only reviewed what's there, not what code changes have yet to be proposed.
I'm not sure what "go[ing] back to the drawing board" would entail. It feels like we're shooting in the dark, adding a sanity check against a problem that is not well-defined.
It feels like we're shooting in the dark, adding a sanity check against a problem that is not well-defined.
I think the validation outlined in #1377 is very well defined: if table ids with a certain state were seen in ZK both before and after the metadata scan, then we must see them in metadata, OR something is really, really wrong. The less the GC trusts that the persisted data it reads is correct, and the more it tries to verify what it can about that data, the more robust the GC will be. The GC already does a good bit of checking on the metadata table as it reads it. Checking the metadata table against ZK will strengthen those existing verifications.
In addition to the check outlined in #1377, another check was added in #2293 for the case where table ids were seen in the metadata table and none were seen in ZK. This situation can legitimately happen when the total number of tables goes to/from zero concurrently with a GC cycle. It could also happen when the ZK read is silently failing and returning no data. There is no way to distinguish between the two causes, so the check can have false positives. If the system is actually working correctly, then the legitimate case would be transient. We could defer this second check and open an issue to do it in another PR, where we try to figure out how to avoid the false positive or make the check less noisy (like always skipping the GC cycle but logging info at first and warn/error on subsequent cycles if it is seen again). Or we could leave the check as-is in this PR, with the knowledge that it can always be modified/removed in a bug fix release. I lean towards leaving it and dealing with it in a bug fix release if it's problematic.
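As a rough sketch of the less-noisy option described above (not the actual #2293 change; the class, method, and field names here are hypothetical), the second check could skip the cycle and escalate logging only if the condition repeats:

```java
import java.util.Set;
import java.util.logging.Level;
import java.util.logging.Logger;

public class EmptyZkCheckSketch {

  private static final Logger log = Logger.getLogger(EmptyZkCheckSketch.class.getName());

  // Consecutive GC cycles in which ZK returned no table ids while the metadata
  // scan saw some; used to escalate logging if the condition persists.
  private int consecutiveEmptyZkCycles = 0;

  /**
   * If the metadata scan observed table ids but ZooKeeper returned none, we
   * cannot distinguish a legitimate transient zero-table state from a silently
   * failing ZK read, so skip the cycle and log, escalating from INFO to
   * WARNING on repeated occurrences.
   *
   * @return true if the GC cycle should proceed with candidate deletion
   */
  boolean shouldProceed(Set<String> zkTableIds, Set<String> metadataTableIds) {
    if (zkTableIds.isEmpty() && !metadataTableIds.isEmpty()) {
      consecutiveEmptyZkCycles++;
      Level level = consecutiveEmptyZkCycles == 1 ? Level.INFO : Level.WARNING;
      log.log(level, "ZooKeeper returned no table ids but the metadata scan saw "
          + metadataTableIds.size() + "; skipping GC cycle (occurrence "
          + consecutiveEmptyZkCycles + ")");
      return false;
    }
    consecutiveEmptyZkCycles = 0;
    return true;
  }
}
```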
@ctubbsii - do these recent changes address the changes you requested?
The full IT run passed except for the test failure identified in #2948.