accumulo
Add table ID sanity checks to garbage collector
Currently, the Accumulo GC checks that each user table seen in the metadata table is properly formed (this check was recently improved by #1266). However, there is no check to ensure that all expected user tables are seen in the metadata table. So if there is an error and nothing is seen for a user table in the metadata table, the Accumulo GC will not know there is a problem.
The garbage collection algorithm reads a set of delete candidates into memory and then scans the metadata table to remove any candidates that are referenced. Sanity checks could be added to cross-reference the table IDs seen in the metadata table with those in ZooKeeper.
One possible way to do this is with the following three sets:
- BSTI : Table IDs in ZooKeeper before the scan, excluding some table states like NEW and DELETING.
- UMTI : Table IDs seen while scanning the metadata table.
- ASTI : Table IDs in ZooKeeper after the scan, excluding some table states like NEW and DELETING.
If (BSTI ∩ ASTI) ⊆ UMTI is true, then all expected table IDs were seen. If it is not true, then it is not safe to delete files. Building these sets and checking them in the GC before deleting could make the Accumulo GC more robust against unknown errors when scanning the metadata table.
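A minimal sketch of the proposed check, to make the set logic concrete. This is not existing Accumulo code: the class name `TableIdSanityCheck`, the method names, and the use of plain `String` table IDs are all assumptions made for illustration, and how the three sets would actually be gathered from ZooKeeper and the metadata scan is left out. Intersecting the before and after sets presumably serves to ignore tables created or deleted while the scan was running.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Accumulo: illustrates the check
// (BSTI ∩ ASTI) ⊆ UMTI using plain string table IDs.
class TableIdSanityCheck {

  // Table IDs that existed in ZooKeeper both before and after the scan
  // but were never seen while scanning the metadata table.
  static Set<String> missingTableIds(Set<String> bsti, Set<String> asti, Set<String> umti) {
    Set<String> expected = new HashSet<>(bsti);
    expected.retainAll(asti);   // BSTI ∩ ASTI
    expected.removeAll(umti);   // anything left over was not seen in the scan
    return expected;
  }

  // True when every expected table ID was seen, i.e. (BSTI ∩ ASTI) ⊆ UMTI.
  static boolean allExpectedTablesSeen(Set<String> bsti, Set<String> asti, Set<String> umti) {
    return missingTableIds(bsti, asti, umti).isEmpty();
  }

  public static void main(String[] args) {
    // Table "3" was deleted and table "4" was created during the scan.
    Set<String> bsti = Set.of("1", "2", "3");
    Set<String> asti = Set.of("1", "2", "4");

    // All tables present throughout the scan were seen.
    System.out.println(allExpectedTablesSeen(bsti, asti, Set.of("1", "2", "4")));

    // Table "2" was never seen in the metadata table, so deleting is not safe.
    System.out.println(missingTableIds(bsti, asti, Set.of("1", "4")));
  }
}
```

In this sketch the GC would skip the delete phase whenever `allExpectedTablesSeen` returns false, and `missingTableIds` could be logged to show which tables were absent from the scan.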
Unassigned myself as @mjwall expressed interest in working on the ticket as it may be related to the fix needed for #1916 that he has been investigating.