corrosion icon indicating copy to clipboard operation
corrosion copied to clipboard

Probabilistic data structure for "current" known versions

Open jeromegn opened this issue 3 months ago • 0 comments

Probabilistic data structures can help Corrosion sync faster by preventing 99% of lookups for "cleared" versions.

Right now, a DB lookup is required if a version is known because Corrosion does not know if there are any changes in the DB for this version. This is a fairly recent change where everything isn't stored in memory anymore.

People are probably most familiar with Bloom filters. They're pretty great but have shortcomings: they're usually immutable and if they are mutable then it's not possible to delete items from them.

The most interesting data structure for us might be the Cuckoo filter which is similar to a Bloom or XOR filter, but it allows deletion. There's a crate called scalable_cuckoo_filter which seems to fit the bill. It is serializable so we can store it in the DB and load it on boot.

Since these data structures will never return false negatives but may return false positives, we have to store information that's "ok" to be wrong about and a DB lookup won't hurt anything. If we stored which versions are cleared, then we may mistakenly tell a different node that a version has been cleared (false positive). However, if we store current versions only, then a negative definitely means the version was cleared or we don't have it (checked against BookedVersions#needed). If we get a false positive, that's ok because Corrosion can query the DB and the worst case scenario is finding nothing and returning that the version has been cleared.

There's a challenge in fully initializing this structure because the queries can be expensive. Possibly iterating through hundreds of millions of rows in __corro_bookkeeping and various clock tables. It would have to be done incrementally. We can't use the structure until it has been filled completely by both the database's state and incoming data as we're processing broadcasts and syncs. This is tricky, but not impossible:

  • Start loading data incrementally in the background (a few versions at a time to not hog the database resources)
  • Insert / delete versions as Corrosion is processing syncs and broadcasts
  • Wrap the filter in a structure that knows if it is complete
  • Periodically (?) save the serialized filter w/ the "last processed version" information
  • Only rely on the filter if it has been fully loaded, meanwhile fallback to lookups, it ain't that bad

In my testing, a cuckoo filter of ~16 million u64 uses roughly 30MB of memory. It's probably less than that, I was looking at total memory usage of a test binary.

jeromegn avatar Apr 13 '24 20:04 jeromegn