corrosion icon indicating copy to clipboard operation
corrosion copied to clipboard

Buckets-based hashing for gentler resource utilization

Open jeromegn opened this issue 2 years ago • 1 comments

Corrosion presently makes full table scans to hash the contents of everything to be able to compare consistency between actors.

This is an expensive operation and it gets more and more expensive.

The new plan would be to create buckets based on primary keys hashes:

  • When Corrosion has written changes, hash all the unique primary keys, down to a u64
  • <hash> - (<hash> % <buckets size>) to determine the bucket "id"
  • Store the ID in a table linking table name, primary key and bucket ID.
  • Queue a background job to update the bucket's hash
  • Hash all bucket hashes periodically (or maybe in-place with XOR?) to determine the full consistency hash

This has several benefits:

  • No need to scan full tables
  • Should still provide a valid, consistent, hash across nodes
  • Can be done continuously

I believe this depends on https://github.com/vlcn-io/cr-sqlite/pull/344 so the pk <-> bucket ID table doesn't weigh too much. We'd be able to use the cr-sqlite-encoded primary keys which are varints and therefore a lot smaller.

jeromegn avatar Sep 20 '23 15:09 jeromegn

Update: this should be implemented in cr-sqlite directly and feature-flagged or something like that.

jeromegn avatar Nov 28 '23 19:11 jeromegn