
Support partitioning and resumability on ReadRelationships


For the case of an extremely large SpiceDB cluster (e.g. billions of stored relations), there is not currently an effective mechanism for iterating over the entire data set. This would be a very useful capability for doing reconciliation/consistency checking of data in SpiceDB against systems of record (from which SpiceDB relations are replicated) and search indexes (which may store permissions information replicated from SpiceDB).

The ReadRelationships API provides a streaming response, but no mechanism to "restart" reading from where a previous invocation left off. Providing a "start cursor" would enable a large ReadRelationships operation to be resumed from the application side.
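To make the resume pattern concrete, here is a minimal client-side sketch in Go. It assumes the proposed `start_cursor` request field and a per-result resume cursor on the response stream; neither exists in the current API, and the interface shapes below are purely illustrative, not the authzed-go client.

```go
// Sketch of resumable reads, assuming a hypothetical start_cursor on the
// request and a per-result cursor on the response stream.
package reconcile

import (
	"context"
	"io"
	"log"
)

// readClient mirrors the shape of the proposed API, for illustration only.
type readClient interface {
	// ReadRelationships streams results for one partition, starting after
	// startCursor (empty means "from the beginning of the partition").
	ReadRelationships(ctx context.Context, partition int32, startCursor string) (readStream, error)
}

type readStream interface {
	// Recv returns the next relationship and its resume cursor, or io.EOF.
	Recv() (relationship string, cursor string, err error)
}

// readAll drains one partition, re-opening the stream from the last seen
// cursor if it fails partway through (e.g. connection reset, pod restart).
func readAll(ctx context.Context, c readClient, partition int32, handle func(string) error) error {
	cursor := ""
	for {
		stream, err := c.ReadRelationships(ctx, partition, cursor)
		if err != nil {
			return err
		}
		for {
			rel, next, err := stream.Recv()
			if err == io.EOF {
				return nil // partition fully read
			}
			if err != nil {
				log.Printf("stream interrupted, resuming from cursor %q", cursor)
				break // re-open the stream from the last good cursor
			}
			if err := handle(rel); err != nil {
				return err
			}
			cursor = next // remember where to resume
		}
	}
}
```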

Additionally, for very large data sets it will likely be necessary to partition and parallelize the read operation. It would be beneficial if the data set could be sharded into a fixed number of partitions, with the partition number specified as an input parameter.

For example, the API might look something like:

message ReadRelationshipsRequest {
  Consistency consistency = 1;
  RelationshipFilter relationship_filter = 2;
  int32 partition_number = 3;
  string start_cursor = 4;
}

I would imagine that the partition count might be configurable per-cluster (e.g. a mid-sized cluster might be fine with 10 partitions, a tremendous cluster might need 1000 or more) but the partition assignment should be "stable" within a given cluster (i.e. if a given relation is returned in partition 1 on one ReadRelationships request, that relation should always be returned in partition 1 in subsequent calls).
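One way to get that stability would be to derive the partition from a hash of a canonical relationship key modulo the cluster's configured partition count. The sketch below illustrates that idea only; it is an assumption for discussion, not how SpiceDB assigns or would assign partitions.

```go
// Sketch of a stable partition assignment: hash a canonical relationship key
// modulo a per-cluster partition count (illustrative only).
package reconcile

import "hash/fnv"

// partitionFor maps a relationship in canonical string form (e.g.
// "document:readme#viewer@user:alice") to a partition in [0, partitionCount).
// Because it depends only on the relationship and the cluster-wide partition
// count, the same relationship always lands in the same partition across
// calls, which is the stability property described above.
func partitionFor(relationship string, partitionCount uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(relationship)) // Write on an fnv hash never returns an error
	return h.Sum32() % partitionCount
}
```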

jason-who-codes · Feb 10 '23

Breaking into two issues, starting with cursors: https://github.com/authzed/spicedb/issues/1339

josephschorr · May 22 '23