See below for the current proposal

Problem Statement

We are migrating our data from a legacy authorization system into SpiceDB and would like a way to easily verify the relationship counts in order to validate the accuracy of our data. Specifically, we would like to be able to do this in the following scenarios:

After hydrating data to verify counts match our legacy system
Periodically (e.g. daily) to ensure we our real-time-update logic from our legacy system to our SpiceDB instance are in sync, allowing us to investigate further if this is not the case

We can currently solve scenario 1 above by subscribing the Watch API while performing a hydration and maintaining the counts internally. For scenario 2 we can do a filtered ReadRelationships call for each relationship, but this requires retrieving more data than necessary from the database since we only care about the counts in the end. The ReadRelationships API also has the limitation of requiring us to make the calls in chunks of 1000, which forces us to process things serially.

In both cases having a relationships count API that simply performs a SELECT count(*) from relation_tuple WHERE ... clause against the database would be much more efficient/user friendly.

Solution Brainstorm

Implement a CountRelationships API with the following spec:

rpc CountRelationships(CountRelationshipsRequest)
  returns (stream CountRelationshipsResponse) {
    option (google.api.http) = {
      post: "/v1/relationships/count"
      body: "*"
    };
}

message CountRelationshipsRequest {
  Consistency consistency = 1;

  // relationship_filter defines the filter to be applied to the relationships
  // to be counted.
  RelationshipFilter relationship_filter = 2
      [ (validate.rules).message.required = true ];
}

message CountRelationshipsResponse {
  // read_at is the ZedToken at which the relationship count was performed
  ZedToken read_at = 1 [ (validate.rules).message.required = true ];

  uint64 relationship_count = 2;
}

Apr 08 '24 17:04 ryaneorth

My concerns with this idea are two-fold:

count(*) operations can be extremely heavy on databases, especially in CRDB where they are essentially linear. I believe this would be somewhat mitigated by the query only hitting the index(es) we already have, but I'm not convinced it wouldn't become a large issue, especially if the API is invoked multiple times a minute (or worst, many times per second)
What is the intention if the found count differs from that expected? At that point, there is some sort of data inconsistency, but unlike the watch solution, there is no means to determine exactly what is missing/broken/wrong. Would the intent then be to delete all matching relationships and re-hydrate?

If the intention is to only have this used during the initial hydration and subsequently for a small window, I believe having a service which consumes the Watch API and keeps a real-time count (bucketed by whatever filters are useful) might be a better solution: it could be an open source solution as well, configurable from the command line to take in filters, keep a real-time count and, using our ability to go backwards in time with Watch, be able to "pick up" if the watcher goes down

Apr 16 '24 21:04 josephschorr

Thanks for the response @josephschorr . I'll do my best to answer your questions:

count(*) operations can be extremely heavy on databases, especially in CRDB where they are https://github.com/cockroachdb/cockroach/issues/7825. I believe this would be somewhat mitigated by the query only hitting the index(es) we already have, but I'm not convinced it wouldn't become a large issue, especially if the API is invoked multiple times a minute (or worst, many times per second)

This is a legitimate concern, but I do think the existing indexes will help here quite a bit. I ran multiple different count(*) queries that could be formulated by different RelationshipFilters representing the types of queries which theoretically be provided to this API:

RelationshipFilter: resource_type: user

> explain select count(*) from relation_tuple where namespace='user';                                                                                     
                                           info
------------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • group (scalar)
  │ estimated row count: 1
  │
  └── • scan
        estimated row count: 1,495,925 (0.75% of the table; stats collected 21 days ago)
        table: relation_tuple@pk_relation_tuple
        spans: [/'user' - /'user']
(10 rows)

 > select count(*) from relation_tuple where namespace='user';                                                                                             
   count
-----------
  1371089
(1 row)

Time: 757ms total (execution 708ms / network 49ms)

RelationshipFilter: resource_type: user, optional_resource_id: abc

  > explain select count(*) from relation_tuple where namespace='user' AND object_id='abc';   
                                                                                              info
  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    distribution: full
    vectorized: true

    • group (scalar)
    │ estimated row count: 1
    │
    └── • scan
          estimated row count: 0 (<0.01% of the table; stats collected 21 days ago)
          table: relation_tuple@pk_relation_tuple
          spans: [/'user'/'abc' - /'user'/'abc']
  (10 rows)

  Time: 32ms total (execution 1ms / network 31ms)

  > select count(*) from relation_tuple where namespace='user' AND object_id='a';           
    count
  ---------
        2
  (1 row)

  Time: 35ms total (execution 3ms / network 32ms)

RelationshipFilter: resource_type: user, optional_resource_id_prefix: a

  > explain select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%';                              
                                                          info
--------------------------------------------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • group (scalar)
  │ estimated row count: 1
  │
  └── • scan
        estimated row count: 17,790 (<0.01% of the table; stats collected 21 days ago)
        table: relation_tuple@pk_relation_tuple
        spans: [/'user'/'a' - /'user'/'b')
(10 rows)

Time: 35ms total (execution 1ms / network 34ms)

> select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%';                                      
  count
---------
    380
(1 row)

Time: 35ms total (execution 4ms / network 31ms)

RelationshipFilter: resource_type: user, optional_resource_id_prefix: a, relation: manager

  > explain select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%' AND relation='manager';                                     
                                          info
--------------------------------------------------------------------------------------------
distribution: full
vectorized: true

• group (scalar)
│ estimated row count: 1
│
└── • filter
    │ estimated row count: 0
    │ filter: relation = 'manager'
    │
    └── • scan
          estimated row count: 206,272 (0.10% of the table; stats collected 21 days ago)
          table: relation_tuple@pk_relation_tuple
          spans: [/'user'/'a'/'manager' - /'user'/'b')

index recommendations: 1
1. type: index creation
  SQL command: CREATE INDEX ON "spicedb-production".public.relation_tuple (relation);
(18 rows)

Time: 38ms total (execution 6ms / network 32ms)

> select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%' AND relation='manager';                                             
count
---------
  106
(1 row)

Time: 56ms total (execution 3ms / network 53ms)

RelationshipFilter: resource_type: user, optional_resource_id_prefix: a, relation: manager, subject_type: user

  > explain select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%' AND relation='manager' AND userset_namespace='user'; 
                                                          info
  ---------------------------------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • group (scalar)
  │ estimated row count: 1
  │
  └── • filter
      │ estimated row count: 3
      │ filter: (relation = 'manager') AND (userset_namespace = 'user')
      │
      └── • scan
            estimated row count: 1,353,438 (0.68% of the table; stats collected 21 days ago)
            table: relation_tuple@pk_relation_tuple
            spans: [/'user'/'a'/'manager'/'user' - /'user'/'b')

  index recommendations: 1
  1. type: index creation
    SQL command: CREATE INDEX ON "spicedb-production".public.relation_tuple (namespace, relation, userset_namespace);
(18 rows)

Time: 47ms total (execution 6ms / network 41ms)

> select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%' AND relation='manager' AND userset_namespace='user';         
  count
---------
  1850
(1 row)

Time: 46ms total (execution 15ms / network 31ms)

As a note there are a total 202,626,032 of relations in this CockroachDB database. While I recognize the frequency of different values may lead to varying performance, overall the indexes already implemented allowed the search space to be vastly reduced.

Furthermore, it would be acceptable in my mind to execute these queries with AS OF SYSTEM TIME follower_read_timestamp(); ensuring the request is made against any available replica. Because this is more of a "stats" API I personally don't think exact-up-to-date numbers are strictly required.

If this really does become a problem where people are abusing this API I think adding some form of rate limiting is reasonable and is probably something that should be considered for APIs in general at some point in time.

What is the intention if the found count differs from that expected? At that point, there is some sort of data inconsistency, but unlike the watch solution, there is no means to determine exactly what is missing/broken/wrong. Would the intent then be to delete all matching relationships and re-hydrate?

It depends on the scenario. If it is a systemic issue then a bulk export on a subset of the data is likely required to investigate the root cause. If it is a one-off issue, then dehydration/rehydration of the data is likely the simplest fix.

Apr 16 '24 23:04 ryaneorth

So it occurred to us that the https://github.com/authzed/spicedb/issues/1785 issue might be an avenue to solve this problem:

We add a registration API called TrackRelationshipCount (and UntrackRelationshipCount) that takes in a relationships filter
The CountRelationships API can only be called once the tracking/registration API has been invoked.
We start with the CountRelationships API temporarily invoking count(*), but with registration still required.
We build the stats service (https://github.com/authzed/spicedb/issues/1785) to provide not only internal stats stored using HLLs but also the real counts for any registered filters using the above API. Once the stats service is ready, we transition the CountRelationships API from using the temporary call-count-* to using the stats API

Apr 19 '24 18:04 josephschorr

Codifying this into a proposal:

CountRelationships Proposal

Add the ability to track and retrieve the count of relationships based on registered filters in SpiceDB. This will be used both by external users for data validation purposes and, eventually, by SpiceDB itself for use in the eventual query planner (see https://github.com/authzed/spicedb/issues/1785 and https://github.com/authzed/spicedb/issues/1573)

Proposal

SpiceDB will add the following APIs:

// RegisterRelationshipCounter registers a new filter for counting relationships. A filter must be registered before
// a count can be requested.
rpc RegisterRelationshipCounter(RegisterRelationshipCounterRequest) returns (RegisterRelationshipCounterResponse)

// CountRelationships returns the count of relationships for *pre-registered* filter.
rpc CountRelationships(CountRelationshipsRequest) returns (CountRelationshipsResponse)

// UnregisterRelationshipCounter unregisters a new filter for counting relationships.
rpc UnregisterRelationshipCounter(UnregisterRelationshipCounterRequest) returns (UnregisterRelationshipCounterResponse)

message RegisterRelationshipCounterRequest {
  // relationship_filter defines the filter to be applied to the relationships
  // to be counted.
  RelationshipFilter relationship_filter = 1
      [ (validate.rules).message.required = true ];
}

message RegisterRelationshipCounterResponse {}

message CountRelationshipsRequest {
  Consistency consistency = 1;

  // relationship_filter defines the filter to be applied to the relationships
  // to be counted.
  RelationshipFilter relationship_filter = 2
      [ (validate.rules).message.required = true ];
}

message CountRelationshipsResponse {
  // read_at is the ZedToken at which the relationship count was performed
  ZedToken read_at = 1 [ (validate.rules).message.required = true ];

  uint64 relationship_count = 2;
}

message UnregisterRelationshipCounterRequest {
  RelationshipFilter relationship_filter = 1
      [ (validate.rules).message.required = true ];
}

message UnregisterRelationshipCounterResponse {}

Workflow and Implementation

The workflow with consist of registering one (or more) relationship filters, which will be stored in the datastore.

For the first-pass, naive implementation, the CountRelationshipsRequest will issue a count(*) with the filter to the underlying datastore.

For the second-pass, proper implementation, a distinct service will be added to SpiceDB to use the watch API to keep the counts up-to-date in-memory, with CountRelationshipsRequest dispatching to the service, and the service storing the counts on an interval (5m? 10m? 30m?) in the backing datastore alongside the last revision for which it was updated

May 02 '24 19:05 josephschorr

a distinct service will be added to SpiceDB to use the watch API to keep the counts up-to-date in-memory

Instead of a distinct service, we could use the DB to coordinate a leader election among SpiceDB replicas and choose one to hold the watch / do the counts.

May 02 '24 19:05 ecordell

Thanks for putting together this official proposal @josephschorr ! Do you anticipate having any kind of guardrails in terms of the number of relationship counters that can be registered at a given time.

Also, the RelationshipFilter allows the user to specify an optional_resource_id_prefix. While this will be very useful, I could see this causing a fair amount of load on the backing datastore when performing a count. Do you have any concerns with that? To be clear: for our use case we'd want to have this optional_resource_id_prefix and are willing to accept the load it may have on the database for the initial implementation.

May 03 '24 00:05 ryaneorth

Do you anticipate having any kind of guardrails in terms of the number of relationship counters that can be registered at a given time.

Yes, there will likely be a (configurable) limit

Do you have any concerns with that?

Yes, but fortunately it should be an index scan in most cases and once there is a real counting service, it will be handled in memory

May 03 '24 01:05 josephschorr

Closing since this is now merged. A followup with a service for counting will be added if need be. Fixes https://github.com/authzed/spicedb/pull/1901

May 27 '24 01:05 josephschorr

spicedb
spicedb copied to clipboard

Proposal: Add relationship count API

Problem Statement

Solution Brainstorm

CountRelationships Proposal

Proposal

Workflow and Implementation

spicedb spicedb copied to clipboard

Proposal: Add relationship count API

Problem Statement

Solution Brainstorm

CountRelationships Proposal

Proposal

Workflow and Implementation

spicedb
spicedb copied to clipboard