Using Read Replica on AWS Aurora Postgres causes errors
What platforms are affected?
linux
What architectures are affected?
amd64
What SpiceDB version are you using?
v1.44.4
Also ran into this problem back on 1.42.x and was hoping it had been fixed in the interim.
Steps to Reproduce
- AWS Aurora Postgres cluster with a single writer and a single reader (db.r6g.xlarge)
- SpiceDB configured to use the cluster's Reader Endpoint via `SPICEDB_DATASTORE_CONN_URI`
- Follower delay `SPICEDB_DATASTORE_FOLLOWER_READ_DELAY_DURATION=2000ms`
- Multiple SpiceDB tasks running (2-5)
- Dispatch enabled
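For reference, the setup above boils down to something like the following. This is a hedged sketch, not our exact deployment: the endpoint, credentials, and preshared key are placeholders, and the env vars are the standard `SPICEDB_*` mappings of the CLI flags.

```shell
# Sketch of the SpiceDB configuration described above (placeholders throughout).
export SPICEDB_DATASTORE_ENGINE=postgres
# Aurora Reader Endpoint (placeholder hostname):
export SPICEDB_DATASTORE_CONN_URI="postgres://spicedb:PASSWORD@my-cluster.cluster-ro-xxxx.eu-west-1.rds.amazonaws.com:5432/spicedb"
# Follower delay, as reported in the repro steps:
export SPICEDB_DATASTORE_FOLLOWER_READ_DELAY_DURATION=2000ms
# Dispatch enabled across the 2-5 tasks:
export SPICEDB_DISPATCH_CLUSTER_ENABLED=true
spicedb serve --grpc-preshared-key "REPLACE_ME"
```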
I attempted to reproduce the issue in a test environment using Thumper, but was unable to see any errors after 1M+ requests 😓 My script was both checking permissions (weight 10) and writing relationships (weight 1) at 500 QPS, to ensure new revisions were being created.
Expected Result
No errors due to using a read replica when the follower delay is greater than the maximum replication latency
Actual Result
Example logs for a single request:
```json
{
  "requestID": "d2aie7vv6t0moif2fshg",
  "time": "2025-08-07T22:22:23Z",
  "level": "error",
  "source": "stderr",
  "error": {
    "error": "object definition `auth/platform` not found",
    "namespace": "auth/platform"
  },
  "message": "received unexpected graph error"
}
{
  "requestID": "d2aie7vv6t0moif2fshg",
  "time": "2025-08-07T22:22:23Z",
  "level": "error",
  "source": "stderr",
  "error": {
    "error": "object definition `auth/platform` not found",
    "namespace": "auth/platform"
  },
  "message": "unexpected dispatch graph error"
}
{
  "grpc.method": "DispatchCheck",
  "grpc.method_type": "unary",
  "peer.address": "10.0.2.197:57466",
  "grpc.start_time": "2025-08-07T22:22:23Z",
  "grpc.request.deadline": "2025-08-07T22:23:23Z",
  "source": "stderr",
  "grpc.code": "Unknown",
  "protocol": "grpc",
  "grpc.error": "object definition `auth/platform` not found",
  "grpc.time_ms": 3,
  "requestID": "d2aie7vv6t0moif2fshg",
  "time": "2025-08-07T22:22:23Z",
  "level": "error",
  "traceID": "8550e95f4f8d5a26eeeb9ad8900ac67f",
  "message": "finished call",
  "grpc.component": "server",
  "grpc.service": "dispatch.v1.DispatchService"
}
{
  "grpc.start_time": "2025-08-07T22:22:23Z",
  "grpc.code": "Unknown",
  "grpc.error": "rpc error: code = Unknown desc = object definition `auth/platform` not found",
  "grpc.request.deadline": "2025-08-07T22:23:23Z",
  "time": "2025-08-07T22:22:23Z",
  "source": "stderr",
  "message": "finished call",
  "traceID": "8550e95f4f8d5a26eeeb9ad8900ac67f",
  "grpc.component": "server",
  "grpc.service": "dispatch.v1.DispatchService",
  "protocol": "grpc",
  "grpc.method_type": "unary",
  "grpc.time_ms": 7,
  "requestID": "d2aie7vv6t0moif2fshg",
  "level": "error",
  "peer.address": "10.0.2.197:57268",
  "grpc.method": "DispatchCheck"
}
{
  "grpc.start_time": "2025-08-07T22:22:23Z",
  "grpc.code": "Unknown",
  "grpc.error": "rpc error: code = Unknown desc = object definition `auth/platform` not found",
  "grpc.time_ms": 10,
  "time": "2025-08-07T22:22:23Z",
  "source": "stderr",
  "traceID": "8550e95f4f8d5a26eeeb9ad8900ac67f",
  "grpc.component": "server",
  "grpc.service": "authzed.api.v1.PermissionsService",
  "protocol": "grpc",
  "grpc.method_type": "unary",
  "message": "finished call",
  "requestID": "d2aie7vv6t0moif2fshg",
  "level": "error",
  "peer.address": "10.0.5.80:47516",
  "grpc.method": "CheckPermission"
}
```
Maximum replication latency: (screenshot omitted)

We only started seeing any Unknown/FailedPrecondition errors once the replica was enabled: (screenshot omitted)
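For context, the expectation in this report rests on the follower-read timing argument: a read served by a replica is evaluated at a revision at least `follower-read-delay` old, so it should only fail if replication lag ever exceeds that delay. A toy model of that reasoning (not SpiceDB's actual code; the ~20ms "typical Aurora lag" figure below is an assumption, not from our metrics):

```python
# Toy model of the follower-read safety condition; illustrative only,
# not SpiceDB's implementation.
def follower_read_is_safe(replication_lag_ms: float, follower_delay_ms: float) -> bool:
    """A replica can serve a follower read only if every revision up to
    (now - follower_delay) has already replicated, i.e. lag < delay."""
    return replication_lag_ms < follower_delay_ms

# With the reported 2000ms delay and an assumed ~20ms Aurora lag,
# follower reads should never observe a missing revision:
assert follower_read_is_safe(replication_lag_ms=20, follower_delay_ms=2000)
# Errors would only be expected if lag spiked past the delay:
assert not follower_read_is_safe(replication_lag_ms=2500, follower_delay_ms=2000)
```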
Is the watching schema cache enabled?
> Is the watching schema cache enabled?

With `enable-experimental-watchable-schema-cache`? Nope, and we don't have Postgres configured to enable the Watch API either.

Edit: by "don't have Postgres configured" I mean the Watch API is disabled; to use it, Postgres must be run with `track_commit_timestamp=on`.
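For anyone checking their own setup, `track_commit_timestamp` can be verified and enabled like this (it requires a server restart to take effect; on RDS/Aurora it must be set via the DB cluster parameter group rather than `ALTER SYSTEM`):

```shell
# Check whether commit-timestamp tracking is on (needed for the Watch API on Postgres):
psql "$DATABASE_URL" -c "SHOW track_commit_timestamp;"

# Enable it on self-managed Postgres (takes effect after a restart);
# on Aurora, set the parameter in the cluster parameter group instead:
psql "$DATABASE_URL" -c "ALTER SYSTEM SET track_commit_timestamp = on;"
```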
I've found that this .NET client has a bug where it sets `fully_consistent=true` when it should be sending `minimize_latency=true` 😩 I have no idea whether that has any bearing on this issue, but it certainly isn't helping.
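For clarity, the consistency setting is a oneof on the request's `Consistency` message, so a request carries exactly one of `minimize_latency` / `at_least_as_fresh` / `at_exactly` / `fully_consistent`. A sketch of what the two payloads in question look like in protobuf-JSON form (field names follow the standard protobuf JSON mapping; this is an illustration, not client code):

```python
import json

# What the client *should* send vs. what the buggy client sent.
# Being a oneof, exactly one of these fields appears per request.
intended = {"consistency": {"minimizeLatency": True}}
buggy = {"consistency": {"fullyConsistent": True}}

# The two modes are mutually exclusive on the wire:
assert set(intended["consistency"]).isdisjoint(buggy["consistency"])
print(json.dumps(buggy, indent=2))
```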
Edit: I updated the Thumper script to use `consistency: FullyConsistent` and ran it against a test environment with the read replica enabled, and it reproduced the error.
The script:
```yaml
name: spam user#view checks
weight: 10
steps:
  - op: CheckPermission
    resource: auth/user:bob
    subject: auth/user:{{ randomObjectID }}
    permission: view
    expectNoPermission: true
    consistency: FullyConsistent
  - op: CheckPermission
    resource: auth/organisation:bob
    subject: auth/user:{{ randomObjectID }}
    permission: view
    expectNoPermission: true
    consistency: FullyConsistent
---
name: create new revisions
weight: 1
steps:
  - op: CheckPermission
    resource: auth/user:{{ randomObjectID }}
    subject: auth/user:{{ randomObjectID }}
    permission: view
    expectNoPermission: true
  - op: WriteRelationships
    updates:
      - op: TOUCH
        resource: auth/user:{{ randomObjectID }}
        subject: auth/user:{{ randomObjectID }}
        relation: this_user
  - op: CheckPermission
    resource: auth/user:{{ randomObjectID }}
    subject: auth/user:{{ randomObjectID }}
    permission: view
    consistency: FullyConsistent
  - op: WriteRelationships
    updates:
      - op: DELETE
        resource: auth/user:{{ randomObjectID }}
        subject: auth/user:{{ randomObjectID }}
        relation: this_user
```
Schema (simplified):

```
definition auth/user {
  relation this_user: auth/user
  permission view = this_user
}

definition auth/organisation {
  relation member: auth/user
  permission view = member
}
```
Obviously our real schema is a lot more complex, but hopefully that's not relevant for reproducibility.
@epbensimpson re dotnet client: have you considered using the official one? https://github.com/authzed/authzed-dotnet
> Edit: I updated the Thumper script to use consistency: FullyConsistent and ran it against a test environment with read replica enabled and it reproduced the error

This seems odd: read replicas are not used when fully-consistent reads are requested.
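A simplified model of the routing described in the comment above (not SpiceDB's actual code): fully-consistent reads must observe the newest revision and therefore go to the primary, while minimize-latency reads evaluate at an older follower-read revision that a replica should already have.

```python
# Simplified sketch of read routing by consistency mode; illustrative only.
def route_read(consistency: str, replica_has_revision: bool) -> str:
    if consistency == "fully_consistent":
        # Must observe the newest revision, which only the primary guarantees.
        return "primary"
    if consistency == "minimize_latency" and replica_has_revision:
        # Follower reads use an older revision the replica should already hold.
        return "replica"
    return "primary"

assert route_read("fully_consistent", replica_has_revision=True) == "primary"
assert route_read("minimize_latency", replica_has_revision=True) == "replica"
assert route_read("minimize_latency", replica_has_revision=False) == "primary"
```

Under this model, a fully-consistent request reproducing the error is surprising, which is the point being made above.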
> @epbensimpson re dotnet client: have you considered using the official one? https://github.com/authzed/authzed-dotnet

Yep, I switched over to that already.
It seems I'm seeing similar behavior with a cloudnative-pg cluster running locally. I've also seen this with fully-consistent requests, and we're likewise not using the Watch API or any experimental functionality. We are using the official Java client. After removing the read replica from the SpiceDB configuration, everything returned to normal.