
Using Read Replica on AWS Aurora Postgres causing errors

Open epbensimpson opened this issue 5 months ago • 5 comments

What platforms are affected?

linux

What architectures are affected?

amd64

What SpiceDB version are you using?

v1.44.4

Also ran into this problem back on 1.42.x and was hoping it had been fixed in the interim.

Steps to Reproduce

  • AWS Aurora Postgres cluster with single writer and single reader (db.r6g.xlarge)
  • SpiceDB configured to use the Reader Endpoint of the cluster via SPICEDB_DATASTORE_CONN_URI
  • Follower delay SPICEDB_DATASTORE_FOLLOWER_READ_DELAY_DURATION=2000ms
  • Multiple SpiceDB tasks running (2-5)
  • Dispatch enabled

I attempted to reproduce the issue in a test environment using Thumper but saw no errors after more than 1M requests 😓 My script ran at 500 QPS and both checked permissions (weight 10) and wrote relations (weight 1) to ensure new revisions were being created.

Expected Result

No errors from using a read replica, provided the follower delay is greater than the maximum replication latency.
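
The precondition stated here can be checked mechanically. Below is a minimal sketch, assuming replica lag has been collected as a list of millisecond samples. The helper names and the sample values are hypothetical (the measured values in this report exist only as a screenshot), and the duration parser handles only simple `ms`/`s`/`m` strings. It verifies the reporter's assumption only; it does not explain the errors described below.

```python
import re

# Multipliers for the simple duration suffixes handled by this sketch.
_UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000}


def duration_to_ms(value):
    """Convert a simple duration such as '2000ms' or '2s' to milliseconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m)", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNITS_MS[unit]


def delay_covers_lag(follower_delay, lag_samples_ms):
    """Return (ok, headroom_ms); ok is True when the delay exceeds the worst lag."""
    delay_ms = duration_to_ms(follower_delay)
    worst_lag_ms = max(lag_samples_ms)
    return delay_ms > worst_lag_ms, delay_ms - worst_lag_ms


# Hypothetical lag samples in milliseconds, for illustration only.
samples = [18.0, 25.5, 40.2, 31.7]
ok, headroom = delay_covers_lag("2000ms", samples)
print(f"delay covers worst lag: {ok}, headroom: {headroom:.1f} ms")
```

The `"2000ms"` argument mirrors the `SPICEDB_DATASTORE_FOLLOWER_READ_DELAY_DURATION` value from the reproduction steps.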

Actual Result

Example logs for a single request:

{
    "requestID": "d2aie7vv6t0moif2fshg",
    "time": "2025-08-07T22:22:23Z",
    "level": "error",
    "source": "stderr",
    "error": {
        "error": "object definition `auth/platform` not found",
        "namespace": "auth/platform"
    },
    "message": "received unexpected graph error"
}
{
    "requestID": "d2aie7vv6t0moif2fshg",
    "time": "2025-08-07T22:22:23Z",
    "level": "error",
    "source": "stderr",
    "error": {
        "error": "object definition `auth/platform` not found",
        "namespace": "auth/platform"
    },
    "message": "unexpected dispatch graph error"
}
{
    "grpc.method": "DispatchCheck",
    "grpc.method_type": "unary",
    "peer.address": "10.0.2.197:57466",
    "grpc.start_time": "2025-08-07T22:22:23Z",
    "grpc.request.deadline": "2025-08-07T22:23:23Z",
    "source": "stderr",
    "grpc.code": "Unknown",
    "protocol": "grpc",
    "grpc.error": "object definition `auth/platform` not found",
    "grpc.time_ms": 3,
    "requestID": "d2aie7vv6t0moif2fshg",
    "time": "2025-08-07T22:22:23Z",
    "level": "error",
    "traceID": "8550e95f4f8d5a26eeeb9ad8900ac67f",
    "message": "finished call",
    "grpc.component": "server",
    "grpc.service": "dispatch.v1.DispatchService"
}
{
    "grpc.start_time": "2025-08-07T22:22:23Z",
    "grpc.code": "Unknown",
    "grpc.error": "rpc error: code = Unknown desc = object definition `auth/platform` not found",
    "grpc.request.deadline": "2025-08-07T22:23:23Z",
    "time": "2025-08-07T22:22:23Z",
    "source": "stderr",
    "message": "finished call",
    "traceID": "8550e95f4f8d5a26eeeb9ad8900ac67f",
    "grpc.component": "server",
    "grpc.service": "dispatch.v1.DispatchService",
    "protocol": "grpc",
    "grpc.method_type": "unary",
    "grpc.time_ms": 7,
    "requestID": "d2aie7vv6t0moif2fshg",
    "level": "error",
    "peer.address": "10.0.2.197:57268",
    "grpc.method": "DispatchCheck"
}
{
    "grpc.start_time": "2025-08-07T22:22:23Z",
    "grpc.code": "Unknown",
    "grpc.error": "rpc error: code = Unknown desc = object definition `auth/platform` not found",
    "grpc.time_ms": 10,
    "time": "2025-08-07T22:22:23Z",
    "source": "stderr",
    "traceID": "8550e95f4f8d5a26eeeb9ad8900ac67f",
    "grpc.component": "server",
    "grpc.service": "authzed.api.v1.PermissionsService",
    "protocol": "grpc",
    "grpc.method_type": "unary",
    "message": "finished call",
    "requestID": "d2aie7vv6t0moif2fshg",
    "level": "error",
    "peer.address": "10.0.5.80:47516",
    "grpc.method": "CheckPermission"
}
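
When triaging logs like these, it can help to group entries by `requestID` to see how one failure propagates from the dispatch layer up to the API call. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the sample input is a trimmed copy of three of the entries above.

```python
import json
from collections import defaultdict


def iter_json_objects(text):
    """Yield each JSON object from a string of back-to-back objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def summarize_by_request(text):
    """Group entries by requestID as a list of (grpc method, error) pairs."""
    grouped = defaultdict(list)
    for entry in iter_json_objects(text):
        err = entry.get("grpc.error")
        if err is None:
            # Graph errors nest the message under "error" -> "error".
            nested = entry.get("error")
            err = nested.get("error") if isinstance(nested, dict) else nested
        grouped[entry.get("requestID")].append((entry.get("grpc.method", "-"), err))
    return dict(grouped)


# Trimmed copies of three entries from the logs above.
sample = """
{"requestID": "d2aie7vv6t0moif2fshg", "error": {"error": "object definition `auth/platform` not found", "namespace": "auth/platform"}, "message": "received unexpected graph error"}
{"requestID": "d2aie7vv6t0moif2fshg", "grpc.method": "DispatchCheck", "grpc.error": "object definition `auth/platform` not found"}
{"requestID": "d2aie7vv6t0moif2fshg", "grpc.method": "CheckPermission", "grpc.error": "rpc error: code = Unknown desc = object definition `auth/platform` not found"}
"""

for request_id, entries in summarize_by_request(sample).items():
    print(request_id)
    for method, err in entries:
        print(f"  {method}: {err}")
```

Because the parser walks the text object by object, it also accepts pretty-printed, multi-line entries in the form shown above.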

Maximum replication latency:

[Image: screenshot of maximum replication latency, not preserved]

We only started seeing Unknown/FailedPrecondition errors once the replica was enabled:

[Images: two screenshots, not preserved]

epbensimpson avatar Aug 07 '25 23:08 epbensimpson

Is the watching schema cache enabled?

josephschorr avatar Aug 08 '25 08:08 josephschorr

> Is the watching schema cache enabled?

With enable-experimental-watchable-schema-cache? Nope, we don't have postgres configured to enable the Watch API either

Edit: by "don't have postgres configured" I mean the Watch API is disabled; to enable it, Postgres must be run with track_commit_timestamp=on

epbensimpson avatar Aug 10 '25 21:08 epbensimpson

I've found that this .NET Client has a bug where it's setting fully_consistent=true when it should be sending minimize_latency=true 😩 I have no idea if that has any bearing on this issue, but it certainly isn't helping

Edit: I updated the Thumper script to use consistency: FullyConsistent and ran it against a test environment with read replica enabled and it reproduced the error

The script:

name: spam user#view checks
weight: 10
steps:
  - op: CheckPermission
    resource: auth/user:bob
    subject: auth/user:{{ randomObjectID }}
    permission: view
    expectNoPermission: true
    consistency: FullyConsistent
  - op: CheckPermission
    resource: auth/organisation:bob
    subject: auth/user:{{ randomObjectID }}
    permission: view
    expectNoPermission: true
    consistency: FullyConsistent
---
name: create new revisions
weight: 1
steps:
  - op: CheckPermission
    resource: auth/user:{{ randomObjectID }}
    subject: auth/user:{{ randomObjectID }}
    permission: view
    expectNoPermission: true
  - op: WriteRelationships
    updates:
      - op: TOUCH
        resource: auth/user:{{ randomObjectID }}
        subject: auth/user:{{ randomObjectID }}
        relation: this_user
  - op: CheckPermission
    resource: auth/user:{{ randomObjectID }}
    subject: auth/user:{{ randomObjectID }}
    permission: view
    consistency: FullyConsistent
  - op: WriteRelationships
    updates:
      - op: DELETE
        resource: auth/user:{{ randomObjectID }}
        subject: auth/user:{{ randomObjectID }}
        relation: this_user

Schema (simplified):

definition auth/user {
  relation this_user: auth/user

  permission view = this_user
}

definition auth/organisation {
  relation member: auth/user

  permission view = member
}

Obviously our real schema is a lot more complex, but hopefully that's not relevant for reproducibility

epbensimpson avatar Aug 12 '25 00:08 epbensimpson

@epbensimpson re dotnet client: have you considered using the official one? https://github.com/authzed/authzed-dotnet

> Edit: I updated the Thumper script to use consistency: FullyConsistent and ran it against a test environment with read replica enabled and it reproduced the error

This seems odd: read-replicas are not used when fully-consistent is used.

vroldanbet avatar Aug 25 '25 12:08 vroldanbet

> @epbensimpson re dotnet client: have you considered using the official one? https://github.com/authzed/authzed-dotnet

Yep, I switched over to that already.

epbensimpson avatar Aug 26 '25 21:08 epbensimpson

It seems I'm seeing similar behavior with a cloudnative-pg cluster in a local cluster. I've also seen this with fully-consistent requests, and we are also not using the Watch API or any experimental functionality. We are using the official Java client. After removing the read replica from SpiceDB, everything returned to normal.

pschichtel avatar Dec 15 '25 23:12 pschichtel