
Add s3 populated with CDC files as a source

Open BramGruneir opened this issue 1 year ago • 0 comments

There are two main business use cases here:

  1. Allowing a single cdc feed to push to any number of replica clusters. This would require the ability to poll or watch a bucket.
  2. Acting as a log of sorts to allow for a shorter RPO: restore from backup, then run cdc-sink from s3 up until the latest resolved timestamp. This version doesn't require any polling or watching.
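
The difference between the two modes can be sketched roughly as below. This is only an illustration in Python, not cdc-sink code: `list_keys` is a hypothetical callable standing in for whatever lists the object keys in the bucket, and the function names and the polling interval are made up for the example.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Set


def replay_once(list_keys: Callable[[], Iterable[str]]) -> List[str]:
    """Use case 2: list the bucket a single time and return the keys in order."""
    return sorted(list_keys())


def watch(
    list_keys: Callable[[], Iterable[str]],
    interval_seconds: float = 5.0,
    max_polls: Optional[int] = None,
) -> Iterator[str]:
    """Use case 1: poll the bucket repeatedly and yield each key only once.

    max_polls exists only so the sketch can terminate; None means poll forever.
    """
    seen: Set[str] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        # Sort each listing so that new keys are handed on in key order.
        for key in sorted(list_keys()):
            if key not in seen:
                seen.add(key)
                yield key
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)
```

The in-memory `seen` set is the simplest possible bookkeeping. A real watcher would more likely remember the last processed key or timestamp so that it can resume after a restart.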

Some things to note about using s3:

  • For all use cases, it should take a timestamp to start at and ignore any earlier changes.
  • In transactional mode, it should bypass the staging cluster entirely, since s3 can serve as that buffer.
    • This could be an additional improvement after the initial version is implemented.
    • It would reduce the amount of work done in the staging cluster db, but we would still need to track resolved timestamps.
    • That tracking could also be done in s3, but that seems like it would open up too many other issues.
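
A minimal sketch of the start-timestamp and resolved-timestamp bounds, in Python. The key layout is an assumption made for illustration: each object's base name starts with a fixed-width, sortable timestamp, data files look like `<timestamp>-<rest>`, and resolved markers look like `<timestamp>.RESOLVED`. The function names are hypothetical as well.

```python
from typing import Iterable, List

RESOLVED_SUFFIX = ".RESOLVED"  # assumed suffix for resolved-timestamp markers


def key_timestamp(key: str) -> str:
    """Return the leading timestamp of an object's base name."""
    base = key.rsplit("/", 1)[-1]
    if base.endswith(RESOLVED_SUFFIX):
        return base[: -len(RESOLVED_SUFFIX)]
    return base.split("-", 1)[0]


def select_files(keys: Iterable[str], start_ts: str) -> List[str]:
    """Pick data files from start_ts up to the latest resolved marker.

    Timestamps are compared as strings, which only works because they are
    assumed to be fixed-width. Files older than start_ts are ignored, and
    files at or after the latest resolved timestamp are held back.
    """
    keys = list(keys)
    resolved = [key_timestamp(k) for k in keys if k.endswith(RESOLVED_SUFFIX)]
    if not resolved:
        return []  # nothing is known to be complete yet
    latest = max(resolved)
    data = [k for k in keys if not k.endswith(RESOLVED_SUFFIX)]
    selected = [k for k in data if start_ts <= key_timestamp(k) < latest]
    return sorted(selected, key=key_timestamp)
```

This filters whole files by name only. If a file can contain rows from before the start timestamp, the rows themselves would need to be filtered too.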

The downsides of using s3 are:

  • There is the added latency of both writing to and reading from s3, so the replication lag will be higher than with a direct route.
  • This requires an extra system outside of cdc/cdc-sink, and that comes with more permissioning, security, etc.

BramGruneir, Feb 05 '24 16:02