replicator Add s3 populated with CDC files as a source

Open BramGruneir opened this issue 1 year ago • 0 comments

There are two main business use cases here:

- Allowing a single CDC feed to push to any number of replica clusters. This would require the ability to poll or watch a bucket.
- Acting as a log of sorts to allow for a shorter RPO: restore from backup, then run cdc-sink from s3 up to the latest resolved timestamp. This version doesn't require any polling or watching.
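To make the second use case concrete, here is a minimal sketch of how a one-shot replay could pick its files. It assumes a hypothetical key layout in which each data file name starts with a fixed-width timestamp followed by `-`, and each resolved marker is named `<timestamp>.RESOLVED`; the names `key_timestamp` and `plan_replay` are made up for illustration and are not part of cdc-sink. The list of keys would come from a single bucket listing, so no polling is involved.

```python
import posixpath
from typing import List, Optional, Tuple

RESOLVED_SUFFIX = ".RESOLVED"  # assumed suffix for resolved-timestamp markers


def key_timestamp(key: str) -> str:
    """Return the fixed-width timestamp at the start of an object's file name."""
    name = posixpath.basename(key)
    if name.endswith(RESOLVED_SUFFIX):
        return name[: -len(RESOLVED_SUFFIX)]
    return name.split("-", 1)[0]


def plan_replay(keys: List[str], start_ts: str) -> Tuple[Optional[str], List[str]]:
    """Choose the data files to apply after a restore taken at start_ts.

    Returns (latest resolved timestamp, data keys in apply order), keeping
    only files with start_ts < timestamp <= latest resolved timestamp.
    Timestamps are compared as strings, which is why they must be fixed-width.
    """
    resolved = [key_timestamp(k) for k in keys if k.endswith(RESOLVED_SUFFIX)]
    if not resolved:
        # Nothing is known to be complete yet, so nothing is safe to apply.
        return None, []
    upper = max(resolved)
    data = [
        k
        for k in keys
        if not k.endswith(RESOLVED_SUFFIX) and start_ts < key_timestamp(k) <= upper
    ]
    return upper, sorted(data, key=key_timestamp)
```

Files newer than the latest resolved marker are deliberately left out, since they may still be incomplete. A real implementation would likely also need to filter individual rows by their own timestamps rather than relying on file names alone.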

So some things to note about using s3:

- For all use cases, it should take a timestamp to start at and ignore anything earlier.
- In transactional mode, it should bypass the staging cluster entirely, as we can use s3 as that buffer.
  - This could be an additional improvement after the initial version is implemented.
  - It would also reduce the amount of work done in the staging cluster database, though we would still need to track resolved timestamps.
  - That tracking could also be done in s3, but that seems like it would open up too many other issues.
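As a rough illustration of using s3 as the buffer in transactional mode, the sketch below groups data files into windows bounded by resolved timestamps, skipping anything at or before the starting timestamp. It assumes the same hypothetical key layout as before (a fixed-width timestamp prefix on data files, and `<timestamp>.RESOLVED` markers); `group_by_resolved` is an invented name, not an existing cdc-sink function.

```python
import bisect
import posixpath
from typing import Dict, List

RESOLVED_SUFFIX = ".RESOLVED"  # assumed suffix for resolved-timestamp markers


def key_timestamp(key: str) -> str:
    """Return the fixed-width timestamp at the start of an object's file name."""
    name = posixpath.basename(key)
    if name.endswith(RESOLVED_SUFFIX):
        return name[: -len(RESOLVED_SUFFIX)]
    return name.split("-", 1)[0]


def group_by_resolved(keys: List[str], start_ts: str) -> Dict[str, List[str]]:
    """Bucket data files under the first resolved timestamp at or after them.

    Files at or before start_ts are ignored. Files newer than the latest
    resolved timestamp are left out because they may still be incomplete.
    """
    resolved = sorted(
        {
            key_timestamp(k)
            for k in keys
            if k.endswith(RESOLVED_SUFFIX) and key_timestamp(k) > start_ts
        }
    )
    # Dicts keep insertion order, so windows come back oldest first.
    windows: Dict[str, List[str]] = {ts: [] for ts in resolved}
    data = sorted(
        (k for k in keys if not k.endswith(RESOLVED_SUFFIX)), key=key_timestamp
    )
    for k in data:
        ts = key_timestamp(k)
        if ts <= start_ts:
            continue
        # Index of the first resolved timestamp that is >= this file's timestamp.
        i = bisect.bisect_left(resolved, ts)
        if i < len(resolved):
            windows[resolved[i]].append(k)
    return windows
```

Each window could then be applied to the target in one transaction, with the window's resolved timestamp recorded afterwards as the next starting point. Where that checkpoint lives (the staging cluster or s3) is the open question raised above.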

The downsides of using s3 are:

- There is the added latency of both writing to and reading from s3, so the replication lag will be higher than with a direct route.
- This requires an extra system outside of cdc/cdc-sink, which comes with more permissioning, security, etc.

Feb 05 '24 16:02 BramGruneir