Add s3 populated with CDC files as a source
There are two main business use cases here:
- Allowing a single cdc feed to push to any number of replica clusters. This would require the ability to poll or watch a bucket.
- Acting as a log of sorts to allow for a shorter RPO. Restore from backup, then run cdc-sink from s3 up to the latest resolved timestamp. This version doesn't require any polling or watching.
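To make the polling requirement in the first use case concrete, here is a minimal sketch of a poller that only hands back keys it has not seen before. It is illustrative only: `BucketPoller` and `list_keys` are hypothetical names, `list_keys` stands in for whatever bucket-listing call would really be used, and it assumes keys sort in timestamp order so that "newer" simply means "sorts after the last key seen".

```python
from typing import Callable, Iterable, List


class BucketPoller:
    """Returns only the keys that appeared since the previous poll."""

    def __init__(self, list_keys: Callable[[], Iterable[str]]) -> None:
        # list_keys is a hypothetical stand-in for a real bucket listing call.
        self._list_keys = list_keys
        # Assumption: keys sort lexicographically in timestamp order.
        self._last_seen = ""

    def poll(self) -> List[str]:
        # Keep everything that sorts after the newest key already handed out.
        new_keys = sorted(k for k in self._list_keys() if k > self._last_seen)
        if new_keys:
            self._last_seen = new_keys[-1]
        return new_keys


# Usage with an in-memory list standing in for the s3 bucket.
bucket = ["0001-a.ndjson", "0002-a.ndjson"]
poller = BucketPoller(lambda: list(bucket))
first = poller.poll()           # both existing files
bucket.append("0003-a.ndjson")
second = poller.poll()          # only the newly written file
```

Each replica cluster would keep its own high-water mark, so one feed can serve any number of them. Note that this sketch would miss a file that shows up late with a key that sorts before the high-water mark, so a real version would need to handle that case.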
So some things to note about using s3:
- For all use cases, it should take a timestamp to start at and ignore anything earlier than that.
- In transactional mode, it should bypass the staging cluster entirely, since we can use s3 as that buffer.
  - This could be an additional improvement after the initial version is implemented.
  - It would also reduce the amount of work done in the staging cluster db, but we would still need to track resolved timestamps.
  - That tracking could also be done in s3, but that seems like it would open up too many other issues.
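As a rough sketch of the start-timestamp and resolved-timestamp behavior described above, the following picks the files to replay from a bucket listing. The key layout is an assumption made for illustration (`<timestamp>-<rest>.ndjson` for data files and `<timestamp>.RESOLVED` for resolved markers, with fixed-width timestamps that compare correctly as strings), and the function names are hypothetical rather than anything cdc-sink provides.

```python
from typing import Iterable, List, Optional

RESOLVED_SUFFIX = ".RESOLVED"  # assumed suffix for resolved-timestamp markers


def key_timestamp(key: str) -> str:
    """Extract the leading timestamp from the assumed key layout."""
    name = key.rsplit("/", 1)[-1]  # drop any directory-style prefix
    if name.endswith(RESOLVED_SUFFIX):
        return name[: -len(RESOLVED_SUFFIX)]
    return name.split("-", 1)[0]


def latest_resolved(keys: Iterable[str]) -> Optional[str]:
    """Return the newest resolved timestamp in the listing, if any."""
    stamps = [key_timestamp(k) for k in keys if k.endswith(RESOLVED_SUFFIX)]
    return max(stamps, default=None)


def files_to_replay(keys: Iterable[str], start_ts: str) -> List[str]:
    """Data files at or after start_ts and no later than the latest resolved."""
    keys = list(keys)
    upper = latest_resolved(keys)
    if upper is None:
        return []  # nothing has been resolved yet, so nothing is safe to apply
    data = [k for k in keys if not k.endswith(RESOLVED_SUFFIX)]
    selected = [k for k in data if start_ts <= key_timestamp(k) <= upper]
    return sorted(selected, key=key_timestamp)


listing = ["100-t.ndjson", "200-t.ndjson", "250.RESOLVED", "300-t.ndjson"]
replay = files_to_replay(listing, "200")
```

This only filters at the file level. Rows inside a selected file may still need their own timestamps checked against the start and resolved bounds.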
The downsides of using s3 are:
- There is the added latency of both writing to and reading from s3, so the replication lag will be higher than with a direct route.
- This requires an extra system outside of cdc/cdc-sink, which comes with more permissions, security concerns, etc.