Granular status tracking
Track snapshot, sync, normalize, and slot lag status in a granular way: sync and normalize can fail and recover independently, and QRep runs and partitions can fail and recover independently. These granular statuses will eventually be rolled up into a mirror-level status:degraded.
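As a rough illustration of the roll-up idea (a minimal sketch; `ComponentStatus`, `MirrorComponents`, and `RollUp` are hypothetical names, not the actual PeerDB types): each component keeps its own status, and the mirror only reports degraded while some component is failing.

```go
// Illustrative sketch only; these are hypothetical types, not PeerDB's API.
package status

type ComponentStatus int

const (
	StatusRunning ComponentStatus = iota
	StatusFailing
)

// MirrorComponents holds the per-component statuses tracked granularly.
type MirrorComponents struct {
	Snapshot  ComponentStatus
	Sync      ComponentStatus
	Normalize ComponentStatus
	SlotLag   ComponentStatus
}

// RollUp reports the mirror as degraded if any single component is failing,
// so sync and normalize (or individual QRep runs/partitions) can fail and
// recover without affecting each other's status.
func RollUp(c MirrorComponents) string {
	for _, s := range []ComponentStatus{c.Snapshot, c.Sync, c.Normalize, c.SlotLag} {
		if s == StatusFailing {
			return "degraded"
		}
	}
	return "running"
}
```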
Every error during the snapshot and in the running CDC flow is now reported to flow_errors. Alerter logging calls are unified at the top-level activity rather than being sprinkled across the code.
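A minimal sketch of that "report once, at the top of the activity" pattern, assuming a pgx catalog pool and an alerter with a `LogFlowError` method (the struct, method names, and flow_errors columns below are illustrative assumptions, not the actual PeerDB code):

```go
package activities

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Alerter is an assumed interface for whatever sends alerts.
type Alerter interface {
	LogFlowError(ctx context.Context, flowName string, err error)
}

type FlowActivity struct {
	CatalogPool *pgxpool.Pool
	Alerter     Alerter
}

// SyncFlow runs one sync step and reports any failure at this single top
// level, instead of sprinkling alerter calls across the inner code paths.
func (a *FlowActivity) SyncFlow(ctx context.Context, flowName string, run func(context.Context) error) error {
	if err := run(ctx); err != nil {
		// Record the error to flow_errors (column names assumed here) ...
		if _, insertErr := a.CatalogPool.Exec(ctx,
			`INSERT INTO flow_errors (flow_name, error_message, error_type)
			 VALUES ($1, $2, $3)`,
			flowName, err.Error(), "error"); insertErr != nil {
			log.Printf("failed to record flow error: %v", insertErr)
		}
		// ... and alert exactly once, at the activity level.
		a.Alerter.LogFlowError(ctx, flowName, err)
		return err
	}
	return nil
}
```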
Todo:
- [ ] Add the extra lookup fields and indices into flow_errors
- [ ] Double-check that PG writes are reliable
- [ ] Report the new status and errors in MirrorStatus
- [ ] Double-check the lifecycles of status values
- [ ] What happens when something is erroring out and then a signal to switch arrives: will there be a blip of stale status?
- [ ] Why are we treating ApplicationErrors and the replState changed / slot is already active cases in a special way?
- [ ] Either log internal errors as internal errors and remove _is_internal_error, or filter them out and use the field
- [ ] The slot lag threshold should come from the individual overrides in the catalog
- [ ] Thresholding like IMR does: either move it over or integrate it
- [ ] Testing