
Granular status tracking

Open • ilidemi opened this issue 5 months ago • 0 comments

Track snapshot/sync/normalize/slot-lag status in a granular way: sync and normalize can fail and recover independently, and QRep runs and partitions can fail and recover independently. These will eventually be rolled up into a mirror-level `degraded` status.
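
As a rough illustration of the intended rollup, here is a minimal sketch; all identifiers are hypothetical, not from the PeerDB codebase. The point is that each component carries its own status, so a sync failure does not mask a healthy normalize step:

```go
package status

// ComponentStatus is the granular status of one component of a mirror.
// (Hypothetical type; not from the PeerDB codebase.)
type ComponentStatus int

const (
	StatusHealthy ComponentStatus = iota
	StatusFailing
	StatusRecovering
)

// MirrorComponents tracks snapshot, sync, normalize, and slot lag
// independently, since each can fail and recover on its own.
type MirrorComponents struct {
	Snapshot  ComponentStatus
	Sync      ComponentStatus
	Normalize ComponentStatus
	SlotLag   ComponentStatus
}

// RollUp reduces the granular statuses to a mirror-level status:
// any unhealthy component degrades the mirror as a whole.
func (m MirrorComponents) RollUp() string {
	for _, s := range []ComponentStatus{m.Snapshot, m.Sync, m.Normalize, m.SlotLag} {
		if s != StatusHealthy {
			return "degraded"
		}
	}
	return "running"
}
```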

Every error in the snapshot and in the running CDC flow is now reported to flow_errors. Alerter logging calls are unified at the top-level activity rather than sprinkled across the code.
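
A minimal sketch of what that unification could look like, using hypothetical stand-ins for the activity and alerter types (the real signatures may differ):

```go
package activities

import "context"

// Alerter is a stand-in for PeerDB's alerting component; LogFlowError is
// assumed to persist the error to flow_errors and fire any alerts.
type Alerter interface {
	LogFlowError(ctx context.Context, flowName string, err error)
}

// FlowableActivity is a stand-in for the top-level Temporal activity struct.
type FlowableActivity struct {
	Alerter Alerter
}

// SyncFlow wraps the actual sync work so that errors are reported at exactly
// one place, the activity boundary, rather than from each helper/connector.
func (a *FlowableActivity) SyncFlow(ctx context.Context, flowName string) error {
	if err := a.syncFlow(ctx, flowName); err != nil {
		a.Alerter.LogFlowError(ctx, flowName, err) // single reporting point
		return err
	}
	return nil
}

func (a *FlowableActivity) syncFlow(ctx context.Context, flowName string) error {
	// ... the real sync work would go here ...
	return nil
}
```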

Todo:

  • [ ] Add the extra lookup fields and indices to flow_errors (a sketch follows this list)
  • [ ] Double-check that PG writes are reliable
  • [ ] Report the new status and errors in MirrorStatus
  • [ ] Double-check the lifecycles of status values
  • [ ] Determine what happens when something is erroring out and a signal to switch arrives: will there be a blip of stale status?
  • [ ] Investigate why ApplicationErrors and the replState changed / slot is already active errors are special-cased
  • [ ] Either log internal errors as internal errors and remove _is_internal_error, or filter them out and use the field
  • [ ] Source the slot lag threshold from the individual overrides in the catalog
  • [ ] Add thresholding like IMR does: either move it over or integrate it
  • [ ] Testing
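
For the first item, a minimal sketch of what the extra lookup fields and indices could look like, assuming the catalog table is peerdb_stats.flow_errors and using illustrative column and index names (the real migration may differ):

```go
package migrations

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// AddFlowErrorsLookupFields adds an extra lookup column and a supporting
// index so errors can be filtered per flow and ordered by time cheaply.
func AddFlowErrorsLookupFields(ctx context.Context, pool *pgxpool.Pool) error {
	stmts := []string{
		// hypothetical column distinguishing snapshot/sync/normalize/slot errors
		`ALTER TABLE peerdb_stats.flow_errors
			ADD COLUMN IF NOT EXISTS error_source TEXT`,
		// hypothetical index for per-mirror status lookups
		`CREATE INDEX IF NOT EXISTS idx_flow_errors_flow_name_time
			ON peerdb_stats.flow_errors (flow_name, error_timestamp DESC)`,
	}
	for _, stmt := range stmts {
		if _, err := pool.Exec(ctx, stmt); err != nil {
			return err
		}
	}
	return nil
}
```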

ilidemi • Jul 25 '25 10:07