
Granular status tracking

Open • ilidemi opened this issue 5 months ago • 0 comments

Track snapshot/sync/normalize/slot-lag status in a granular way: sync and normalize can fail and recover independently, and QRep runs and partitions can fail and recover independently. These will eventually be rolled up into a mirror-level `degraded` status.
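
As a rough illustration of the intended rollup, here is a minimal sketch; all identifiers are hypothetical, not from the PeerDB codebase. The point is that each component carries its own status, so a sync failure does not mask a healthy normalize step:

```go
package status

// ComponentStatus is the granular status of one component of a mirror.
// (Hypothetical type; not from the PeerDB codebase.)
type ComponentStatus int

const (
	StatusHealthy ComponentStatus = iota
	StatusFailing
	StatusRecovering
)

// MirrorComponents tracks snapshot, sync, normalize, and slot lag
// independently, since each can fail and recover on its own.
type MirrorComponents struct {
	Snapshot  ComponentStatus
	Sync      ComponentStatus
	Normalize ComponentStatus
	SlotLag   ComponentStatus
}

// RollUp reduces the granular statuses to a mirror-level status:
// any unhealthy component degrades the mirror as a whole.
func (m MirrorComponents) RollUp() string {
	for _, s := range []ComponentStatus{m.Snapshot, m.Sync, m.Normalize, m.SlotLag} {
		if s != StatusHealthy {
			return "degraded"
		}
	}
	return "running"
}
```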

Every error in the snapshot and in the running CDC flow is now reported to flow_errors. Alerter logging calls are unified at the top-level activity rather than sprinkled across the code.
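
A minimal sketch of what that unification could look like, using hypothetical stand-ins for the activity and alerter types (the real signatures may differ):

```go
package activities

import "context"

// Alerter is a stand-in for PeerDB's alerting component; LogFlowError is
// assumed to persist the error to flow_errors and fire any alerts.
type Alerter interface {
	LogFlowError(ctx context.Context, flowName string, err error)
}

// FlowableActivity is a stand-in for the top-level Temporal activity struct.
type FlowableActivity struct {
	Alerter Alerter
}

// SyncFlow wraps the actual sync work so that errors are reported at exactly
// one place, the activity boundary, rather than from each helper/connector.
func (a *FlowableActivity) SyncFlow(ctx context.Context, flowName string) error {
	if err := a.syncFlow(ctx, flowName); err != nil {
		a.Alerter.LogFlowError(ctx, flowName, err) // single reporting point
		return err
	}
	return nil
}

func (a *FlowableActivity) syncFlow(ctx context.Context, flowName string) error {
	// ... the real sync work would go here ...
	return nil
}
```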

Todo:

  • [ ] Add the extra lookup fields and indices to flow_errors (a sketch follows this list)
  • [ ] Double-check that PG writes are reliable
  • [ ] Report the new status and errors in MirrorStatus
  • [ ] Double-check the lifecycles of status values
  • [ ] Determine what happens when something is erroring out and a signal to switch arrives: will there be a blip of stale status?
  • [ ] Investigate why ApplicationErrors and the replState changed / slot is already active errors are special-cased
  • [ ] Either log internal errors as internal errors and remove _is_internal_error, or filter them out and use the field
  • [ ] Source the slot lag threshold from the individual overrides in the catalog
  • [ ] Add thresholding like IMR does: either move it over or integrate it
  • [ ] Testing
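
For the first item, a minimal sketch of what the extra lookup fields and indices could look like, assuming the catalog table is peerdb_stats.flow_errors and using illustrative column and index names (the real migration may differ):

```go
package migrations

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// AddFlowErrorsLookupFields adds an extra lookup column and a supporting
// index so errors can be filtered per flow and ordered by time cheaply.
func AddFlowErrorsLookupFields(ctx context.Context, pool *pgxpool.Pool) error {
	stmts := []string{
		// hypothetical column distinguishing snapshot/sync/normalize/slot errors
		`ALTER TABLE peerdb_stats.flow_errors
			ADD COLUMN IF NOT EXISTS error_source TEXT`,
		// hypothetical index for per-mirror status lookups
		`CREATE INDEX IF NOT EXISTS idx_flow_errors_flow_name_time
			ON peerdb_stats.flow_errors (flow_name, error_timestamp DESC)`,
	}
	for _, stmt := range stmts {
		if _, err := pool.Exec(ctx, stmt); err != nil {
			return err
		}
	}
	return nil
}
```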

ilidemi • Jul 25 '25 10:07