Alerting for System Failures
Background
Our infrastructure has various subsystems and components that can fail and need to be monitored. We need to be alerted immediately if any of these systems goes down so we can act on it right away.
Users should be shielded from these errors, NOT serve as our means of detecting them.
Impact
We don't have clear metrics around this, but we were slow to identify and respond to some SEVs this past quarter.
Example: Post-mortem, Arbitrum Bridge stopped sending transactions (March 11)
Rahul was monitoring Discord and noticed user complaints about slow transactions at around 12pm Dubai time.
This impacts Retention OKRs around Transfer Success Rate, and likely also Retention Rate and NPS.
Proposed Solution
Initial spec: https://www.notion.so/connext/Alerting-03e2df6aefbd4307a0c8a3983118da89?pvs=4
Also see: https://github.com/connext/monorepo/issues/2953
Divide into smaller tasks and convert into an epic. First main task: alert on slow path delays.
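As a starting point for the slow path delay alert, here is a minimal sketch of the core check, assuming we can query a list of pending slow-path transfers with their initiation timestamps. The `PendingTransfer` shape, the `findDelayedTransfers` helper, and the threshold value are all hypothetical and not taken from the spec or the existing codebase; the real data source and alert delivery (Discord, pager, etc.) would need to be wired in separately.

```typescript
// Hypothetical shape of a pending slow-path transfer; NOT the actual schema.
interface PendingTransfer {
  transferId: string;
  originDomain: string;
  // Unix timestamp (seconds) at which the transfer was initiated.
  initiatedAt: number;
}

// Hypothetical alert payload produced for each delayed transfer.
interface DelayAlert {
  transferId: string;
  originDomain: string;
  delaySeconds: number;
}

// Return one alert for every transfer pending strictly longer than the threshold.
function findDelayedTransfers(
  pending: PendingTransfer[],
  nowSeconds: number,
  thresholdSeconds: number,
): DelayAlert[] {
  return pending
    .map((t) => ({
      transferId: t.transferId,
      originDomain: t.originDomain,
      delaySeconds: nowSeconds - t.initiatedAt,
    }))
    .filter((a) => a.delaySeconds > thresholdSeconds);
}

// Example with placeholder data: only the first transfer exceeds the
// (assumed) one-hour threshold.
const exampleAlerts = findDelayedTransfers(
  [
    { transferId: "0xabc", originDomain: "domain-a", initiatedAt: 5000 },
    { transferId: "0xdef", originDomain: "domain-b", initiatedAt: 9000 },
  ],
  10000, // current time in seconds (placeholder)
  3600, // threshold in seconds (assumption, to be tuned)
);
console.log(exampleAlerts);
```

Keeping the check as a pure function of the pending list, the current time, and the threshold makes it easy to unit test and to run on a schedule against whatever data source we settle on.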
We still need this