Alerting for System Failures

Open LayneHaber opened this issue 2 years ago • 11 comments

Background

Our infrastructure has various subsystems and components that can fail and need to be monitored. We need to be alerted immediately if any of these systems goes down so we can act on it right away.

Users should be shielded from these errors, NOT be our means of identifying them.

Impact

We don't have clear metrics around this, but we were slow to identify and respond to some SEVs this past quarter.

Example: Post-mortem for the Arbitrum Bridge, which stopped sending transactions (March 11)

Rahul was monitoring Discord and noticed user complaints about slow txs at around 12pm Dubai time.

This impacts the Retention OKRs around Transfer Success Rate, and likely also Retention Rate and NPS.

Proposed Solution

https://www.notion.so/connext/Alerting-03e2df6aefbd4307a0c8a3983118da89?pvs=4

LayneHaber avatar Mar 15 '23 14:03 LayneHaber

Initial spec: https://www.notion.so/connext/Alerting-03e2df6aefbd4307a0c8a3983118da89?pvs=4

rhlsthrm avatar Mar 21 '23 11:03 rhlsthrm

Complexity: 21 See the discussion in Backlog Prio 1

Powered by Parabol

alexwhte avatar Mar 30 '23 14:03 alexwhte

Impact: 3 See the discussion in Backlog Prio 1

alexwhte avatar Mar 30 '23 14:03 alexwhte

Urgency: 3 See the discussion in Backlog Prio 1

alexwhte avatar Mar 30 '23 14:03 alexwhte

Complexity: 13 See the discussion in Sprint Poker #22

alexwhte avatar Apr 03 '23 14:04 alexwhte

Also see: https://github.com/connext/monorepo/issues/2953

preethamr avatar Apr 12 '23 15:04 preethamr

Complexity: 13 See the discussion in Sprint Poker #29

alexwhte avatar Apr 13 '23 16:04 alexwhte

Impact: 4 See the discussion in Sprint Poker #30

alexwhte avatar Apr 13 '23 17:04 alexwhte

Urgency: 2 See the discussion in Sprint Poker #30

alexwhte avatar Apr 13 '23 17:04 alexwhte

Divide this into smaller tasks and convert it into an epic. First main task: alert on slow path delays.

preethamr avatar May 03 '23 15:05 preethamr

We still need this

preethamr avatar Feb 08 '24 05:02 preethamr