[core] Alerting when services go down

Open • AquiGorka opened this issue 1 year ago • 1 comment

Figure out the best way for us to learn when the solver, job creator, and/or chain services stop working, whatever the reason.

AquiGorka, May 22 '24 14:05

We have started this effort with:

  • [x] EC2 status checks
  • [x] Cloudflare tunnel alerts when the tunnel goes down

Next steps:

  • [ ] Implement Cloudflare tunnel alerts in OpenTofu
  • [ ] Add a periodic cowsay job to check network liveness
  • [ ] Add alerts from our observability stack once it is live

bgins, Jul 01 '24 21:07
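
For illustration, the periodic cowsay liveness check listed under next steps could look roughly like the sketch below. This is not taken from the project. The names `check_liveness`, `run_periodically`, and `LivenessResult` are hypothetical, the job command is passed in rather than assumed, and the `alert` callback stands in for whatever alerting channel ends up being used.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class LivenessResult:
    ok: bool      # True when the job ran and looked healthy
    detail: str   # Human-readable reason, useful in an alert message


def check_liveness(cmd, timeout_s=300.0, expect=None):
    """Run one job command and report whether the network answered.

    cmd is the full argument list for the job (hypothetical: whatever
    invocation submits a cowsay job in your deployment).
    expect is an optional substring that must appear in stdout.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return LivenessResult(False, f"timed out after {timeout_s}s")
    except OSError as exc:
        # Covers a missing binary or a permissions problem.
        return LivenessResult(False, f"could not start command: {exc}")

    if proc.returncode != 0:
        return LivenessResult(
            False, f"exit code {proc.returncode}: {proc.stderr.strip()}"
        )
    if expect is not None and expect not in proc.stdout:
        return LivenessResult(False, "output did not contain expected text")
    return LivenessResult(True, "ok")


def run_periodically(cmd, interval_s, alert, iterations=None, **kwargs):
    """Call check_liveness every interval_s seconds.

    alert is any callable taking a message string (assumption: it would
    forward to the team's real alerting channel). iterations=None loops
    forever; a number bounds the loop, which is handy for testing.
    """
    done = 0
    while iterations is None or done < iterations:
        result = check_liveness(cmd, **kwargs)
        if not result.ok:
            alert(result.detail)
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval_s)
```

A scheduler such as cron or a systemd timer could call `check_liveness` once per run instead of using `run_periodically`. Either way, the timeout matters: a job that hangs is as much a liveness failure as one that exits with an error.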