b.echo.th.ooni.io possibly down for 8 hours
Impact: 8h 50m downtime of b.echo.th.ooni.io test helper (?)
Detection: CPUHigh alert with expected 8h delay
Timeline UTC:
17 Nov 07:30 CPU spikes to 100%; that's the accept() vs. EMFILE busy loop (see the sketch after the timeline)
17 Nov 15:34 CPUHigh alert firing
17 Nov 15:48 @darkk logs into the VM, confirms 100% CPU
17 Nov 16:14 @darkk logs into the VM, looks at oonib, reboots the VM
17 Nov 16:20 everything recovers to normal
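
For context, a minimal sketch of why an FD leak ends up as 100% CPU. This is illustrative Python, not the actual oonib/Twisted code: once the process is out of file descriptors, the listening socket stays readable, accept() keeps failing with EMFILE, and the event loop spins without ever draining the pending connection.

```python
import errno
import select
import socket

# Illustration only (not oonib code): a select()-based accept loop that
# busy-loops at 100% CPU once the process hits its file-descriptor limit.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 57002))  # same port as the TCPEchoHelper, for flavour
srv.listen(128)

while True:
    readable, _, _ = select.select([srv], [], [])
    for sock in readable:
        try:
            conn, _addr = sock.accept()
        except OSError as exc:
            if exc.errno == errno.EMFILE:
                # select() reports the listener readable again immediately,
                # so without closing something or backing off this loop
                # burns a full CPU core.
                continue
            raise
        conn.close()
```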
What went well:
- resource utilisation alerts are actually useful!
What went wrong:
- oonib was slowly leaking sockets at port tcp/57002 (TCPEchoHelper)
- 1004 connections were enough to kill the daemon; they came from 395 distinct IPs, only 99 IPs had more than one connection, only 17 had more than 10, and the top 5 IPs had {55, 52, 33, 32, 32} connections (see the counting sketch after this list)
- `status` was reporting nothing for the init script, `reboot` was an "easy" way to restart the service:

```
ooni-backend Status
Listing all oonib procs
No running oonib procs
```
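
A minimal sketch of how the per-IP breakdown above could be reproduced on the host, assuming iproute2's `ss` is available; this is not necessarily the exact command used during the incident.

```python
from collections import Counter
import subprocess

# Count established connections to the TCPEchoHelper port per remote IP.
out = subprocess.run(
    ["ss", "-tn", "state", "established", "sport = :57002"],
    capture_output=True, text=True, check=True,
).stdout

per_ip = Counter()
for line in out.splitlines()[1:]:   # skip the header line
    fields = line.split()
    if fields:
        peer = fields[-1]           # remote "addr:port" is the last column
        per_ip[peer.rsplit(":", 1)[0]] += 1

print("total connections:", sum(per_ip.values()))
print("distinct IPs:", len(per_ip))
print("IPs with >1 connection:", sum(1 for n in per_ip.values() if n > 1))
print("IPs with >10 connections:", sum(1 for n in per_ip.values() if n > 10))
print("top 5 IPs:", per_ip.most_common(5))
```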
What is still unclear:
- was the service actually down? It seems it should have been, but no other alerts besides CPUHigh were triggered
What could be done to prevent relapse and decrease impact:
- [ ] increase FD limit
- [ ] preventive restart (?)
- [ ] add TCP_KEEPALIVE with low timeout values for the endpoints (?) (see the sketch after this list)
- [ ] monitoring for the service itself besides TCP port check (?)
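
A minimal sketch of the FD-limit and keepalive items, assuming a Linux host and a Python service; the limit and timeout values are placeholders, not what was actually deployed.

```python
import resource
import socket

# Raise the soft file-descriptor limit up to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

def enable_keepalive(sock, idle=60, interval=15, probes=4):
    """Enable TCP keepalive with low timeouts so dead peers are reaped
    after roughly idle + interval * probes seconds.
    The TCP_KEEP* options are Linux-specific; values here are assumptions."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```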
Relapse. Timeline UTC:
14 Feb 22:50 CPU spikes to 100%
15 Feb 08:15 everything recovers
Relapse. Timeline UTC:
2019-05-03T17:29:30Z CPU spikes
2019-05-04T01:31:00Z alert fires
2019-05-04T07:38:00Z @bassosimone notices and asks for guidance
2019-05-04T09:28:00Z @darkk suggests searching for issues in this repo
2019-05-04T10:14:00Z issue has been found; incident still ongoing
2019-05-04T10:18:00Z @bassosimone reboots the machine; `top` is happier
2019-05-04T10:22:00Z alerts are resolved