b.echo.th.ooni.io possibly down for 8 hours
Impact: 8h 50m downtime of b.echo.th.ooni.io test helper (?)
Detection: CPUHigh alert with expected 8h delay
Timeline UTC:
17 Nov 07:30 CPU spikes to 100%; that's the accept() vs. EMFILE busy loop (see the sketch after the timeline)
17 Nov 15:34 CPUHigh alert firing
17 Nov 15:48 @darkk logs into the VM, confirms 100% CPU
17 Nov 16:14 @darkk logs into the VM, looks at oonib, reboots the VM
17 Nov 16:20 everything recovers to normal
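
For context, a minimal sketch of why an FD leak ends up as 100% CPU. This is illustrative Python, not the actual oonib/Twisted code: once the process is out of file descriptors, the listening socket stays readable, accept() keeps failing with EMFILE, and the event loop spins without ever draining the pending connection.

```python
import errno
import select
import socket

# Illustration only (not oonib code): a select()-based accept loop that
# busy-loops at 100% CPU once the process hits its file-descriptor limit.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 57002))  # same port as the TCPEchoHelper, for flavour
srv.listen(128)

while True:
    readable, _, _ = select.select([srv], [], [])
    for sock in readable:
        try:
            conn, _addr = sock.accept()
        except OSError as exc:
            if exc.errno == errno.EMFILE:
                # select() reports the listener readable again immediately,
                # so without closing something or backing off this loop
                # burns a full CPU core.
                continue
            raise
        conn.close()
```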
What went well:
- resource utilisation alerts are actually useful!
What went wrong:
- oonib was slowly leaking sockets at port tcp/57002 (TCPEchoHelper)
- 1004 connections were enough to kill the daemon; they came from 395 distinct IPs, only 99 IPs had more than one connection, only 17 had more than 10, and the top 5 IPs had {55, 52, 33, 32, 32} connections (see the counting sketch after this list)
- `status` was reporting nothing for the init script, `reboot` was an "easy" way to restart the service:

```
ooni-backend Status
Listing all oonib procs
No running oonib procs
```
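
A minimal sketch of how the per-IP breakdown above could be reproduced on the host, assuming iproute2's `ss` is available; this is not necessarily the exact command used during the incident.

```python
from collections import Counter
import subprocess

# Count established connections to the TCPEchoHelper port per remote IP.
out = subprocess.run(
    ["ss", "-tn", "state", "established", "sport = :57002"],
    capture_output=True, text=True, check=True,
).stdout

per_ip = Counter()
for line in out.splitlines()[1:]:   # skip the header line
    fields = line.split()
    if fields:
        peer = fields[-1]           # remote "addr:port" is the last column
        per_ip[peer.rsplit(":", 1)[0]] += 1

print("total connections:", sum(per_ip.values()))
print("distinct IPs:", len(per_ip))
print("IPs with >1 connection:", sum(1 for n in per_ip.values() if n > 1))
print("IPs with >10 connections:", sum(1 for n in per_ip.values() if n > 10))
print("top 5 IPs:", per_ip.most_common(5))
```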
What is still unclear:
- was the service actually down? It seems it should have been, but no other alerts besides CPUHigh were triggered
What could be done to prevent relapse and decrease impact:
- [ ] increase FD limit
- [ ] preventive restart (?)
- [ ] add TCP_KEEPALIVE with low timeout values for the endpoints (?) (see the sketch after this list)
- [ ] monitoring for the service itself besides TCP port check (?)
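
A minimal sketch of the FD-limit and keepalive items, assuming a Linux host and a Python service; the limit and timeout values are placeholders, not what was actually deployed.

```python
import resource
import socket

# Raise the soft file-descriptor limit up to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

def enable_keepalive(sock, idle=60, interval=15, probes=4):
    """Enable TCP keepalive with low timeouts so dead peers are reaped
    after roughly idle + interval * probes seconds.
    The TCP_KEEP* options are Linux-specific; values here are assumptions."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```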
Relapse. Timeline UTC:
14 Feb 22:50 CPU spikes to 100%
15 Feb 08:15 everything recovers
Relapse. Timeline UTC:
2019-05-03T17:29:30Z CPU spikes
2019-05-04T01:31:00Z alert fires
2019-05-04T07:38:00Z @bassosimone notices and asks for guidance
2019-05-04T09:28:00Z @darkk suggests searching for issues in this repo
2019-05-04T10:14:00Z issue has been found; incident still ongoing
2019-05-04T10:18:00Z @bassosimone reboots the machine; `top` is happier
2019-05-04T10:22:00Z alerts are resolved