b.web-connectivity.th down for 7.6 hours

darkk opened this issue 7 years ago · 4 comments

Impact: TBD (AFAIK clients should use another test helper if one of them is down, but that may not be the case).

Detection: repeated email alert

Timeline (UTC):

00:00 four periodic jobs start: certbot@cron, certbot@systemd, munin plugins apt update, /data/b.web-connectivity.th.ooni.io/update-bouncer.py
00:00 2017-07-27 00:00:03,678:DEBUG:certbot.storage:Should renew, less than 30 days before certificate expiry 2017-08-25 23:01:00 UTC.
00:00 2017-07-27 00:00:03,678:INFO:certbot.hooks:Running pre-hook command: docker stop ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:03+0000 [-] Received SIGTERM, shutting down.
00:00 Jul 27 00:00:13 b dockerd[366]: time="2017-07-27T00:00:13.717964674Z" level=info msg="Container 6a4379167c880b295f7383d6eab8fc7b9e422ac1b0e6df0ab5cfefa2524fd512 failed to exit within 10 seconds of signal 15 - using the force"
00:00 2017-07-27 00:00:32,230:INFO:certbot.hooks:Running post-hook command: cp ... && docker start ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:34.710428510Z Another twistd server is running, PID 1
00:01 [FIRING] Instance https://b.web-connectivity.th.ooni.io/status down
00:34 2017-07-27 00:34:22,763:INFO:certbot.renewal:Cert not yet due for renewal
07:00 darkk@ wakes up
07:31 darkk@ logs into b.web-connectivity.th.ooni.io
07:33 2017-07-27T07:33:31.007042572Z Pidfile /oonib.pid contains non-numeric value (after an attempt to truncate --size 0 /var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid)
07:34 2017-07-27T07:34:00.767235934Z Removing stale pidfile /oonib.pid (after echo 42 >/var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid)
07:36 [RESOLVED] Instance https://b.web-connectivity.th.ooni.io/status down
09:50 incident published
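
For context on the 00:00:34 line above: twistd refuses to start when its pidfile names a process that exists, and inside a container PID 1 always exists, so the pidfile left behind by the forced kill wedges every later start until it is removed. A minimal sketch of the mechanism (arbitrary image, not the actual oonib entrypoint):

```sh
# A leftover pidfile containing "1" looks like a live daemon to twistd's
# process-existence check, because PID 1 always exists inside a container.
docker run --rm debian:stretch sh -c '
  echo 1 > /oonib.pid                      # what the force-killed daemon left behind
  kill -0 "$(cat /oonib.pid)" 2>/dev/null \
    && echo "pidfile names a live process: twistd would refuse to start"
'
```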

What went well:

  • repeated alerting made it clear that the problem was ongoing
  • although there was no darkk user on the host, my key was in root's authorized_keys
  • docker's aufs backend makes it easy to amend a container's filesystem while it is stopped (see the sketch below); the device-mapper backend is worse in this respect
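
For the record, the 07:33–07:34 recovery from the timeline, roughly reconstructed as commands (the layer id is the one from the log lines; the exact sequence on the host may have differed):

```sh
# The aufs upper layer of a stopped container is an ordinary host directory:
D=/var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094
truncate --size 0 "$D/oonib.pid"   # first try: an empty pidfile is "non-numeric", twistd still bails out
echo 42 > "$D/oonib.pid"           # any numeric, dead PID works: twistd removes it as stale
docker start ooni-backend-b.web-connectivity.th.ooni.io
```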

What went wrong:

  • this was a relapse of the pid=1-in-pidfile problem
  • the "View in AlertManager" link in email and Slack notifications is not clickable
  • the two certbot launchers, /etc/systemd/system/timers.target.wants/certbot.timer and /etc/cron.d/letsencrypt_renew_certs-..., are confusing; I did not notice one of them at first and spent some time looking for the trigger of the SIGTERM
  • the ooni-backend-b.web-connectivity.th.ooni.io container stores a significant amount of logs inside: they take 7G of the overall 15G container size limit. Historical logs are certainly useful, but: 1. renaming 7400 files on rotation may be slow, and twisted is single-threaded; 2. they should not be allowed to overflow the container disk quota (see the rotation sketch after this list)
  • twisted logs do not contain milliseconds, which help to distinguish two nearby events from one event producing several log lines. docker logs stores nanoseconds, but oonib does not log to stdout (docker)
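
On the two log bullets above: renaming the live file once to a timestamped name sidesteps the 7400-file rename cascade, and a count cap keeps the quota safe. A sketch, assuming oonib can be told to reopen its log file (it may need a restart instead); the cap of 200 is arbitrary:

```sh
LOG=/var/log/ooni/oonibackend.log
mv "$LOG" "$LOG.$(date +%s)"                      # one rename per rotation instead of thousands
# ...signal the daemon to reopen its log file here (hypothetical step)...
ls -1t "$LOG".* | tail -n +201 | xargs -r rm --   # keep only the newest 200 rotated files
```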

What is still unclear and should be discussed:

  • [ ] impact: do applications really fall back to another test helper?
  • [ ] do we need a more insistent notification system that can wake people up?

What could be done to prevent relapse and decrease impact:

  • [ ] move letsencrypt renewals into the team's "business hours"; people are asleep at UTC midnight
  • [ ] avoid stale pid files somehow: clean up the twisted PID file on container start (see the sketch after this list) || grab flock on the pidfile || randomize the daemon PID so it's not pid=1
  • [x] fix the AlertManager link
  • [ ] fix the two certbot launchers
  • [ ] limit the number of /var/log/ooni/oonibackend.log.* files in the container
  • [ ] rename /var/log/ooni/oonibackend.log.* files only once (e.g. to oonibackend.log.${unix_timestamp})
  • [ ] add milli- or microseconds to twisted logs || log to stdout (to docker)
  • [ ] ensure that the same storage driver is used across all containers #96
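
For the stale-pidfile item, the smallest change of the three options is probably a cleanup step in the container entrypoint. A sketch (entrypoint.sh and the CMD hand-off are assumptions about the image layout, not the current oonib Dockerfile):

```sh
#!/bin/sh
# entrypoint.sh (hypothetical): a freshly started container cannot contain a
# live oonib from a previous run, so removing a leftover pidfile here is safe.
rm -f /oonib.pid
exec "$@"   # hand off to the real daemon command from the image's CMD
```

flock on the pidfile or PID randomization would achieve the same, but both require touching the daemon rather than the image.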

darkk · Jul 27 '17 09:07

> AFAIK clients should use another test helper if one of them is down, but that may not be the case

This is what ooni-probe does, but MK is lagging behind in this respect.

In particular, my understanding is that (@hellais correct me if I'm wrong) the bouncer returns three different collectors and/or test helpers (if any):

  1. the onion one
  2. the https one
  3. the cloudfronted one

Regarding MK specifically: it cannot use the onion one; it will use the https one; and with some more hammering it will also be able to use the cloudfronted one.

Question: how relevant is it that a client will retry with another helper, given that the three returned helpers (collectors) may all point to the same VM?

I mean, let's assume that the bouncer returns:

  • A.onion pointing to VM A
  • A.https pointing to VM A
  • A.cloudfronted pointing to VM A

In that case, what do we gain by retrying all of them if the breakage is caused by A being down?
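
To make the question concrete, client-side fallback boils down to a loop like the sketch below (the URLs are the hypothetical A.* entries from the example above; the onion entry is omitted since it would also need a Tor SOCKS proxy). If all entries resolve to VM A and A is down, the loop just pays one timeout per entry and fails anyway:

```sh
# Try each helper returned by the bouncer in order; first reachable one wins.
for helper in https://A.https/status https://A.cloudfronted/status; do
  if curl --silent --fail --max-time 10 "$helper" >/dev/null; then
    echo "using $helper" && break
  fi
done
```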

  • [ ] Is the bouncer smart enough to return helpers (or collectors) belonging to different VMs (e.g. A.onion, B.https, C.cloudfronted) so as to increase diversity?

  • [ ] Is the bouncer able to return more than one entry for each category?

  • [ ] Is the bouncer somehow connected to alerts so that it reasonably knows that A is up when it returns A.onion, A.https and A.cloudfronted?

bassosimone · Aug 16 '17 13:08

Relapse at b.web-connectivity.th.ooni.io, timeline (UTC):

25 Sep 00:00 2017-09-25T00:00:40.448510440Z Another twistd server is running, PID 1
25 Sep 00:05 <AlertManager> [FIRING:1] InstanceDown
25 Sep 07:36 2017-09-25T07:36:01.047083733Z Removing stale pidfile /oonib.pid
25 Sep 07:40 <AlertManager> [RESOLVED] InstanceDown

darkk · Sep 25 '17 08:09

This relapsed again twice recently:

01:11 18th November 2019 UTC+2
[FIRING] mia-wcth.ooni.io:9100 had 1 OOMs in 5m
[FIRING] http://y3zq5fwelrzkkv3s.onion/status endpoint down
[FIRING] http://2lzg3f4r3eapfi6j.onion/status endpoint down
[RESOLVED] mia-wcth.ooni.io:9100 had 1 OOMs in 5m

hellais 11:33 18th November 2019 UTC+2
[RESOLVED] https://mia-wcth.ooni.io/status endpoint down

This was actually down. The issue was:

> Another twistd server is running, PID 1
> This could either be a previously started instance of your application or a different application entirely. To start a new one, either run it in some other directory, or use the --pidfile and --logfile parameters to avoid clashes.

The fix was to delete /srv/web_connectivity/oonib.pid and restart the docker container.

and then

00:34 21st November 2019 UTC+2 [FIRING] https://mia-wcth.ooni.io/status endpoint down

10:29 21st November 2019 UTC+2
Damn, the mia-wcth went down again due to: https://openobservatory.slack.com/archives/C38EJ0CET/p1574073207161400
10:29 [RESOLVED] https://mia-wcth.ooni.io/status endpoint down

hellais · Nov 21 '19 09:11

Relapsed again:

[FIRING] https://mia-wcth.ooni.io/status endpoint down
docker stop 1ca0dce565e5
rm /srv/web_connectivity/oonib.pid
docker restart 1ca0dce565e5
docker logs 1ca0dce565e5    # shows "Starting web_connectivity helper..."
tail /srv/web_connectivity/logs/oonibackend.log -f   # unstuck
[RESOLVED] mia-run.ooni.nu:9100 is not OK, check `systemctl list-units | grep failed`

FedericoCeratto · Dec 29 '19 14:12