Internet.nl
Redis connection leak on eventlet workers
Related upstream Celery issue: https://github.com/celery/celery/issues/6819
Starting empty (hello world) tasks creates connections that stay open indefinitely. You would expect that after a task (an eventlet greenthread) is done, its connection would be cleaned up. That is not the case: over time the number of open connections grows until file descriptor or memory limits are reached. The current workaround is to restart the workers every N hours.
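For context, a minimal sketch of the kind of no-op task that already shows this behavior; the module name, broker/backend URLs and worker command below are illustrative assumptions, not the project's actual configuration:

# tasks_sketch.py - hypothetical minimal reproduction, assuming a local Redis
# broker/result backend and an eventlet worker pool, started with e.g.:
#   celery -A tasks_sketch worker --pool=eventlet --concurrency=100
from celery import Celery

app = Celery(
    "tasks_sketch",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@app.task
def hello():
    # Does nothing of interest; per the report above, even a task like this
    # leaves its Redis result-backend connection open after it finishes.
    return "hello world"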
The issue occurs without even hitting our code: simple tasks that only do an IPv6 resolve already show leaking connections. It is possible to see what the workers do while they are in this state, which is just the following:
strace -p 19321
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=720383250}) = 0
epoll_wait(4, [], 1023, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=720658417}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=47295612}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=721018067}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=721183608}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=721325981}) = 0
epoll_wait(4, [], 1023, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=721601382}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=48239579}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=721955062}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=722120702}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=722264024}) = 0
epoll_wait(4, [], 1023, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=722537283}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=49175297}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=722893385}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=49543284}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=723258824}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=723418006}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=723561222}) = 0
epoll_wait(4, [], 1023, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=723833521}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=50481187}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=724199070}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=50850786}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=724566930}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=724726221}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=724867491}) = 0
epoll_wait(4, [], 1023, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=725141201}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=51777645}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=725493782}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=725652914}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=725794919}) = 0
epoll_wait(4, [], 1023, 323) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563620, tv_nsec=50006472}) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 13, 0x7ffe8b1905d4) = 0
recvfrom(13, 0x7fdfbca01b30, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563620, tv_nsec=51768181}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 13, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=13, u64=140595754434573}}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563620, tv_nsec=52384110}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563620, tv_nsec=52781898}) = 0
This is happening while Redis has already closed the connections:
lsof -p 19321
python3.7 19321 internetnl 279u IPv4 15982677 0t0 TCP localhost:40472->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 280u IPv4 15982681 0t0 TCP localhost:40476->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 281u IPv4 15982684 0t0 TCP localhost:40478->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 282u IPv4 15989159 0t0 TCP localhost:42238->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 283u IPv4 15983046 0t0 TCP localhost:41126->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 284u IPv4 15982999 0t0 TCP localhost:41060->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 285u IPv4 15985675 0t0 TCP localhost:41050->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 286u IPv4 15985676 0t0 TCP localhost:41052->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 287u IPv4 15983000 0t0 TCP localhost:41062->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 288u IPv4 15983047 0t0 TCP localhost:41132->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 289u IPv4 15983048 0t0 TCP localhost:41134->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 290u IPv4 15984932 0t0 TCP localhost:41280->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 291u IPv4 15984933 0t0 TCP localhost:41282->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 292u IPv4 15984934 0t0 TCP localhost:41284->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 293u IPv4 15984935 0t0 TCP localhost:41286->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 294u IPv4 15990893 0t0 TCP localhost:42264->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 295u IPv4 15985915 0t0 TCP localhost:41340->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 296u IPv4 15989160 0t0 TCP localhost:42240->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 297u IPv4 15989161 0t0 TCP localhost:42244->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 298u IPv4 15990894 0t0 TCP localhost:42266->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 299u IPv4 15988579 0t0 TCP localhost:42268->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 300u IPv4 15988580 0t0 TCP localhost:42270->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 302u IPv4 15985916 0t0 TCP localhost:41342->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl 307u IPv4 15985917 0t0 TCP localhost:41350->localhost:6379 (CLOSE_WAIT)
We tried disabling hiredis: no effect. We tried removing any dependency on our code: no effect. A next option would be to check whether we can reproduce the same behavior on another machine and further reduce the issue surface. This will take some time.
@stitch are there any new developments with this? Currently we work around this with periodic restarts, pending an upstream fix?
Status:
https://github.com/internetstandards/Internet.nl/tree/main/documentation/example_configuration/opt_internetnl_etc/cron
@stitch: these Celery things can indeed be tricky to solve. The quick fix of a time-based cron restart every 30 minutes (https://github.com/internetstandards/Internet.nl/blob/fe7c1fad6d67d61183fc5d700813008140646add/documentation/example_configuration/opt_internetnl_etc/cron/batch_crontab#L7) is insufficient during high load, since time != tasks. Wouldn't it make more sense to set a maximum number of tasks a worker process can execute before respawning, to work around the connection leak? So in the config, set some sensible N for the max tasks (also based on the number of workers):
CELERYD_OPTS=--max-tasks-per-child N
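For reference, the same limit can also be expressed as a Celery configuration setting instead of a CLI flag; a minimal sketch (the app name, broker URL and the value 100 are illustrative, and as noted later in this thread the option only recycles prefork child processes, so it has no effect on eventlet pools):

from celery import Celery

# Hypothetical app for illustration only.
app = Celery("internetnl_sketch", broker="redis://localhost:6379/0")

# Equivalent of passing --max-tasks-per-child N to the worker:
# replace a (prefork) child process after it has executed 100 tasks.
app.conf.worker_max_tasks_per_child = 100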
While debugging this issue we found that even with just regular 'resolve' tasks and nothing else, connections stay open after the task has finished.
The behavior is: a task runs, finishes, and its results are returned, but the greenthread stays open with a file handle to Redis. I could not reproduce this locally, but I also don't have the exact same setup. This is why Redis is now configured to drop idle connections after a while.
I suspect that if this is not an OS/platform issue, it will also turn up in a containerized version of internet.nl.
Perhaps #890 will magically cause this to disappear, or allow us to make a smaller test case.
This issue seems to be resolved by switching to Celery 5.3.1 and enabling CELERY_RESULT_BACKEND_THREAD_SAFE in settings.py. The changes are in the #890 PR.
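For reference, a minimal sketch of what that setting looks like in a Django-style settings.py, assuming the Celery app reads its configuration from Django settings with the usual CELERY namespace (the exact wiring in the #890 PR may differ):

# settings.py (sketch)
# Only picked up if the Celery app is configured with something like
#   app.config_from_object("django.conf:settings", namespace="CELERY")
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"  # illustrative URL
CELERY_RESULT_BACKEND_THREAD_SAFE = True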
@aequitas: going from short cron restarts to zero restarts might be a bit tricky. Would it be a wise fail-safe to still implement https://github.com/internetstandards/Internet.nl/issues/676#issuecomment-1386070619? Or are the children in our case still running in one process, so there isn't a benefit to doing this?
@bwbroersma I'm using the ipv6kaartmetingen server to test this. So far I've not seen issues regarding Redis. There were some issues with the Docker daemon itself failing, but these seem to have been resolved in a recent patch release. The server itself is on the low end of specs and runs a batch every day, so if there are any resource issues they are likely to surface here. When I add an exporter to collect node metrics (CPU, memory, processes, etc.) we will probably also have better insight into whether leaks are still accumulating. But for now I don't see a reason to enable restarts. I tried the suggested max-tasks-per-child solution a while back, but it seemed to have no effect because the workers are eventlet based instead of thread based.
Sounds like this is fixed.