Redis connection leak on eventlet workers

Open · stitch opened this issue 2 years ago

This issue: https://github.com/celery/celery/issues/6819

Starting empty (hello world) tasks creates connections that stay open indefinitely. We would expect that after a task (an eventlet green thread) is done, its connection would be cleaned up. That is not the case: over time the number of open connections grows until file descriptor or memory limits are reached. The current workaround is to restart the workers every N hours.
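
For reference, a minimal reproduction along these lines would look roughly as follows. This is a sketch only: the module and task names are hypothetical, not the actual internet.nl code.

# tasks.py - hypothetical minimal reproduction, not the actual internet.nl code
from celery import Celery

# Broker and result backend both point at the local Redis instance.
app = Celery(
    "repro",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@app.task
def hello():
    # An empty "hello world" task: no I/O of its own, yet the worker's
    # result-backend connection to Redis is never released afterwards.
    return "world"

Running this with an eventlet pool, for example celery -A tasks worker -P eventlet, and calling hello.delay() repeatedly should be enough to watch the worker's open connection count grow.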

The issue occurs without even hitting our own code: simple tasks that only do an IPv6 resolve already show leaking connections. Tracing a worker in this state shows that it is doing nothing more than the following:

strace -p 19321

clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=720383250}) = 0
epoll_wait(4, [], 1023, 0)              = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=720658417}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=47295612}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=721018067}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=721183608}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=721325981}) = 0
epoll_wait(4, [], 1023, 0)              = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=721601382}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=48239579}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=721955062}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=722120702}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=722264024}) = 0
epoll_wait(4, [], 1023, 0)              = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=722537283}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=49175297}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=722893385}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=49543284}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=723258824}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=723418006}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=723561222}) = 0
epoll_wait(4, [], 1023, 0)              = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=723833521}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=50481187}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=724199070}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=50850786}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=724566930}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=724726221}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=724867491}) = 0
epoll_wait(4, [], 1023, 0)              = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=725141201}) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1647532476, tv_nsec=51777645}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=725493782}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=725652914}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563619, tv_nsec=725794919}) = 0
epoll_wait(4, [], 1023, 323)            = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563620, tv_nsec=50006472}) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 13, 0x7ffe8b1905d4) = 0
recvfrom(13, 0x7fdfbca01b30, 7, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563620, tv_nsec=51768181}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 13, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=13, u64=140595754434573}}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563620, tv_nsec=52384110}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=563620, tv_nsec=52781898}) = 0

This is happening while Redis has already closed its side of the connection, leaving the worker's sockets stuck in CLOSE_WAIT.

lsof -p 19321

python3.7 19321 internetnl  279u     IPv4           15982677      0t0      TCP localhost:40472->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  280u     IPv4           15982681      0t0      TCP localhost:40476->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  281u     IPv4           15982684      0t0      TCP localhost:40478->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  282u     IPv4           15989159      0t0      TCP localhost:42238->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  283u     IPv4           15983046      0t0      TCP localhost:41126->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  284u     IPv4           15982999      0t0      TCP localhost:41060->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  285u     IPv4           15985675      0t0      TCP localhost:41050->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  286u     IPv4           15985676      0t0      TCP localhost:41052->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  287u     IPv4           15983000      0t0      TCP localhost:41062->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  288u     IPv4           15983047      0t0      TCP localhost:41132->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  289u     IPv4           15983048      0t0      TCP localhost:41134->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  290u     IPv4           15984932      0t0      TCP localhost:41280->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  291u     IPv4           15984933      0t0      TCP localhost:41282->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  292u     IPv4           15984934      0t0      TCP localhost:41284->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  293u     IPv4           15984935      0t0      TCP localhost:41286->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  294u     IPv4           15990893      0t0      TCP localhost:42264->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  295u     IPv4           15985915      0t0      TCP localhost:41340->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  296u     IPv4           15989160      0t0      TCP localhost:42240->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  297u     IPv4           15989161      0t0      TCP localhost:42244->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  298u     IPv4           15990894      0t0      TCP localhost:42266->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  299u     IPv4           15988579      0t0      TCP localhost:42268->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  300u     IPv4           15988580      0t0      TCP localhost:42270->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  302u     IPv4           15985916      0t0      TCP localhost:41342->localhost:6379 (CLOSE_WAIT)
python3.7 19321 internetnl  307u     IPv4           15985917      0t0      TCP localhost:41350->localhost:6379 (CLOSE_WAIT)
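
A small sketch of how the growth of these leaked connections can be tracked over time from Python (assuming psutil is installed; the PID and port are taken from the lsof output above):

# watch_close_wait.py - sketch, assumes psutil is available
import time
import psutil

WORKER_PID = 19321   # the eventlet worker from the strace/lsof output above
REDIS_PORT = 6379

proc = psutil.Process(WORKER_PID)
while True:
    conns = proc.connections(kind="tcp")
    leaked = [
        c for c in conns
        if c.status == psutil.CONN_CLOSE_WAIT
        and c.raddr and c.raddr.port == REDIS_PORT
    ]
    print(f"{time.strftime('%H:%M:%S')} CLOSE_WAIT connections to redis: {len(leaked)}")
    time.sleep(60)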

We tried disabling hiredis: this had no effect. We tried removing any dependency on our own code: no effect. A next option would be to check whether we can reproduce the same behavior on another machine, and to further reduce the issue surface. This will take some time.

stitch avatar Mar 17 '22 15:03 stitch

@stitch are there any new developments with this? Currently we work around this with periodic restarts, pending an upstream fix?

mxsasha avatar Aug 29 '22 11:08 mxsasha

Status:

https://github.com/internetstandards/Internet.nl/tree/main/documentation/example_configuration/opt_internetnl_etc/cron

stitch avatar Jan 17 '23 09:01 stitch

@stitch: these Celery things can indeed be tricky to solve. The quick fix of a time-based cron restart every 30 minutes (https://github.com/internetstandards/Internet.nl/blob/fe7c1fad6d67d61183fc5d700813008140646add/documentation/example_configuration/opt_internetnl_etc/cron/batch_crontab#L7) is insufficient during high load, since time != tasks. Wouldn't it make more sense to set a maximum number of tasks a worker process can execute before respawning, to work around the connection leak? So, in the config, set some sensible N for the max tasks (also based on the number of workers):

CELERYD_OPTS="--max-tasks-per-child=N"
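
For completeness, a hedged sketch of the equivalent Celery setting in Django settings.py (assuming the usual CELERY_ settings namespace; the value shown is illustrative):

# settings.py - hypothetical sketch, assuming the CELERY_ settings namespace
# Equivalent to --max-tasks-per-child=N: replace a pool worker process after it
# has executed N tasks, as a fail-safe against slowly accumulating leaks.
CELERY_WORKER_MAX_TASKS_PER_CHILD = 500  # illustrative N; tune to load and worker count

As noted later in this thread, this option targets (prefork) child processes, and it appeared to have no effect with an eventlet-based worker pool.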

bwbroersma avatar Jan 17 '23 21:01 bwbroersma

While debugging this issue we found that even with just regular 'resolve' tasks and nothing else, connections stay open after the task has finished.

The behavior is: a task runs, finishes, and its results are returned, but the green thread stays alive, holding a file handle to Redis. I could not reproduce this locally, but I also don't have the exact same setup. This is why Redis is now configured to drop connections after a while.
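
A sketch of what that mitigation looks like; the 300-second value is illustrative, not necessarily the production setting. It corresponds to the timeout directive in redis.conf and can also be applied at runtime via redis-py:

# Sketch: have Redis close idle client connections after 300 seconds,
# equivalent to "timeout 300" in redis.conf. The value is illustrative only.
import redis

r = redis.Redis(host="localhost", port=6379)
r.config_set("timeout", 300)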

I suspect that if this is not an OS/platform-specific issue, it will also turn up in a containerized version of internet.nl.

stitch avatar Feb 07 '23 08:02 stitch

Perhaps #890 will magically cause this to disappear, or allow us to make a smaller test case.

mxsasha avatar Apr 11 '23 15:04 mxsasha

This issue seems resolved by switching to Celery 5.3.1 and enabling CELERY_RESULT_BACKEND_THREAD_SAFE in settings.py. Changes are in the #890 PR.
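
For reference, a minimal sketch of that setting in Django settings.py (assuming the CELERY_ settings namespace; see PR #890 for the actual change):

# settings.py - sketch only; the real change is in PR #890
# Intended to let the result backend (and its connection pool) be shared safely
# across threads/green threads instead of each creating its own backend instance.
CELERY_RESULT_BACKEND_THREAD_SAFE = True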

aequitas avatar Jun 29 '23 11:06 aequitas

@aequitas: going from short cron restarts to zero restarts might be a bit tricky; would it be a wise fail-safe to still implement https://github.com/internetstandards/Internet.nl/issues/676#issuecomment-1386070619? Or do the child workers in our case all run in one process, so there is no benefit to doing this?

bwbroersma avatar Sep 05 '23 08:09 bwbroersma

@bwbroersma I'm using the ipv6kaartmetingen server to test this. So far I've not seen issues regarding Redis. There were some issues with the Docker daemon itself failing, but this seems to have been resolved in a recent patch release. The server itself is on the low end of specs and runs a batch every day, so if there are any resource issues they are likely to surface here. When I add an exporter to collect node metrics (CPU, memory, processes, etc.) we will probably also have better insight into whether leaks are still accumulating. But for now I don't see a reason to enable restarts. I tried the suggested max-tasks-per-child solution a while back, but it seemed to have no effect because the workers are eventlet-based instead of thread-based.

aequitas avatar Sep 05 '23 10:09 aequitas

Sounds like this is fixed.

mxsasha avatar Jan 09 '24 09:01 mxsasha