
Orphaned reply pidboxes in Redis

Open · rkul opened this issue 11 years ago · 12 comments

Hello all. We use Celery 3.0.21 with the Redis backend (AWS ElastiCache), and recently I discovered huge memory usage on the Redis node. After dumping the Redis database and inspecting it, I found that there are many reply pidboxes (~40) with one big list element (about 16 MB) containing many duplicated scheduled tasks (more than a thousand).

I think the duplicated tasks are a different issue. So could anybody clarify the approach to deleting unused reply pidboxes? I've compared the number of reply pidboxes with the members of the _kombu.binding.reply.celery.pidbox key and found an inconsistency (41 in the bindings versus 77 in total).

And I would much appreciate it if anybody could explain how these orphaned pidboxes come to exist.
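For anyone trying to reproduce that check, here is a minimal sketch of comparing the binding set against the reply pidbox keys that actually exist, assuming redis-py and that the broker lives on the default Redis DB (the key names follow the ones mentioned above):

import redis

r = redis.Redis()  # assumes the broker is on the default Redis DB

# Reply queues currently registered in kombu's binding set (one member per binding).
bound_count = r.scard("_kombu.binding.reply.celery.pidbox")

# Reply pidbox keys actually present in Redis, excluding the binding set itself.
reply_keys = [
    key for key in r.scan_iter("*reply.celery.pidbox*")
    if not key.decode().startswith("_kombu.binding")
]

print(f"bound: {bound_count}, existing: {len(reply_keys)}")

A gap between the two numbers corresponds to the orphaned reply queues described above.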

rkul avatar Dec 23 '13 16:12 rkul

These replies are sent for remote control commands. I know flower is constantly sending out control commands, but I'm not sure there is an easy fix.

The best option would be if the replies were using PUBSUB instead of being stored in a list.
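To illustrate that suggestion, here is a minimal redis-py sketch of the difference (the queue/channel name is only a placeholder, and the RPUSH line is just a rough stand-in for how the transport stores replies today):

import redis

r = redis.Redis()

# List-based storage: if the consumer is gone, the reply just sits in Redis forever.
r.rpush("some-uuid.reply.celery.pidbox", b"reply payload")

# Pub/sub: publishing to a channel with no subscribers returns 0 and the message
# is discarded immediately, so there is nothing left behind to orphan.
receivers = r.publish("some-uuid.reply.celery.pidbox", b"reply payload")
print(receivers)  # 0 when nobody is listening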

cc @mher

ask avatar Jan 13 '14 16:01 ask

Are these replies safe to delete @ask?

31z4 avatar Jul 10 '17 11:07 31z4

I had the same problem. I disabled gossip and mingle to prevent these from getting created. I found that these were growing larger and larger due to an ever-growing list of revoked tasks in the mingle replies.
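For reference, gossip and mingle are disabled per worker with the --without-gossip and --without-mingle command-line flags; a minimal sketch of starting a worker that way programmatically, assuming a recent Celery where app.worker_main accepts the worker argv (the app name and broker URL are placeholders):

from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")  # placeholder broker URL

if __name__ == "__main__":
    # Equivalent to: celery -A proj worker --without-gossip --without-mingle
    app.worker_main([
        "worker",
        "--without-gossip",   # skip worker-to-worker event gossip
        "--without-mingle",   # skip syncing revoked/clock state with other workers at startup
        "--loglevel=INFO",
    ])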

tilgovi avatar Jun 27 '18 18:06 tilgovi

For my future self and other Googlers: this can be disabled in Django with:

CELERY_WORKER_ENABLE_REMOTE_CONTROL = False  # Celery 4
CELERY_ENABLE_REMOTE_CONTROL = False  # Celery 3

see: https://docs.celeryproject.org/en/stable/userguide/configuration.html#worker-enable-remote-control

And yes, I had this problem twice because of the setting name change from v3 to v4 :(

michi88 avatar Feb 12 '20 16:02 michi88

Was this fixed?

tilgovi avatar Feb 28 '20 20:02 tilgovi

I also ran into the same problem. The way I handled it was to delete all of the .reply.celery.pidbox keys (except for _kombu.binding.reply.celery.pidbox), then set CELERY_SEND_EVENTS = False. But I don't know what the .reply.celery.pidbox keys are used for.
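A minimal sketch of that cleanup, assuming redis-py and that the broker uses the default Redis DB; since Redis lists are recreated on the next push, deleting a reply queue that is still in use should at worst drop a pending reply, but verify against your own setup first:

import redis

r = redis.Redis()  # point this at the broker DB

for key in r.scan_iter("*reply.celery.pidbox*"):
    # Keep the binding set that maps the reply exchange to its queues.
    if key.decode().startswith("_kombu.binding"):
        continue
    r.delete(key)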

xiaobaidemu avatar Apr 26 '20 02:04 xiaobaidemu

In my project's case, we occasionally get a lot of revoked tasks (due to task expiration). All workers keep the IDs locally, and when new workers come up they receive the revoked list from all the others. When there is a large number of revoked tasks and a large number of workers (dozens at least; we hover around ~100 workers currently), all of that info is sent by every currently running worker to each new worker. If you have many new workers then, at least for Redis, this can topple the Redis RAM (if it's not large enough): the new workers are popping these revoked replies off their queues, but they can't go fast enough, so a potentially huge backlog of data sits in Redis.

At that point any/most/all new workers (and surely some old ones) will hit a Redis OOM error on connection and crash or fail to connect to Redis (at least via celery/kombu semantics). The reply queue has a UUID in it, which is only valid during the runtime of the worker. So the workers die, restart, and leave orphaned queues behind.
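A minimal sketch of gauging that backlog, assuming redis-py and Redis 4.0+ (for MEMORY USAGE); it just reports how many unconsumed replies each reply pidbox holds and roughly how much memory they take:

import redis

r = redis.Redis()  # point this at the broker DB

for key in r.scan_iter("*reply.celery.pidbox*"):
    if key.decode().startswith("_kombu.binding"):
        continue
    pending = r.llen(key)            # replies nobody has consumed yet
    size = r.memory_usage(key) or 0  # approximate bytes held by this key
    print(f"{key.decode()}: {pending} pending replies, ~{size / 1024 / 1024:.1f} MiB")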

I have opened an issue related to the LimitedSet implementation behind revoked tasks, since I think that is a big part of the problem (in certain cases).

I cannot speak to other cases (non-revoke, which yes, is part of mingle).

We have been deleting pidbox keys.

I've recently implemented custom revoke clear & purge (and expire and max) control commands so that we can manage this more officially (don't delete via Redis; just squash the data on the workers appropriately). I made a gist on my GitHub with a version of that implementation.
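Not that gist, but a minimal sketch of the general idea using Celery's custom remote control command API; the command name is made up here, and the reliance on celery.worker.state.revoked is an assumption to check against your Celery version:

from celery.worker import state as worker_state
from celery.worker.control import control_command

@control_command()
def clear_revoked(state):
    # Drop this worker's in-memory LimitedSet of revoked task ids so it stops
    # being re-broadcast to (and bloating the reply queues of) new workers.
    count = len(worker_state.revoked)
    worker_state.revoked.clear()
    return {"ok": f"cleared {count} revoked task ids"}

# The module defining this must be imported by the worker for the command to register.
# It can then be invoked from anywhere with access to the app, e.g.:
#     app.control.broadcast("clear_revoked", reply=True)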

jheld avatar May 08 '20 00:05 jheld

Is there any update on this issue? This is causing an issue for my company's Celery cluster as well.

embooglement avatar Mar 31 '21 01:03 embooglement

How to fix this issue? Thanks for any help!

colinsarah avatar May 05 '22 12:05 colinsarah

How to fix this issue? Thanks for any help!

https://github.com/celery/kombu/issues/294#issuecomment-625558433 open to trying?

jheld avatar May 05 '22 16:05 jheld

Our solution to this issue is to periodically remove keys that have been idle for more than a given number of days (seven, in the snippet below):

red = app.backend.client

for key in red.scan_iter("*", 100000):
    # Leave kombu's own bookkeeping keys (bindings, etc.) alone.
    if key.decode().startswith('_kombu'):
        continue

    # Delete anything that hasn't been touched for 7 days (idletime is in seconds).
    if red.object('idletime', key) >= (24 * 7 * 3600):
        red.delete(key)

linar-jether avatar Oct 25 '22 11:10 linar-jether

Our solution to this issue is to periodically remove keys that have been idle for more than X days

red = app.backend.client
for key in red.scan_iter("*", 100000):
    if key.decode().startswith('_kombu'):
        continue

    if red.object('idletime', key) >= (24 * 7 * 3600):
        red.delete(key)

Actually, these keys are created by the broker, not the result backend. So if you are using different Redis DBs for the broker and the result backend, the snippet above will not work.

Here is a small update that gets the Redis client from the Celery app's broker URL instead:

import redis

# Connect to the broker DB, where the reply pidbox keys actually live.
redis_instance = redis.from_url(celery_app.conf.broker_url)

for key in redis_instance.scan_iter("*", 100000):
    name = key.decode()
    # Skip kombu's binding set; only the per-worker reply queues should go.
    if name.startswith("_kombu.binding"):
        continue
    # Delete reply queues that have been idle for at least 7 days (seconds).
    if name.endswith("reply.celery.pidbox") and redis_instance.object(
        "idletime", key
    ) >= (24 * 7 * 3600):
        redis_instance.delete(key)
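If you want to run this automatically, one option is to wrap it in a periodic task driven by Celery beat; a minimal sketch reusing the celery_app from the snippet above (the task name and daily schedule are placeholders):

import redis
from celery.schedules import crontab

@celery_app.task
def cleanup_orphaned_reply_pidboxes():
    redis_instance = redis.from_url(celery_app.conf.broker_url)
    for key in redis_instance.scan_iter("*reply.celery.pidbox*"):
        if key.decode().startswith("_kombu.binding"):
            continue
        if redis_instance.object("idletime", key) >= (24 * 7 * 3600):
            redis_instance.delete(key)

celery_app.conf.beat_schedule = {
    "cleanup-orphaned-reply-pidboxes": {
        "task": cleanup_orphaned_reply_pidboxes.name,
        "schedule": crontab(hour=3, minute=0),  # once a day, off-peak
    },
}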

mrchi avatar Aug 02 '23 09:08 mrchi