
Orphaned reply pidboxes in Redis

Open · rkul opened this issue 11 years ago · 12 comments

Hello all. We use Celery 3.0.21 with the Redis backend (AWS ElastiCache), and recently I discovered huge memory usage on the Redis node. After dumping the Redis database and inspecting it, I found that there are many reply pidboxes (~40) with one big list element (about 16 MB) containing many duplicated scheduled tasks (more than a thousand).

I think the duplicated tasks are a different issue. So could anybody clarify the approach to deleting unused reply pidboxes? I've compared the number of reply pidboxes with the members of the _kombu.binding.reply.celery.pidbox key and found an inconsistency (41 in the bindings versus 77 in total).

And I would much appreciate it if anybody could explain how these orphaned pidboxes come to exist.
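For anyone trying to reproduce that check, here is a minimal sketch of comparing the binding set against the reply pidbox keys that actually exist, assuming redis-py and that the broker lives on the default Redis DB (the key names follow the ones mentioned above):

import redis

r = redis.Redis()  # assumes the broker is on the default Redis DB

# Reply queues currently registered in kombu's binding set (one member per binding).
bound_count = r.scard("_kombu.binding.reply.celery.pidbox")

# Reply pidbox keys actually present in Redis, excluding the binding set itself.
reply_keys = [
    key for key in r.scan_iter("*reply.celery.pidbox*")
    if not key.decode().startswith("_kombu.binding")
]

print(f"bound: {bound_count}, existing: {len(reply_keys)}")

A gap between the two numbers corresponds to the orphaned reply queues described above.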

rkul avatar Dec 23 '13 16:12 rkul

These replies are sent for remote control commands. I know flower is constantly sending out control commands, but I'm not sure there is an easy fix.

The best option would be if the replies were using PUBSUB instead of being stored in a list.
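To illustrate that suggestion, here is a minimal redis-py sketch of the difference (the queue/channel name is only a placeholder, and the RPUSH line is just a rough stand-in for how the transport stores replies today):

import redis

r = redis.Redis()

# List-based storage: if the consumer is gone, the reply just sits in Redis forever.
r.rpush("some-uuid.reply.celery.pidbox", b"reply payload")

# Pub/sub: publishing to a channel with no subscribers returns 0 and the message
# is discarded immediately, so there is nothing left behind to orphan.
receivers = r.publish("some-uuid.reply.celery.pidbox", b"reply payload")
print(receivers)  # 0 when nobody is listening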

cc @mher

ask avatar Jan 13 '14 16:01 ask

Are these replies safe to delete @ask?

31z4 avatar Jul 10 '17 11:07 31z4

I had the same problem. I disabled gossip and mingle to prevent these from getting created. I found that these were growing larger and larger due to an ever-growing list of revoked tasks in the mingle replies.
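For reference, gossip and mingle are disabled per worker with the --without-gossip and --without-mingle command-line flags; a minimal sketch of starting a worker that way programmatically, assuming a recent Celery where app.worker_main accepts the worker argv (the app name and broker URL are placeholders):

from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")  # placeholder broker URL

if __name__ == "__main__":
    # Equivalent to: celery -A proj worker --without-gossip --without-mingle
    app.worker_main([
        "worker",
        "--without-gossip",   # skip worker-to-worker event gossip
        "--without-mingle",   # skip syncing revoked/clock state with other workers at startup
        "--loglevel=INFO",
    ])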

tilgovi avatar Jun 27 '18 18:06 tilgovi

For my future self and other Googlers: this can be disabled in Django with:

CELERY_WORKER_ENABLE_REMOTE_CONTROL = False  # Celery 4
CELERY_ENABLE_REMOTE_CONTROL = False  # Celery 3

see: https://docs.celeryproject.org/en/stable/userguide/configuration.html#worker-enable-remote-control

And yes, I had this problem twice because of the setting name change from v3 to v4 :(

michi88 avatar Feb 12 '20 16:02 michi88

Was this fixed?

tilgovi avatar Feb 28 '20 20:02 tilgovi

I also ran into the same problem. The way I handled it was to delete all of the .reply.celery.pidbox keys (except for _kombu.binding.reply.celery.pidbox), then set CELERY_SEND_EVENTS = False. But I don't know what the .reply.celery.pidbox keys are used for.
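A minimal sketch of that cleanup, assuming redis-py and that the broker uses the default Redis DB; since Redis lists are recreated on the next push, deleting a reply queue that is still in use should at worst drop a pending reply, but verify against your own setup first:

import redis

r = redis.Redis()  # point this at the broker DB

for key in r.scan_iter("*reply.celery.pidbox*"):
    # Keep the binding set that maps the reply exchange to its queues.
    if key.decode().startswith("_kombu.binding"):
        continue
    r.delete(key)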

xiaobaidemu avatar Apr 26 '20 02:04 xiaobaidemu

In my project's case, we occasionally get a lot of revoked tasks (due to task expiration). All workers keep the IDs locally, and when new workers come up they receive the revoked list from all the others. When there is a large number of revoked tasks and a large number of workers (dozens at least; we hover around ~100 workers currently), all of that info is sent by every currently running worker to each new worker. If you have many new workers then, at least for Redis, this can topple the Redis RAM (if it's not large enough): the new workers are popping these revoked replies off their queues, but they can't go fast enough, so a potentially huge backlog of data sits in Redis.

At that point any/most/all new workers (and surely some old ones) will hit a Redis OOM error on connection and crash or fail to connect to Redis (at least via celery/kombu semantics). The reply queue has a UUID in it, which is only valid during the runtime of the worker. So the workers die, restart, and leave orphaned queues behind.
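A minimal sketch of gauging that backlog, assuming redis-py and Redis 4.0+ (for MEMORY USAGE); it just reports how many unconsumed replies each reply pidbox holds and roughly how much memory they take:

import redis

r = redis.Redis()  # point this at the broker DB

for key in r.scan_iter("*reply.celery.pidbox*"):
    if key.decode().startswith("_kombu.binding"):
        continue
    pending = r.llen(key)            # replies nobody has consumed yet
    size = r.memory_usage(key) or 0  # approximate bytes held by this key
    print(f"{key.decode()}: {pending} pending replies, ~{size / 1024 / 1024:.1f} MiB")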

I have opened an issue related to the LimitedSet implementation behind revoked tasks, since I think that is a big part of the problem (in certain cases).

I cannot speak to other cases (non-revoke, which yes, is part of mingle).

We have been deleting pidbox keys.

I've recently implemented custom revoke clear & purge (and expire and max) control commands so that we can manage this more officially (don't delete via Redis; just squash the data on the workers appropriately). I made a gist on my GitHub with a version of that implementation.
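Not that gist, but a minimal sketch of the general idea using Celery's custom remote control command API; the command name is made up here, and the reliance on celery.worker.state.revoked is an assumption to check against your Celery version:

from celery.worker import state as worker_state
from celery.worker.control import control_command

@control_command()
def clear_revoked(state):
    # Drop this worker's in-memory LimitedSet of revoked task ids so it stops
    # being re-broadcast to (and bloating the reply queues of) new workers.
    count = len(worker_state.revoked)
    worker_state.revoked.clear()
    return {"ok": f"cleared {count} revoked task ids"}

# The module defining this must be imported by the worker for the command to register.
# It can then be invoked from anywhere with access to the app, e.g.:
#     app.control.broadcast("clear_revoked", reply=True)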

jheld avatar May 08 '20 00:05 jheld

Is there any update on this issue? This is causing an issue for my company's Celery cluster as well.

embooglement avatar Mar 31 '21 01:03 embooglement

How to fix this issue? Thanks for any help!

colinsarah avatar May 05 '22 12:05 colinsarah

How to fix this issue? Thanks for any help!

https://github.com/celery/kombu/issues/294#issuecomment-625558433 open to trying?

jheld avatar May 05 '22 16:05 jheld

Our solution to this issue is to periodically remove keys that have been idle for more than a given number of days (seven, in the snippet below):

red = app.backend.client

for key in red.scan_iter("*", 100000):
    # Leave kombu's own bookkeeping keys (bindings, etc.) alone.
    if key.decode().startswith('_kombu'):
        continue

    # Delete anything that hasn't been touched for 7 days (idletime is in seconds).
    if red.object('idletime', key) >= (24 * 7 * 3600):
        red.delete(key)

linar-jether avatar Oct 25 '22 11:10 linar-jether

Our solution to this issue is to periodically remove keys that have been idle for more than X days

red = app.backend.client
for key in red.scan_iter("*", 100000):
    if key.decode().startswith('_kombu'):
        continue

    if red.object('idletime', key) >= (24 * 7 * 3600):
        red.delete(key)

Actually, these keys are created by the broker, not the result backend. So if you are using different Redis DBs for the broker and the result backend, the snippet above will not work.

Here is a small update that gets the Redis client from the Celery app's broker URL instead:

import redis

# Connect to the broker DB, where the reply pidbox keys actually live.
redis_instance = redis.from_url(celery_app.conf.broker_url)

for key in redis_instance.scan_iter("*", 100000):
    name = key.decode()
    # Skip kombu's binding set; only the per-worker reply queues should go.
    if name.startswith("_kombu.binding"):
        continue
    # Delete reply queues that have been idle for at least 7 days (seconds).
    if name.endswith("reply.celery.pidbox") and redis_instance.object(
        "idletime", key
    ) >= (24 * 7 * 3600):
        redis_instance.delete(key)
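If you want to run this automatically, one option is to wrap it in a periodic task driven by Celery beat; a minimal sketch reusing the celery_app from the snippet above (the task name and daily schedule are placeholders):

import redis
from celery.schedules import crontab

@celery_app.task
def cleanup_orphaned_reply_pidboxes():
    redis_instance = redis.from_url(celery_app.conf.broker_url)
    for key in redis_instance.scan_iter("*reply.celery.pidbox*"):
        if key.decode().startswith("_kombu.binding"):
            continue
        if redis_instance.object("idletime", key) >= (24 * 7 * 3600):
            redis_instance.delete(key)

celery_app.conf.beat_schedule = {
    "cleanup-orphaned-reply-pidboxes": {
        "task": cleanup_orphaned_reply_pidboxes.name,
        "schedule": crontab(hour=3, minute=0),  # once a day, off-peak
    },
}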

mrchi avatar Aug 02 '23 09:08 mrchi