kombu
Orphaned reply pidboxes in Redis
Hello all. We use Celery 3.0.21 with a Redis backend (AWS ElastiCache), and recently I discovered huge memory usage on the Redis node. After dumping the Redis database and inspecting it, I found many reply pidboxes (~40), each holding one very large list element (about 16 MB) containing many duplicated scheduled tasks (more than a thousand).
I think the duplicated tasks are a different issue. Could anybody clarify the approach for deleting unused reply pidboxes? I compared the number of reply pidbox keys against the members of the _kombu.binding.reply.celery.pidbox key and found an inconsistency (41 in the bindings versus 77 keys in total).
And I would much appreciate it if anybody could explain how these orphaned pidboxes come to exist.
These replies are sent for remote control commands. I know flower is constantly sending out control commands, but I'm not sure there is an easy fix.
The best option would be if the replies were using PUBSUB instead of being stored in a list.
cc @mher
Are these replies safe to delete @ask?
I had the same problem. I disabled gossip and mingle to prevent these from getting created. I found that these were growing larger and larger due to an ever-growing list of revoked tasks in the mingle replies.
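For reference, gossip and mingle can also be turned off per worker on the command line, with the --without-gossip and --without-mingle flags to celery worker.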
For my future self and other googlers: this can be disabled in Django with:
CELERY_WORKER_ENABLE_REMOTE_CONTROL = False # celery 4
CELERY_ENABLE_REMOTE_CONTROL = False # celery 3
see: https://docs.celeryproject.org/en/stable/userguide/configuration.html#worker-enable-remote-control
And yes, had this problem twice because of the name change from v3 to v4 :(
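For non-Django setups, here is a minimal sketch of the same setting on the app object, assuming Celery 4+ (the app name and broker URL are placeholders):

from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")
# disables the remote control command bus for this app's workers,
# so no <uuid>.reply.celery.pidbox keys are created for them
app.conf.worker_enable_remote_control = False

Note that this also disables inspect/control commands for those workers (and with them most of what Flower does), so only use it if you don't need remote control.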
Was this fixed?
I also ran into the same problem. The way I handled it was to delete all the *.reply.celery.pidbox keys (except for _kombu.binding.reply.celery.pidbox) and then set CELERY_SEND_EVENTS = False. But I don't know what the ".reply.celery.pidbox" keys are actually used for.
In my project's case, we occasionally get a lot of revoked tasks (due to task expiration). All workers record the IDs locally, and when new workers come up they receive the revoked list from all the others. With a large number of revoked tasks and a large number of workers (dozens at least; we currently hover around ~100), all of that data is sent by every running worker to each new worker. If many new workers start at once then, at least with Redis, this can exhaust the Redis RAM (if it isn't large enough): the new workers pop these replies off their queues, but not fast enough, so a potentially huge backlog of data sits in Redis. At that point most or all new workers (and surely some old ones) get a Redis OOM error on connect and crash or fail to connect (at least via celery/kombu semantics). The reply queue name contains a UUID that is only valid during the runtime of the worker, so the workers die, restart, and the old reply queues are orphaned.
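For anyone in the same spot, a rough way to see how big these reply pidboxes have gotten (just a sketch; it assumes redis-py with direct access to the broker database, memory_usage needs Redis 4+, and the URL is a placeholder):

import redis

r = redis.from_url("redis://localhost:6379/0")
for key in r.scan_iter("*.reply.celery.pidbox"):
    name = key.decode()
    if name.startswith("_kombu.binding"):
        continue  # skip the binding key itself
    # each reply pidbox is a Redis list; report its length and memory footprint
    print(name, r.llen(key), r.memory_usage(key))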
I have opened an issue related to the LimitedSet implementation behind revoked tasks, since I think that is a big part of the problem (in certain cases).
I cannot speak to other cases (non-revoke, which yes, is part of mingle).
We have been deleting pidbox keys.
I've recently implemented custom revoke clear & purge (and expire and max) control commands so that we can manage this more officially (don't delete via Redis; just squash the data on the workers appropriately). I made a gist on my GitHub with a version of that implementation.
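Not the gist itself, but a minimal sketch of that kind of control command, assuming Celery 4+ (the name clear_revoked is illustrative, and the module defining it has to be imported by the worker):

from celery.worker import state as worker_state
from celery.worker.control import control_command

@control_command()
def clear_revoked(state):
    # drop this worker's in-memory revoked-task set so future mingle
    # replies (and therefore the reply pidbox payloads) shrink again
    worker_state.revoked.clear()
    return {'ok': 'revoked set cleared'}

It can then be run across the cluster with app.control.broadcast('clear_revoked', reply=True).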
Is there any update on this issue? This is causing an issue for my company's Celery cluster as well.
How can this issue be fixed? Thanks for any help!
Are you open to trying https://github.com/celery/kombu/issues/294#issuecomment-625558433?
Our solution to this issue is to periodically remove keys that have been idle for more than X days
red = app.backend.client
for key in red.scan_iter("*", 100000):
    # keep kombu's own binding keys (e.g. _kombu.binding.reply.celery.pidbox)
    if key.decode().startswith('_kombu'):
        continue
    # delete anything that has not been touched for 7 days
    if red.object('idletime', key) >= (24 * 7 * 3600):
        red.delete(key)
Actually, these keys are generated by the Celery broker, not the result backend. So if you are using different Redis databases for the broker and the result backend, the snippet above (which uses app.backend.client) won't work.
A small update for getting the Redis client from the Celery app instance's broker URL instead:
import redis

redis_instance = redis.from_url(celery_app.conf.broker_url)
for key in redis_instance.scan_iter("*", 100000):
    name = key.decode()
    # skip kombu's binding key, which also ends with "reply.celery.pidbox"
    if name.startswith("_kombu"):
        continue
    if name.endswith("reply.celery.pidbox") and redis_instance.object(
        "idletime", key
    ) >= (24 * 7 * 3600):
        redis_instance.delete(key)
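If it helps anyone, here is a rough sketch of running that cleanup periodically with celery beat (the task name and daily interval are made up; celery_app is assumed to be your Celery instance):

import redis

@celery_app.task
def prune_stale_reply_pidboxes():
    r = redis.from_url(celery_app.conf.broker_url)
    week = 7 * 24 * 3600
    for key in r.scan_iter("*reply.celery.pidbox", count=1000):
        if key.decode().startswith("_kombu"):
            continue  # keep kombu's binding key
        # delete reply pidboxes that have been idle for a week or more
        if r.object("idletime", key) >= week:
            r.delete(key)

celery_app.conf.beat_schedule = {
    "prune-stale-reply-pidboxes": {
        "task": prune_stale_reply_pidboxes.name,
        "schedule": 24 * 3600,  # once a day
    },
}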