
Celery Beat silently stops working after a period of time, without error

Open timmyomahony opened this issue 1 year ago • 30 comments

Summary:

Celery Beat silently fails after an unpredictable amount of time running on Digital Ocean App Platform, meaning tasks are no longer executed. There are no obvious indications in the logs.

My setup:

  • Django 5.1.2
  • django-celery-beat 2.7.0
  • Celery 5.4.0
  • Redis 7
  • Postgres 16

Exact steps to reproduce the issue:

  1. Deploy Celery Beat to Digital Ocean App Platform as part of a Django app
  2. Configure scheduled task via database scheduler
  3. Leave Celery Beat running
  4. After hours/days, tasks stop being run

Detailed information

I'm running Celery Beat on Digital Ocean App Platform (Docker-based deployments via buildpacks) via the command:

celery -A config.celery_app beat -l debug --scheduler django_celery_beat.schedulers:DatabaseScheduler

After an unpredictable amount of time (usually days) Celery Beat will stop running tasks. There are no console errors and the process doesn't crash; Celery simply stops running scheduled tasks silently. If I redeploy the app, tasks resume.

Having enabled debug logging, the final lines in the log are:

[celery-beat] [2024-11-13 00:49:27] [2024-11-13 00:49:27,236: DEBUG/MainProcess] beat: Waking up in 5.00 seconds.
[celery-beat] [2024-11-13 00:49:32] [2024-11-13 00:49:32,237: DEBUG/MainProcess] beat: Synchronizing schedule...
[celery-beat] [2024-11-13 00:49:32] [2024-11-13 00:49:32,238: DEBUG/MainProcess] Writing entries...
[celery-beat] [2024-11-13 00:49:32] [2024-11-13 00:49:32,278: DEBUG/MainProcess] beat: Waking up in 5.00 seconds.
[celery-beat] [2024-11-13 00:49:37] [2024-11-13 00:49:37,311: DEBUG/MainProcess] beat: Waking up in 5.00 seconds.
[celery-beat] [2024-11-13 00:49:42] [2024-11-13 00:49:42,348: DEBUG/MainProcess] beat: Waking up in 5.00 seconds.
[celery-beat] [2024-11-13 00:49:47] [2024-11-13 00:49:47,381: DEBUG/MainProcess] beat: Waking up in 5.00 seconds.
[celery-beat] [2024-11-13 00:49:52] [2024-11-13 00:49:52,415: DEBUG/MainProcess] beat: Waking up in 5.00 seconds.
[celery-beat] [2024-11-13 00:49:57] [2024-11-13 00:49:57,447: DEBUG/MainProcess] beat: Waking up in 2.54 seconds.
[celery-beat] [2024-11-13 00:50:00] [2024-11-13 00:50:00,068: INFO/MainProcess] Scheduler: Sending due task Celery uptime heartbeat (fetch.utils.tasks.celery_uptime_heartbeat)
[celery-beat] [2024-11-13 00:50:00] [2024-11-13 00:50:00,091: DEBUG/MainProcess] fetch.utils.tasks.celery_uptime_heartbeat sent. id->657d0fd4-222a-474f-9c8f-13142909c69b

The final line here is a scheduled task that I'm using to send heartbeats to an uptime monitor. I set up this task to help diagnose the issue and track when it occurs - there's nothing wrong with this task specifically.

  • I've looked at the metrics and there is no issue with resources (there is enough RAM and CPU).
  • I'm running a similar setup in a separate project (also on Digital Ocean App Platform) which doesn't have this issue (same Celery Beat versions).

I'm unsure how to investigate this issue further.

timmyomahony avatar Nov 13 '24 05:11 timmyomahony

Hi, I'm encountering the same issue while running the following setup on AWS:

  • django 5.1.4
  • django-celery-beat 2.7.0
  • redis 7.0.5
  • postgres 14.3

Has anyone experienced this before or have any recommendations for troubleshooting?

Thanks!

rapsealk avatar Jan 22 '25 07:01 rapsealk

Do you happen to be using connection pooling? I was using psycopg[pool]. By removing [pool] I was able to stop the issue. Not sure exactly why it was happening.
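
For reference, here is roughly where that option lives in a Django 5.x settings file, assuming pooling was enabled via Django's built-in psycopg 3 pool support. This is a sketch with placeholder values, not the setup described above:

# settings.py - sketch with placeholder values; assumes pooling was enabled via
# Django 5.x's psycopg 3 connection-pool support (which needs the psycopg[pool] extra).
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app_db",   # placeholder
        "HOST": "db",       # placeholder
        "OPTIONS": {
            # "pool": True,  # removing this (and the psycopg[pool] extra) is the
            #                # change described above; Django then falls back to its
            #                # normal connection handling (governed by CONN_MAX_AGE)
        },
    },
}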

timmyomahony avatar Jan 22 '25 08:01 timmyomahony

Thank you! I’m currently using psycopg2==2.9.10 without the [pool] option. I’ll try testing with some different database settings to see if that helps.

rapsealk avatar Jan 22 '25 09:01 rapsealk

Any update on the root cause of this issue?

stephappiah avatar Jun 09 '25 10:06 stephappiah

psycopg2

How did you resolve this?

stephappiah avatar Jun 09 '25 10:06 stephappiah

I have the same problem. I use

celery==5.5.3
hiredis==3.2.1
redis==6.2.0
Django==5.2.3

As the Redis server I use Valkey 8.0. I start Celery with the command celery -A myapp beat -l INFO. Today, after a little more than 12 days, Celery Beat also stopped sending tasks. I am running Celery Beat in a Kubernetes deployment. When I restart the pod, Celery Beat sends tasks again.

bast-ii avatar Jun 24 '25 06:06 bast-ii

The same issue with:

  • celery==5.5.3
  • django-celery-beat==2.5.0

It just gets stuck without any errors:

...
[2025-06-25 02:00:00,137: INFO/MainProcess] Scheduler: Sending due task TASK_NAME

igorMIA avatar Jun 25 '25 11:06 igorMIA

I have the exact same issue, felt like I could have written this post!!! Has anyone figured this out yet? I've been trying to fix this for months with no luck :(

jeparalta avatar Jul 03 '25 20:07 jeparalta

I encountered the same issue using DoctorDroid with:

  • celery==5.5.3
  • django-celery-beat==2.4.0

The problem is that celery-beat silently stops working without raising any exceptions or emitting logs. I suspect two potential root causes:

  1. Network-related interruptions.
  2. Timezone or clock synchronization issues between nodes (my deployment is on a self-managed Kubernetes cluster in a private cloud).

To isolate the issue, I tested scenarios where celery-beat lost its connection to the database by manually disconnecting it from Redis and PostgreSQL. In those cases, exceptions were logged to the celery-beat log file, and celery-beat attempted to reconnect once the database became available again.

However, in this case celery-beat simply stops scheduling, without any errors or logs.

Our temporary solution is a sidecar container that continuously monitors the celery-beat log file. If no new logs are detected within an interval (10s, 20s, ...), it restarts the celery-beat process.
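
As an illustration of that watchdog approach (not the poster's actual sidecar), a minimal Python sketch might look like the following; the log path, thresholds, and the way an exit turns into a restart are all assumptions:

# watchdog.py - illustrative sketch only, not the sidecar described above.
# Assumes beat logs to /var/log/celery/beat.log; how a non-zero exit becomes a
# restart (liveness probe, shared process namespace, supervisor) depends on the pod spec.
import os
import sys
import time

LOG_FILE = "/var/log/celery/beat.log"  # assumed path
MAX_SILENCE = 60                       # seconds without new log writes before giving up

while True:
    time.sleep(10)
    try:
        age = time.time() - os.path.getmtime(LOG_FILE)
    except OSError:
        continue  # log file not created yet
    if age > MAX_SILENCE:
        print(f"beat log silent for {age:.0f}s, signalling restart", flush=True)
        sys.exit(1)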

congnghiahieu avatar Jul 04 '25 02:07 congnghiahieu

I don't know why it worked, but it did when I rolled back to the previous service build. I think it's related to library versions. These versions work for me:

  • celery==5.4.0
  • django-celery-beat==2.5.0
  • redis==5.2.1
  • Django==4.2.19

igorMIA avatar Jul 04 '25 08:07 igorMIA

Our temporary solution is a sidecar container that continuously monitors the celery-beat log file. If no new logs are detected within an interval (10s, 20s, ...), it restarts the celery-beat process.

Good idea! Could you share the sidecar implementation?

bast-ii avatar Jul 04 '25 09:07 bast-ii

I have the same problem:

celery=5.5.2
redis=6.1.0
django=5.1.5

I start Celery with the command celery -A myapp beat -l INFO.

kamalfarahani avatar Jul 11 '25 11:07 kamalfarahani

I'm having the same problem with library versions already reported in this thread. I'm using Python 3.9; could this issue be related to an interaction between the libraries and a specific Python version?

ivanmviveros avatar Jul 18 '25 18:07 ivanmviveros

Just to note that although removing the pool option seemed to help, I'm still occasionally having the same issue in only one of my environments. I have multiple environments deployed (prod, staging, ...) and only one is affected, leading me to believe this is some sort of issue with the underlying Docker VM and versions.

timmyomahony avatar Jul 18 '25 18:07 timmyomahony

Also, to potentially help others, I created a Celery task on the host itself to send a heartbeat to UptimeRobot every 5 minutes. This helps me see when celery-beat has stopped, after which I restart it manually. Not ideal, but it helps with debugging.
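
For anyone who wants to replicate this, here is a sketch of such a heartbeat task (not the poster's code). The monitor URL is a placeholder, and the 5-minute schedule would be created separately via django-celery-beat:

# tasks.py - sketch of an uptime-heartbeat task like the one described above;
# the URL is a placeholder and the schedule is configured via django-celery-beat.
import urllib.request

from celery import shared_task

HEARTBEAT_URL = "https://heartbeat.uptimerobot.com/<your-monitor-key>"  # placeholder

@shared_task
def celery_uptime_heartbeat():
    # If beat stops dispatching this task, the monitor stops receiving pings
    # and raises an alert.
    urllib.request.urlopen(HEARTBEAT_URL, timeout=10)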

timmyomahony avatar Jul 18 '25 18:07 timmyomahony

I posted here recently with the same issue and since then I found a solution to my problem. In my case (also using Digital Ocean App Platform) the issue was related to how the Redis managed database handles idle connections. When an idle connection was removed but Celery tried to store a backend result over it, Celery Beat would silently fail and end up in a corrupted state.

The solution was to disable backend results in the Celery settings (make sure you also remove any environment variables for the Celery result backend). This immediately fixed my issue.

I can add the specific settings later tonight when I'm back at my computer. Hope this helps!

jeparalta avatar Jul 18 '25 18:07 jeparalta

In my case (also using Digital Ocean App Platform) the issue was related to how the Redis managed database handles idle connections. The solution was to disable backend results in the Celery settings.

Interesting, I hadn't considered Redis being the issue (I'm also using DO hosted Redis). If you could share the config later that would be great. Not sure why I'm only experiencing it on one environment. I'll have to check if I have separate Redis configs for prod vs staging. Thanks for the update.

timmyomahony avatar Jul 18 '25 18:07 timmyomahony

Here are the specifics of my fix.

The issue causing the tasks to stop running seems to have been related to how Digital Ocean managed databases deal with idle connections. I was using Redis for caching (database 0), as my Celery broker (database 1), and for my Celery backend results (database 2). This all worked fine until some idle connections were closed and Celery then tried to use one again to write a backend result. This would somehow put the Celery Beat scheduler into a corrupted state that made it stop sending new tasks to Celery.

Solution:

Since I'm not using tasks in a way that actually needs the results kept, I completely disabled results in the Celery settings. This involved updating the Django settings to:

CELERY_RESULT_BACKEND = None
CELERY_TASK_IGNORE_RESULT = True

I also removed the environment variable from Digital Ocean to make sure the backend was disabled. When Celery starts up, it should look something like this:

transport: redis://redis:6379/0 results: disabled://
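
For completeness, here is a minimal sketch of where those settings are picked up, assuming the common config_from_object(..., namespace="CELERY") pattern; the module and app names are illustrative, not taken from the project above:

# celery.py - sketch only; assumes the usual namespace="CELERY" configuration,
# under which CELERY_RESULT_BACKEND / CELERY_TASK_IGNORE_RESULT in the Django
# settings map to Celery's result_backend / task_ignore_result options.
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings")  # illustrative

app = Celery("config")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()

With the result backend disabled this way, the startup banner should show results: disabled:// as quoted above.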

jeparalta avatar Jul 21 '25 21:07 jeparalta

I have a similar issue and can't find anything about it. I'm running Celery, Celery Beat, and Redis on a VPS. Beat is set to a 10-second interval and stops working after roughly 60 hours.

Fritskee avatar Jul 24 '25 14:07 Fritskee

Having a similar issue. I'm running Celery tasks on AWS and sometimes the tasks just get missed without errors. This is because they are not getting scheduled by celery-beat at all.

rishijatia avatar Jul 27 '25 14:07 rishijatia

Same here. It hangs randomly, sometimes after 1 hour, sometimes after 2 days. Remedied by restarting the container every 30 minutes.

python = "^3.13"
Django = "^5.2.5"
celery = "^5.5.3"
django-celery-beat = "^2.8.1"
CELERY_BEAT_SCHEDULER = "django_celery_beat.schedulers:DatabaseScheduler"
CELERY_RESULT_EXTENDED = True

DbScheduler, redis result backend. DB with pool and psycopg3.

Tried this, without success:

app.conf.broker_pool_limit = 0
app.conf.broker_channel_error_retry = True

Object905 avatar Aug 14 '25 07:08 Object905

Same here. It hangs randomly, sometimes after 1 hour, sometimes after 2 days. Remedied by restarting the container every 30 minutes. [...] DbScheduler, redis result backend. DB with pool and psycopg3.

Did you try disabling the results backend as mentioned in the message above?

jeparalta avatar Aug 14 '25 09:08 jeparalta

I need results backend. Will try to change it to DB. But redis is most optimal for my use case because of lots of small tasks.

Object905 avatar Aug 14 '25 11:08 Object905

Tried using the DB result backend and a Redis broker instead of RabbitMQ - the problem remains.

Object905 avatar Aug 21 '25 06:08 Object905

Here are the specifics of my fix: [...] Since I'm not using tasks in a way that actually needs the results kept, I completely disabled results in the Celery settings (CELERY_RESULT_BACKEND = None, CELERY_TASK_IGNORE_RESULT = True) and removed the environment variable from Digital Ocean to make sure the backend was disabled.

Thanks @jeparalta, I changed the configuration as you suggested six days ago and haven't had any problems since, so that seems to fix the problem for now. However, if the result backend is required, this solution doesn't work.

I'm surprised that celery beat doesn't throw an error. That would be best, because then you could automatically restart the process/pod.
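
One way to get that automatic restart even without an error, for setups using django-celery-beat's DatabaseScheduler, is an external staleness check against the scheduler's own bookkeeping, run from a cron job or an exec liveness probe. A sketch (the task name, threshold, and settings module are assumptions, not taken from any setup in this thread):

# check_beat.py - sketch of an external staleness check; a non-zero exit means
# "beat looks stuck, restart it". Works only with the DatabaseScheduler, since it
# relies on PeriodicTask.last_run_at being updated when beat dispatches the task.
import datetime
import os
import sys

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings")  # assumed module
django.setup()

from django.utils import timezone
from django_celery_beat.models import PeriodicTask

HEARTBEAT_TASK = "myapp.tasks.celery_uptime_heartbeat"  # any frequently scheduled task
MAX_AGE = datetime.timedelta(minutes=15)                # assumed threshold

task = PeriodicTask.objects.filter(task=HEARTBEAT_TASK).first()
if task and task.last_run_at and timezone.now() - task.last_run_at > MAX_AGE:
    sys.exit(1)  # beat has not dispatched the task recently
sys.exit(0)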

bast-ii avatar Aug 25 '25 05:08 bast-ii

Thanks @jeparalta, I changed the configuration as you suggested six days ago and haven't had any problems since, so that seems to fix the problem for now. However, if the result backend is required, this solution doesn't work.

I'm surprised that celery beat doesn't throw an error. That would be best, because then you could automatically restart the process/pod.

@bast-ii I guess if you need backend results you could still have them on another service? Maybe outside of DO App Platform, on a Droplet or something. I'm not really sure, as I haven't looked into this, but it could be a possibility.

jeparalta avatar Aug 25 '25 12:08 jeparalta

I need results backend. Will try to change it to DB. But redis is most optimal for my use case because of lots of small tasks.

@Object905 could you not use a separate Redis instance for the results?

jeparalta avatar Aug 25 '25 12:08 jeparalta

@bast-ii I guess if you need backend results you could still have them on another service? Maybe outside of DO App Platform, on a Droplet or something. I'm not really sure, as I haven't looked into this, but it could be a possibility.

I don't need the backend results, so I don't know if that would work. Have you tried to find out what the problem is? Why does it work without the result backend?

bast-ii avatar Aug 25 '25 13:08 bast-ii

I need results backend. Will try to change it to DB. But redis is most optimal for my use case because of lots of small tasks.

@Object905 could you not use a separate Redis instance for the results?

Right now I'm using the django-db results backend and the problem remains, so I don't think a separate instance would solve this.

Object905 avatar Aug 25 '25 14:08 Object905

Maybe this issue is in the wrong repository, because we also run into this situation and, for legacy reasons, are not using the django-celery-beat package. We only use multiple Celery workers with a beat instance and shelve storage on Django 4.2, with Redis 8 as broker and result backend.

One note: before the update, we ran django==3.2 and celery==5.4 with a Redis 7 broker and did not have any problems.

Our versions:

django==4.2.24
django-redis==6.0.0
celery==5.5.3
redis==6.2.0
psycopg2-binary==2.9.10

Broker is a self hosted redis 8.

To find some commonalities: everything runs in Docker 24.0.2 containers within a VMware ESX VM with CentOS.

But we found one way to reproduce this situation (in our setup): when we simply hard-restart Redis, beat still executes heartbeats, but no more scheduled tasks appear in the logs, and there are no errors at the "debug" log level.

[2025-09-30 13:19:24,650: DEBUG/MainProcess] {"message": "Server heartbeat succeeded", "topologyId": {"$oid": "68dbbc6c77a29039536a25ae"}, "driverConnectionId": 1, "serverConnectionId": 1661240, "serverHost": "db", "reply": "{\"isWritablePrimary\": true, \"topologyVersion\": {\"processId\": {\"$oid\": \"68b04764690239bf27525385\"}}, \"maxBsonObjectSize\": 16777216, \"maxMessageSizeBytes\": 48000000, \"maxWriteBatchSize\": 100000, \"localTime\": {\"$date\": \"2025-09-30T11:19:24.650Z\"}, \"logicalSessionTimeoutMinutes\": 30, \"connectionId\": 1661240, \"maxWireVersion\": 21, \"ok\": 1.0}"}
[2025-09-30 13:19:24,651: DEBUG/MainProcess] {"message": "Server heartbeat started", "topologyId": {"$oid": "68dbbc6c77a29039536a25ae"}, "driverConnectionId": 1, "serverConnectionId": 1661240, "serverHost": "db", "awaited": true}
[2025-09-30 13:19:34,661: DEBUG/MainProcess] {"message": "Server heartbeat succeeded", "topologyId": {"$oid": "68dbbc6c77a29039536a25ae"}, "driverConnectionId": 1, "serverConnectionId": 1661240, "serverHost": "db", "reply": "{\"isWritablePrimary\": true, \"topologyVersion\": {\"processId\": {\"$oid\": \"68b04764690239bf27525385\"}}, \"maxBsonObjectSize\": 16777216, \"maxMessageSizeBytes\": 48000000, \"maxWriteBatchSize\": 100000, \"localTime\": {\"$date\": \"2025-09-30T11:19:34.660Z\"}, \"logicalSessionTimeoutMinutes\": 30, \"connectionId\": 1661240, \"maxWireVersion\": 21, \"ok\": 1.0}"}
[2025-09-30 13:19:34,661: DEBUG/MainProcess] {"message": "Server heartbeat started", "topologyId": {"$oid": "68dbbc6c77a29039536a25ae"}, "driverConnectionId": 1, "serverConnectionId": 1661240, "serverHost": "db", "awaited": true}

When the log level is not "debug", the log simply stops at the last scheduled task before we restarted Redis. I would expect some type of "connection error" or similar.
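
For anyone who wants dropped Redis connections to surface as errors rather than a silent hang, a possible starting point (not a confirmed fix for this issue) is the Redis transport's keepalive and health-check options; the values below are illustrative:

# Illustrative settings, not a confirmed fix: make broker and result-backend
# connections send keepalives and periodic health checks so that a dead Redis
# connection is detected instead of hanging silently.
from celery import Celery

app = Celery("myapp")  # illustrative; in practice this is the existing app instance

app.conf.broker_transport_options = {
    "socket_keepalive": True,     # TCP keepalive on broker connections
    "socket_timeout": 30,         # fail blocking socket operations after 30s
    "health_check_interval": 10,  # ping the broker connection every 10s
}
app.conf.redis_backend_health_check_interval = 10  # same idea for a Redis result backend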

swarnat avatar Sep 30 '25 11:09 swarnat