
Worker won't reconnect to Redis after a connection drop

Open BallistiX09 opened this issue 2 years ago • 7 comments

Describe the bug
When the worker disconnects from the Redis container for any reason (in my case, updating the Redis container), it fails to reconnect and ends up stuck in an unhealthy state until it is manually restarted. This also breaks its connection to Authentik.

To Reproduce

  1. Set up both the worker and Redis in a running, healthy state.
  2. Stop the Redis container and wait until the worker starts logging connection errors and attempting to reconnect.
  3. Start the Redis container.
  4. The worker enters an unhealthy state after the Redis container comes back up, and no longer reconnects to Authentik.

Expected behavior
The worker should recover from a dropped connection automatically, without needing a manual restart.
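For reference, the retry cadence visible in the worker logs below (2 s, 4 s, 6 s, … across a capped number of attempts) corresponds to a simple linear backoff. A minimal sketch of the kind of reconnect loop this expectation implies — the `connect` callable is hypothetical, standing in for the worker's broker connection, not authentik's actual code:

```python
import time

def reconnect_with_backoff(connect, step=2.0, max_attempts=100):
    """Retry `connect` with linear backoff (2 s, 4 s, 6 s, ...),
    mirroring the cadence seen in the Celery consumer logs."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            # Wait longer after each failed attempt: 2.00, 4.00, 6.00, ... seconds
            time.sleep(step * attempt)
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

The point of the expectation above is that the loop eventually succeeds once Redis is back; the bug report is that in practice the worker never recovers even after the broker returns.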

Logs

{"event": "/ak-root/venv/lib/python3.12/site-packages/celery/worker/consumer/consumer.py:391: CPendingDeprecationWarning: \nIn Celery 5.1 we introduced an optional breaking change which\non connection loss cancels all currently executed tasks with late acknowledgement enabled.\nThese tasks cannot be acknowledged as the connection is gone, and the tasks are automatically redelivered\nback to the queue. You can enable this behavior using the worker_cancel_long_running_tasks_on_connection_loss\nsetting. In Celery 5.1 it is set to False by default. The setting will be set to True by default in Celery 6.0.\n\n  warnings.warn(CANCEL_TASKS_BY_DEFAULT, CPendingDeprecationWarning)\n", "level": "warning", "logger": "py.warnings", "timestamp": 1713077143.2660928}
{"event": "/ak-root/venv/lib/python3.12/site-packages/celery/worker/consumer/consumer.py:507: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine\nwhether broker connection retries are made during startup in Celery 6.0 and above.\nIf you wish to retain the existing behavior for retrying connections on startup,\nyou should set broker_connection_retry_on_startup to True.\n  warnings.warn(\n", "level": "warning", "logger": "py.warnings", "timestamp": 1713077143.2694874}
{"event": "consumer: Cannot connect to redis://redis-authentik:6379/0: Error 111 connecting to redis-authentik:6379. Connection refused..\nTrying again in 2.00 seconds... (1/100)\n", "level": "error", "logger": "celery.worker.consumer.consumer", "timestamp": 1713077143.27144}
{"event": "consumer: Cannot connect to redis://redis-authentik:6379/0: Error -2 connecting to redis-authentik:6379. Name or service not known..\nTrying again in 4.00 seconds... (2/100)\n", "level": "error", "logger": "celery.worker.consumer.consumer", "timestamp": 1713077145.4763987}
{"event": "consumer: Cannot connect to redis://redis-authentik:6379/0: Error -2 connecting to redis-authentik:6379. Name or service not known..\nTrying again in 6.00 seconds... (3/100)\n", "level": "error", "logger": "celery.worker.consumer.consumer", "timestamp": 1713077149.5093431}
{"event": "consumer: Cannot connect to redis://redis-authentik:6379/0: Error -2 connecting to redis-authentik:6379. Name or service not known..\nTrying again in 8.00 seconds... (4/100)\n", "level": "error", "logger": "celery.worker.consumer.consumer", "timestamp": 1713077155.543255}
{"event": "TenantAwareScheduler: Sending due task policies_reputation_save (authentik.policies.reputation.tasks.save_reputation) to all tenants", "level": "info", "logger": "tenant_schemas_celery.scheduler", "timestamp": 1713077160.0004525}
Process Beat:
Traceback (most recent call last):
  File "/ak-root/venv/lib/python3.12/site-packages/billiard/process.py", line 323, in _bootstrap
    self.run()
  File "/ak-root/venv/lib/python3.12/site-packages/celery/beat.py", line 718, in run
    self.service.start(embedded_process=True)
  File "/ak-root/venv/lib/python3.12/site-packages/celery/beat.py", line 643, in start
    interval = self.scheduler.tick()
               ^^^^^^^^^^^^^^^^^^^^^
  File "/ak-root/venv/lib/python3.12/site-packages/celery/beat.py", line 353, in tick
    self.apply_entry(entry, producer=self.producer)
  File "/ak-root/venv/lib/python3.12/site-packages/tenant_schemas_celery/scheduler.py", line 97, in apply_entry
    for tenant in tenants:
  File "/ak-root/venv/lib/python3.12/site-packages/django/db/models/query.py", line 400, in __iter__
    self._fetch_all()
  File "/ak-root/venv/lib/python3.12/site-packages/django/db/models/query.py", line 1928, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ak-root/venv/lib/python3.12/site-packages/django/db/models/query.py", line 91, in __iter__
    results = compiler.execute_sql(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/ak-root/venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 1560, in execute_sql
    cursor = self.connection.cursor()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ak-root/venv/lib/python3.12/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/ak-root/venv/lib/python3.12/site-packages/django/db/backends/base/base.py", line 316, in cursor
    return self._cursor()
           ^^^^^^^^^^^^^^
  File "/ak-root/venv/lib/python3.12/site-packages/django_tenants/postgresql_backend/base.py", line 171, in _cursor
    cursor_for_search_path.execute('SET search_path = {0}'.format(','.join(formatted_search_paths)))
  File "/ak-root/venv/lib/python3.12/site-packages/django_prometheus/db/common.py", line 69, in execute
    return super().execute(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ak-root/venv/lib/python3.12/site-packages/psycopg/cursor.py", line 732, in execute
    raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
{"event": "consumer: Cannot connect to redis://redis-authentik:6379/0: Error -2 connecting to redis-authentik:6379. Name or service not known..\nTrying again in 10.00 seconds... (5/100)\n", "level": "error", "logger": "celery.worker.consumer.consumer", "timestamp": 1713077163.5772495}
{"event": "/ak-root/venv/lib/python3.12/site-packages/celery/worker/consumer/consumer.py:507: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine\nwhether broker connection retries are made during startup in Celery 6.0 and above.\nIf you wish to retain the existing behavior for retrying connections on startup,\nyou should set broker_connection_retry_on_startup to True.\n  warnings.warn(\n", "level": "warning", "logger": "py.warnings", "timestamp": 1713077173.5946565}
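The deprecation warnings in the log above name two Celery settings that are relevant to this reconnect behaviour. A hedged sketch of what opting in to both would look like — the setting names are quoted verbatim from the warnings, but whether authentik exposes them to deployers is an assumption:

```python
# Celery settings named verbatim in the deprecation warnings above.
CELERY_CONFIG = {
    # Keep retrying broker connections during startup (this becomes
    # opt-in via this flag in Celery 6.0 and above).
    "broker_connection_retry_on_startup": True,
    # On connection loss, cancel tasks with late acknowledgement so they
    # are redelivered to the queue (default flips to True in Celery 6.0).
    "worker_cancel_long_running_tasks_on_connection_loss": True,
}
```

These flags govern startup retries and task redelivery; neither is confirmed by this thread to fix the stuck-unhealthy state itself.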

Version and Deployment:

  • Authentik version: 2024.2.2
  • Deployment: Docker (Unraid template)

Additional context
This issue was reported here a few months ago, but never received any response: https://github.com/goauthentik/authentik/issues/7521

BallistiX09 avatar Apr 14 '24 13:04 BallistiX09

Hi,

I can confirm this behaviour too: the container goes into an "unhealthy" state in Portainer when I update Redis.

EHRETic avatar Apr 19 '24 06:04 EHRETic

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Hi there,

I tested it again; it is still the case with version 2024.4.2.

EHRETic avatar Jun 19 '24 05:06 EHRETic

Can confirm this is happening too.

JustJoostNL avatar Jun 22 '24 10:06 JustJoostNL

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Still an issue

JustJoostNL avatar Aug 22 '24 08:08 JustJoostNL

Even though it's not a full solution, this should be handled by the container's healthcheck command, which runs `ak healthcheck`: it checks for any broken connections and will restart the container.
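For anyone relying on that, a sketch of what wiring this up in Docker Compose could look like. This is a minimal sketch under stated assumptions: the image tag and healthcheck parameters are illustrative, and since Docker itself only marks a container unhealthy, an external watcher (such as willfarrell/autoheal, matched by the label below) is what actually performs the restart:

```yaml
# Sketch only: compose fragment for the authentik worker, assuming the
# stock image whose healthcheck runs `ak healthcheck`.
worker:
  image: ghcr.io/goauthentik/server:2024.2.2
  command: worker
  healthcheck:
    test: ["CMD", "ak", "healthcheck"]
    interval: 30s
    timeout: 10s
    retries: 3
  labels:
    # Docker only flags the container unhealthy; a watcher such as
    # willfarrell/autoheal is needed to actually restart it.
    - autoheal=true
```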

BeryJu avatar Aug 22 '24 17:08 BeryJu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Hi there,

Still the case with 2024.8.3 so please don't close the issue 😊

EHRETic avatar Oct 22 '24 05:10 EHRETic

Also getting this issue; it's getting annoying.

I have a container that automatically restarts unhealthy containers, so it restarts Authentik, then I get logged out, and the cycle repeats.

Aetherinox avatar Nov 21 '24 04:11 Aetherinox

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This is still an issue and doesn't seem to have been fixed yet

BallistiX09 avatar Jan 21 '25 13:01 BallistiX09

I confirm it too; it's still there.

EHRETic avatar Jan 21 '25 15:01 EHRETic

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I believe this is still an issue.

JustJoostNL avatar Mar 23 '25 06:03 JustJoostNL

I just checked, it is still an issue.

EHRETic avatar Mar 23 '25 14:03 EHRETic

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Still seems to be an issue on my end

BallistiX09 avatar May 29 '25 00:05 BallistiX09