authentik
Worker container on Docker causes high cpu load
Describe the bug
Right after starting my docker-compose setup, which is based on the provided docker-compose.yml file, the worker container causes high CPU load. It looks like the system tasks are fired continuously, every second.
To Reproduce
Steps to reproduce the behavior:
- Run docker-compose up
- Run docker-compose top
- You'll find one worker process with high CPU load (in my setup)
- Run docker-compose stop worker
- The high CPU load is gone

Expected behavior
I would expect the system tasks not to be fired every second or continuously, and not to consume so much CPU.
Screenshots
You can see the high CPU load in btop.
Logs
$ docker-compose top
sso-geoipupdate-1
UID PID PPID C STIME TTY TIME CMD
root 18742 18633 0 11:44 ? 00:00:00 /bin/sh /usr/bin/entry.sh
root 18991 18742 0 11:44 ? 00:00:00 sleep 28800
sso-redis-1
UID PID PPID C STIME TTY TIME CMD
kdcproxy 18613 18538 0 11:44 ? 00:00:01 redis-server *:6379
sso-server-1
UID PID PPID C STIME TTY TIME CMD
adminis+ 18735 18677 0 11:44 ? 00:00:00 /bin/bash -e /lifecycle/ak server
adminis+ 19009 18735 0 11:44 ? 00:00:01 /authentik-proxy
adminis+ 19015 19009 0 11:44 ? 00:00:00 /usr/local/bin/python /usr/local/bin/gunicorn -c ./lifecycle/gunicorn.conf.py authentik.root.asgi:application
adminis+ 19026 19015 1 11:44 ? 00:00:05 /usr/local/bin/python /usr/local/bin/gunicorn -c ./lifecycle/gunicorn.conf.py authentik.root.asgi:application
adminis+ 19028 19015 1 11:44 ? 00:00:05 /usr/local/bin/python /usr/local/bin/gunicorn -c ./lifecycle/gunicorn.conf.py authentik.root.asgi:application
sso-worker-1
UID PID PPID C STIME TTY TIME CMD
root 21788 21766 0 11:50 ? 00:00:00 /bin/bash -e /lifecycle/ak worker
adminis+ 21880 21788 55 11:50 ? 00:00:15 /usr/local/bin/python /usr/local/bin/celery -A authentik.root.celery worker -Ofair --max-tasks-per-child=1 --autoscale 3,1 -E -B -s /tmp/celerybeat-schedule -Q authentik,authentik_scheduled,authentik_events
adminis+ 21888 21880 6 11:50 ? 00:00:01 /usr/local/bin/python /usr/local/bin/celery -A authentik.root.celery worker -Ofair --max-tasks-per-child=1 --autoscale 3,1 -E -B -s /tmp/celerybeat-schedule -Q authentik,authentik_scheduled,authentik_events
adminis+ 22743 21880 8 11:51 ? 00:00:00 /usr/local/bin/python /usr/local/bin/celery -A authentik.root.celery worker -Ofair --max-tasks-per-child=1 --autoscale 3,1 -E -B -s /tmp/celerybeat-schedule -Q authentik,authentik_scheduled,authentik_events
adminis+ 22744 21880 0 11:51 ? 00:00:00 /usr/local/bin/python /usr/local/bin/celery -A authentik.root.celery worker -Ofair --max-tasks-per-child=1 --autoscale 3,1 -E -B -s /tmp/celerybeat-schedule -Q authentik,authentik_scheduled,authentik_events
Version and Deployment (please complete the following information):
- authentik version: 2022.6.3
- Deployment: docker-compose
Additional context
Right now I cannot find any errors in the logs. I use JumpCloud as an LDAP backend.
Any help or assistance is appreciated.
KR, MiGoller
I see the same behavior since updating to 2022.7.2. The worker container is using 100% CPU continuously.
docker stats authentik_worker
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
26dc96fb423d authentik_worker 100.19% 222.1MiB / 15.41GiB 1.41% 976kB / 1.6MB 5.95MB / 8.19kB 4
Please post the logs. None of the tasks are scheduled to run that often, so the only way this could happen is if there are a lot of tasks in the queue, and in that case the CPU should eventually go down after they are all done. (I was also not able to reproduce this on compose or k8s.)
These are the logs after recreating the worker container; CPU stays at 100% without any further updates to the logs.
docker logs authentik_worker
{"event": "Loaded config", "level": "debug", "logger": "authentik.lib.config", "timestamp": 1657304349.2430956, "file": "/authentik/lib/default.yml"}
{"event": "Loaded environment variables", "level": "debug", "logger": "authentik.lib.config", "timestamp": 1657304349.2435303, "count": 13}
{"event": "Starting authentik bootstrap", "level": "info", "logger": "__main__", "timestamp": 1657304349.243698}
{"event": "Finished authentik bootstrap", "level": "info", "logger": "__main__", "timestamp": 1657304349.2512321}
{"event": "Bootstrap completed", "level": "info", "logger": "bootstrap"}
{"event": "Not running as root, disabling permission fixes", "level": "info", "logger": "bootstrap"}
{"event": "Loaded config", "level": "debug", "logger": "authentik.lib.config", "timestamp": 1657304349.9527614, "file": "/authentik/lib/default.yml"}
{"event": "Loaded environment variables", "level": "debug", "logger": "authentik.lib.config", "timestamp": 1657304349.9530528, "count": 13}
{"event": "Booting authentik", "level": "info", "logger": "authentik.root.settings", "timestamp": 1657304350.2223425, "version": "2022.7.2"}
{"event": "Failed to load GeoIP database", "exc": "FileNotFoundError(2, 'No such file or directory')", "level": "warning", "logger": "authentik.events.geo", "pid": 8, "timestamp": "2022-07-08T18:19:10.672939"}
-------------- celery@eb46e4aafa75 v5.2.7 (dawn-chorus)
--- ***** -----
-- ******* ---- Linux-5.18.9-200.fc36.x86_64-x86_64-with-glibc2.31 2022-07-08 18:19:11
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: authentik:0x7f6a04d95fc0
- ** ---------- .> transport: redis://authentik_redis:6379/1
- ** ---------- .> results: redis://authentik_redis:6379/1
- *** --- * --- .> concurrency: {min=1, max=3} (prefork)
-- ******* ---- .> task events: ON
--- ***** -----
-------------- [queues]
.> authentik exchange=authentik(direct) key=authentik
.> authentik_events exchange=authentik_events(direct) key=authentik_events
.> authentik_scheduled exchange=authentik_scheduled(direct) key=authentik_scheduled
{"event": "/usr/local/lib/python3.10/site-packages/celery/app/utils.py:204: CDeprecationWarning: \n The 'CELERY_RESULT_BACKEND' setting is deprecated and scheduled for removal in\n version 6.0.0. Use the result_backend instead\n\n deprecated.warn(description=f'The {setting!r} setting',\n", "level": "warning", "logger": "py.warnings", "timestamp": 1657304351.8882895}
{"event": "Please run `celery upgrade settings path/to/settings.py` to avoid these warnings and to allow a smoother upgrade to Celery 6.0.", "level": "warning", "logger": "celery.apps.worker", "timestamp": 1657304351.8884904}
{"event": "Dispatching startup tasks...", "level": "info", "logger": "authentik.root.celery", "pid": 8, "timestamp": "2022-07-08T18:19:13.013312"}
{"event": "Task published", "level": "info", "logger": "authentik.root.celery", "pid": 8, "task_id": "a619e62a-0c63-4951-a83e-7d2a0624042a", "task_name": "authentik.admin.tasks.clear_update_notifications", "timestamp": "2022-07-08T18:19:13.016323"}
{"event": "Task published", "level": "info", "logger": "authentik.root.celery", "pid": 8, "task_id": "49c02b90-4e7b-4f68-8a1f-b6a0f5d11528", "task_name": "authentik.outposts.tasks.outpost_local_connection", "timestamp": "2022-07-08T18:19:13.017514"}
{"event": "Task published", "level": "info", "logger": "authentik.root.celery", "pid": 8, "task_id": "23b32963-e66e-464d-9a3f-978ce97bca29", "task_name": "authentik.outposts.tasks.outpost_controller_all", "timestamp": "2022-07-08T18:19:13.018525"}
{"event": "Task published", "level": "info", "logger": "authentik.root.celery", "pid": 8, "task_id": "4afcebdc-9eaa-4209-af64-a65b61aa6644", "task_name": "authentik.providers.proxy.tasks.proxy_set_defaults", "timestamp": "2022-07-08T18:19:13.019494"}
{"event": "Task published", "level": "info", "logger": "authentik.root.celery", "pid": 8, "task_id": "913aae1e-e893-4e5f-84ca-aed453476819", "task_name": "authentik.managed.tasks.managed_reconcile", "timestamp": "2022-07-08T18:19:13.020462"}
{"event": "Task started", "level": "info", "logger": "authentik.root.celery", "pid": 10, "request_id": "task-a619e62a0c634951a83e7d2a0624042a", "task_id": "a619e62a-0c63-4951-a83e-7d2a0624042a", "task_name": "clear_update_notifications", "timestamp": "2022-07-08T18:19:13.145033"}
{"event": "Task finished", "level": "info", "logger": "authentik.root.celery", "pid": 10, "request_id": "task-a619e62a0c634951a83e7d2a0624042a", "state": "SUCCESS", "task_id": "a619e62a-0c63-4951-a83e-7d2a0624042a", "task_name": "clear_update_notifications", "timestamp": "2022-07-08T18:19:13.157729"}
{"event": "Task started", "level": "info", "logger": "authentik.root.celery", "pid": 12, "request_id": "task-49c02b904e7b4f688a1fb6a0f5d11528", "task_id": "49c02b90-4e7b-4f68-8a1f-b6a0f5d11528", "task_name": "outpost_local_connection", "timestamp": "2022-07-08T18:19:13.278517"}
{"event": "Task finished", "level": "info", "logger": "authentik.root.celery", "pid": 12, "request_id": "task-49c02b904e7b4f688a1fb6a0f5d11528", "state": "SUCCESS", "task_id": "49c02b90-4e7b-4f68-8a1f-b6a0f5d11528", "task_name": "outpost_local_connection", "timestamp": "2022-07-08T18:19:13.281736"}
{"event": "Task started", "level": "info", "logger": "authentik.root.celery", "pid": 13, "request_id": "task-913aae1ee8934e5f84caaed453476819", "task_id": "913aae1e-e893-4e5f-84ca-aed453476819", "task_name": "managed_reconcile", "timestamp": "2022-07-08T18:19:13.308577"}
{"event": "Task published", "level": "info", "logger": "authentik.root.celery", "pid": 13, "request_id": "task-913aae1ee8934e5f84caaed453476819", "task_id": "8f921df0-2745-4e6d-9b83-b5b0823f55da", "task_name": "authentik.outposts.tasks.outpost_post_save", "timestamp": "2022-07-08T18:19:13.331067"}
{"event": "Task finished", "level": "info", "logger": "authentik.root.celery", "pid": 13, "request_id": "task-913aae1ee8934e5f84caaed453476819", "state": "SUCCESS", "task_id": "913aae1e-e893-4e5f-84ca-aed453476819", "task_name": "managed_reconcile", "timestamp": "2022-07-08T18:19:13.426600"}
{"event": "Task started", "level": "info", "logger": "authentik.root.celery", "pid": 14, "request_id": "task-4afcebdc9eaa4209af64a65b61aa6644", "task_id": "4afcebdc-9eaa-4209-af64-a65b61aa6644", "task_name": "proxy_set_defaults", "timestamp": "2022-07-08T18:19:13.638166"}
{"event": "Task published", "level": "info", "logger": "authentik.root.celery", "pid": 14, "request_id": "task-4afcebdc9eaa4209af64a65b61aa6644", "task_id": "09df6382-7243-4dda-b76c-04148b25f181", "task_name": "authentik.outposts.tasks.outpost_post_save", "timestamp": "2022-07-08T18:19:13.668112"}
{"event": "Task finished", "level": "info", "logger": "authentik.root.celery", "pid": 14, "request_id": "task-4afcebdc9eaa4209af64a65b61aa6644", "state": "SUCCESS", "task_id": "4afcebdc-9eaa-4209-af64-a65b61aa6644", "task_name": "proxy_set_defaults", "timestamp": "2022-07-08T18:19:13.670962"}
{"event": "Task started", "level": "info", "logger": "authentik.root.celery", "pid": 15, "request_id": "task-09df638272434ddab76c04148b25f181", "task_id": "09df6382-7243-4dda-b76c-04148b25f181", "task_name": "outpost_post_save", "timestamp": "2022-07-08T18:19:13.679953"}
{"event": "Task finished", "level": "info", "logger": "authentik.root.celery", "pid": 15, "request_id": "task-09df638272434ddab76c04148b25f181", "state": "SUCCESS", "task_id": "09df6382-7243-4dda-b76c-04148b25f181", "task_name": "outpost_post_save", "timestamp": "2022-07-08T18:19:13.747521"}
{"event": "Task started", "level": "info", "logger": "authentik.root.celery", "pid": 18, "request_id": "task-8f921df027454e6d9b83b5b0823f55da", "task_id": "8f921df0-2745-4e6d-9b83-b5b0823f55da", "task_name": "outpost_post_save", "timestamp": "2022-07-08T18:19:13.950158"}
{"event": "Task published", "level": "info", "logger": "authentik.root.celery", "pid": 18, "request_id": "task-8f921df027454e6d9b83b5b0823f55da", "task_id": "40fc5d97-51cb-41a7-9937-0163140431ba", "task_name": "authentik.outposts.tasks.outpost_controller", "timestamp": "2022-07-08T18:19:13.969149"}
{"event": "Task finished", "level": "info", "logger": "authentik.root.celery", "pid": 18, "request_id": "task-8f921df027454e6d9b83b5b0823f55da", "state": "SUCCESS", "task_id": "8f921df0-2745-4e6d-9b83-b5b0823f55da", "task_name": "outpost_post_save", "timestamp": "2022-07-08T18:19:14.028241"}
{"event": "Task started", "level": "info", "logger": "authentik.root.celery", "pid": 19, "request_id": "task-40fc5d9751cb41a799370163140431ba", "task_id": "40fc5d97-51cb-41a7-9937-0163140431ba", "task_name": "outpost_controller", "timestamp": "2022-07-08T18:19:14.102926"}
{"event": "Task finished", "level": "info", "logger": "authentik.root.celery", "pid": 19, "request_id": "task-40fc5d9751cb41a799370163140431ba", "state": "SUCCESS", "task_id": "40fc5d97-51cb-41a7-9937-0163140431ba", "task_name": "outpost_controller", "timestamp": "2022-07-08T18:19:14.114605"}
{"event": "Task started", "level": "info", "logger": "authentik.root.celery", "pid": 22, "request_id": "task-23b32963e66e464d9a3f978ce97bca29", "task_id": "23b32963-e66e-464d-9a3f-978ce97bca29", "task_name": "outpost_controller_all", "timestamp": "2022-07-08T18:19:14.498391"}
{"event": "Task finished", "level": "info", "logger": "authentik.root.celery", "pid": 22, "request_id": "task-23b32963e66e464d9a3f978ce97bca29", "state": "SUCCESS", "task_id": "23b32963-e66e-464d-9a3f-978ce97bca29", "task_name": "outpost_controller_all", "timestamp": "2022-07-08T18:19:14.508164"}
Try clearing the redis cache (if the container has no storage mounted, you can just re-create it) and see if it still happens
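For reference, with the standard compose setup, recreating a non-persistent Redis container could look roughly like this (a minimal sketch, assuming the service is named redis in your compose file and has no volume mounted):

```shell
# Stop and remove the redis container, then bring it back up with an empty cache.
docker-compose rm -sf redis
docker-compose up -d redis
# Restart the worker so it reconnects to the fresh cache.
docker-compose restart worker
```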
Recreated both redis and the worker container, but nothing changed.
Same issue here. Running in a compose stack. No change after restarting redis (no persistent data).
The next beta build in ~20 minutes will include the commit above, which adds the ak worker-status command to the image. This can be used to run Flower, a web dashboard that shows which tasks are currently running. That should help identify which tasks are using all the resources.
Create a file called docker-compose.override.yaml in the folder of your compose file and paste this:
---
version: '3.4'

services:
  flower:
    image: ${AUTHENTIK_IMAGE:-ghcr.io/goauthentik/server}:${AUTHENTIK_TAG:-2022.8.1}
    restart: unless-stopped
    command: worker-status
    environment:
      AUTHENTIK_REDIS__HOST: redis
      AUTHENTIK_POSTGRESQL__HOST: postgresql
      AUTHENTIK_POSTGRESQL__USER: ${PG_USER:-authentik}
      AUTHENTIK_POSTGRESQL__NAME: ${PG_DB:-authentik}
      AUTHENTIK_POSTGRESQL__PASSWORD: ${PG_PASS}
      # AUTHENTIK_ERROR_REPORTING__ENABLED: "true"
      # WORKERS: 2
    env_file:
      - .env
    ports:
      - "0.0.0.0:${AUTHENTIK_PORT_HTTP:-9001}:9000"
Afterwards, you can access port 9001 and see which tasks are running
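If you prefer the command line over the dashboard, Flower also exposes a small HTTP API alongside the web UI; a quick check could look like this (a sketch, assuming the override above and the default host port 9001):

```shell
# Ask Flower for the tasks it currently knows about (JSON output).
curl -s http://localhost:9001/api/tasks
```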
Thank you, but where do I find beta images? Do I have to use a special image tag different from 2022.8.1? Obviously I'm running on an "old" 2022.8.1 image, right?
sso-flower-1 | {"event": "Loaded config", "level": "debug", "logger": "authentik.lib.config", "timestamp": 1660853410.274422, "file": "/authentik/lib/default.yml"}
sso-flower-1 | {"event": "Loaded environment variables", "level": "debug", "logger": "authentik.lib.config", "timestamp": 1660853410.2749135, "count": 22}
sso-flower-1 | {"event": "Booting authentik", "level": "info", "logger": "authentik.root.settings", "timestamp": 1660853410.958196, "version": "2022.8.1"}
sso-flower-1 | {"event": "Failed to load GeoIP database", "exc": "FileNotFoundError(2, 'No such file or directory')", "level": "warning", "logger": "authentik.events.geo", "pid": 10, "timestamp": "2022-08-18T20:10:11.277132"}
sso-flower-1 | {"event": "Task published", "level": "info", "logger": "authentik.root.celery", "pid": 10, "task_id": "b7c5eadb-dc39-427a-8adf-e751298c62c9", "task_name": "authentik.blueprints.v1.tasks.blueprints_discover", "timestamp": "2022-08-18T20:10:11.709301"}
sso-flower-1 | Unknown command: 'worker-status'
sso-flower-1 | Type 'manage.py help' for usage.
sso-flower-1 exited with code 1
But it looks like there's something wrong with the path to the GeoIP database. I'll check the path.
Even using a new and clean volume will produce the error Failed to load GeoIP database.
See https://goauthentik.io/docs/installation/beta for beta images; also, the GeoIP error should be unrelated to this issue.
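Switching a compose deployment to the beta images generally comes down to pointing the image variables in .env at the preview build; the exact image name and tag are documented at the link above, so treat the values below as hypothetical placeholders to verify there:

```shell
# Hypothetical .env entries for the beta/preview image; confirm the current
# image name and tag in the beta installation docs before relying on them.
AUTHENTIK_IMAGE=ghcr.io/goauthentik/dev-server
AUTHENTIK_TAG=gh-next
```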
That did the trick.
After about 3 minutes Flower has shown over 1,100 successful tasks; almost every task is named authentik.blueprints.v1.tasks.apply_blueprint, and they are fired at least twice per second.
Processed number of completed tasks:

| Task | Count |
|---|---|
| authentik.events.tasks.event_notification_handler | 2 |
| authentik.events.tasks.event_trigger_handler | 11 |
| authentik.core.tasks.clean_temporary_users | 5 |
| authentik.policies.reputation.tasks.save_reputation | 7 |
| authentik.blueprints.v1.tasks.apply_blueprint | 1043 |
| authentik.core.tasks.clean_expired_models | 4 |
| authentik.outposts.tasks.outpost_service_connection_monitor | 2 |
| authentik.sources.plex.tasks.check_plex_token_all | 1 |
| authentik.blueprints.v1.tasks.blueprints_discover | 26 |
| authentik.outposts.tasks.outpost_post_save | 15 |
| authentik.outposts.tasks.outpost_controller | 5 |
| authentik.events.tasks.notification_cleanup | 1 |
| authentik.outposts.tasks.outpost_local_connection | 4 |
| authentik.admin.tasks.clear_update_notifications | 3 |
| authentik.outposts.tasks.outpost_controller_all | 3 |
| authentik.providers.proxy.tasks.proxy_set_defaults | 3 |
| authentik.outposts.tasks.outpost_service_connection_state | 2 |
| authentik.crypto.tasks.certificate_discovery | 1 |
About 2 minutes later there were only a few tasks per minute, and the CPU usage looked fine again.
That is very interesting. I assume you upgraded to 8.1 and haven't run a worker since the upgrade, so a lot of those might have been queued? Does the pattern of CPU load repeat afterwards?
Almost right. I upgraded from 7.3 to 8.1 and hadn't run a worker. But I had to switch to the beta image, and after bringing the stack up again there were still about 1,100 tasks in the queue.
It looks like all the tasks completed successfully. The CPU load returned to its usual level after the "boot storm". ;-) The heavy, continuous CPU load pattern does not repeat anymore. Worker and Flower are still running and everything looks fine.
Shall I switch back to the stable release / image, or do you need further information for investigation?
Feel free to run whichever version. I'm curious to see whether the problem happens again on 2022.8.1, since that was still causing issues from what I understood?
No change on my side running the beta image. CPU still sits at 100%, without Flower showing any active tasks, only 11 succeeded.
docker stats da0b8a6a3212
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
da0b8a6a3212 authentik_worker-1 99.92% 368.4MiB / 15.51GiB 2.32% 1.62MB / 2.28MB 0B / 8.19kB 17
docker top da0b8a6a3212
PID PPID C STIME TTY TIME CMD
357810 357723 0 17:14 ? 00:00:00 /usr/local/bin/dumb-init -- /lifecycle/ak worker
358594 357810 0 17:14 ? 00:00:00 /bin/bash -e /lifecycle/ak worker
359387 358594 4 17:14 ? 00:00:16 /usr/local/bin/python /usr/local/bin/celery -A authentik.root.celery worker -Ofair --max-tasks-per-child=1 --autoscale 3,1 -E -B -s /tmp/celerybeat-schedule -Q authentik,authentik_scheduled,authentik_events
359891 359387 97 17:14 ? 00:06:25 /usr/local/bin/python /usr/local/bin/celery -A authentik.root.celery worker -Ofair --max-tasks-per-child=1 --autoscale 3,1 -E -B -s /tmp/celerybeat-schedule -Q authentik,authentik_scheduled,authentik_events
361219 357723 0 17:15 pts/0 00:00:00 bash
361552 361219 0 17:15 pts/0 00:00:00 /bin/bash -e /lifecycle/ak worker-status
361555 361552 1 17:15 pts/0 00:00:03 /usr/local/bin/python /usr/local/bin/celery -A authentik.root.celery flower --port=9000
361823 359387 0 17:16 ? 00:00:00 /usr/local/bin/python /usr/local/bin/celery -A authentik.root.celery worker -Ofair --max-tasks-per-child=1 --autoscale 3,1 -E -B -s /tmp/celerybeat-schedule -Q authentik,authentik_scheduled,authentik_events
Switching to beta and recreating the redis container did solve the issue. However, it does appear that over a hundred blueprint tasks are being spawned and completed on every container start. Should this be happening every time?
Follow-up: CPU spiked up to 300% (100% on 3 cores) when I came back to it today. Flower was already disabled so I couldn't look at it, so I've re-enabled it and recreated the containers to investigate later if it happens again.
Results afterwards: 300% again, with these tasks (unchanging for a while):
Same issue here. Running 2022.8.2 stable. Deleting Redis cache did not change anything.
Upgrading to 2022.09 solved the issue for me.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This problem is still happening for me; I'm getting an endless loop of
12:04:27.998476 close(1033778868) = -1 EBADF (Bad file descriptor) <0.000003>
in strace. Not sure what's going on there.
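For anyone wanting to check for the same loop, one way to capture a trace like this is to attach strace to the celery process on the host (a sketch; <celery_pid> is a placeholder for the PID that docker top reports on your machine):

```shell
# Find the host PID of the celery worker process inside the container.
docker top authentik_worker
# Attach to that PID, follow child processes, and watch close() calls,
# printing the time spent in each call.
sudo strace -f -p <celery_pid> -e trace=close -T
```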
What version are you running? Can you restart the container with Flower enabled as described in this thread and see if there are many events running?
You can also try to disable all blueprints and restart the container to see if the issue is still present. If it is still happening, the problem is likely unrelated to this issue (which seemed to be mostly caused by blueprint application tasks).
I've opened a new issue for this at #4707 . Flower reports no tasks running. It appears to be an issue specifically with celerybeat.
Well, reverting to version 2023.2 solves the problem for me.
Still an issue with 2023.8.3
https://github.com/goauthentik/authentik/issues/7025