root: fix 100% CPU for worker container (#7025)
Some Linux users (on Arch Linux, for example) run Docker with a default service file that sets NOFILE to infinity. This causes Celery to hang for hours or even days at 100% CPU while it closes all fds by enumerating from NOFILE down to 3.
This commit overrides the ulimit for the container without touching the user's Docker service configuration.
For details see #7025
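To give a feel for the scale involved, here is a minimal sketch of the arithmetic only. It is not Celery's actual code; `close_attempts` is a hypothetical helper that counts how many descriptors a naive "walk from the limit down to 3" loop would visit, using two limits quoted elsewhere in this thread.

```python
def close_attempts(nofile_limit: int, lowest_fd: int = 3) -> int:
    """Count the descriptors a naive loop visits when it walks from the
    limit down to lowest_fd and tries to close each one.

    Hypothetical helper for illustration; it closes nothing.
    """
    return max(nofile_limit - lowest_fd, 0)


# Limits quoted in this thread: a common default and 0x3ffffff8.
TYPICAL_LIMIT = 1_048_576
HUGE_LIMIT = 0x3FFFFFF8

typical = close_attempts(TYPICAL_LIMIT)
huge = close_attempts(HUGE_LIMIT)

print(f"limit {TYPICAL_LIMIT}: {typical} close attempts")
print(f"limit {HUGE_LIMIT}: {huge} close attempts")
print(f"ratio: about {huge // typical}x more work")
```

Each attempt on an unused descriptor is cheap, but roughly a billion of them, repeated whenever the cleanup runs, would plausibly account for a core being pinned for a long time.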
Tried to ping someone, waiting for feedback.
Meanwhile, you can easily reproduce this by setting the ulimit to 0x3ffffff8 (1073741816 in decimal). That should kinda prove it.
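For anyone who wants to try the reproduction, this small sketch converts the hexadecimal value and formats it for Docker's `--ulimit nofile=<soft>:<hard>` option. Using the same value for both the soft and the hard limit is an assumption made here for illustration.

```python
# Convert the limit mentioned above from hexadecimal to decimal.
limit = int("0x3ffffff8", 16)
print(limit)  # 1073741816

# The value is 8 below 2**30.
assert limit == 2**30 - 8

# Format it for `docker run`; soft == hard is an assumption for this sketch.
flag = f"--ulimit nofile={limit}:{limit}"
print(flag)
```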
For reference:
Here's what I have at home:
$ ulimit -Sn
1048576
# ulimit -Hn
1048576
And we have the same in production at authentik
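The same two numbers can be read from inside a Python process, which is handy for checking what a container actually received. This is a minimal sketch using the standard `resource` module (Unix only); `describe` is a hypothetical helper name.

```python
import resource


def describe(value: int) -> str:
    """Render a limit the way `ulimit` does, with 'unlimited' for infinity."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


# RLIMIT_NOFILE is the limit reported by `ulimit -n`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

print("soft (ulimit -Sn):", describe(soft))
print("hard (ulimit -Hn):", describe(hard))
```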
Fixes the issue for me.
OS: clear-linux
# ulimit -Sn
1024
# ulimit -Hn
524288
Does anyone know how I can change this for my authentik worker on Unraid? It's been running at high CPU usage for days now. Help is appreciated.
Reporting back (Unraid, solved): in hindsight I did three things, and I'm not sure which one solved it. 1) In the Unraid template, I added "--ulimit nofile=10240:10240" as a flag in the Extra Parameters field (advanced view). 2) I redeployed both the worker and authentik, removing containers and images. 3) I added AUTHENTIK_REDIS__DB:1 as a variable to the Unraid template for both the worker and authentik. Now everything seems normal.
Do you know why setting ulimit to a larger number fixes the issue? Is it an issue in Celery or Authentik?
With #7810, #8440 and #7813, this shouldn't be an issue anymore. Could you check this again with 2024.2.2? @cenkalti @mobiledude @Leptopoda @DKingAlpha
Thanks for coming back. I can confirm that I no longer need the workaround on 2024.02.2
I still get high CPU usage with the latest 2024.2.2. Adding ulimit back to compose.yml fixed the issue for me.
The live Python profiler py-spy is incompatible with recent Python 3.12, so I will find another way to identify the issue when I have time.
I still experience high CPU usage with the latest 2024.2.2 version. However, I was able to resolve the issue by adding ulimit back to the 'compose.yml' file.
I can confirm that it fixed my setup too, so it's IMHO worth merging :tada:
New authentik user here. I tried resetting Redis and setting ulimits in docker-compose, but unfortunately the CPU still spiked to 100%. After some more troubleshooting I increased the RAM allocation to the VM (while still leaving the custom ulimits in place), and suddenly it all started to work, with no CPU spikes.
Same here with 2024.4.2, ulimits in the compose fixed the issue.
Adding ulimits back to compose fixed my issue on 2024.4.2.
For context, the reason why we haven't merged this PR:
- Configuring ulimit values for containers is possible with compose but not with Kubernetes, so it would only solve half the problem.
- It's also more of a band-aid than a full solution. This ulimit adjustment shouldn't be required, and changing the values seems to merely prevent a bug from being triggered, whether that bug is in our code, our usage of Celery, or Celery itself, instead of fixing the root cause.
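For readers applying the compose-side workaround themselves in the meantime, the sketch below prints an override fragment using Compose's `ulimits` key. The service name `worker` and the value 10240 (taken from the Unraid report above) are assumptions; `compose_ulimits` is a hypothetical helper, and this is not the exact change proposed in this PR.

```python
def compose_ulimits(service: str, soft: int, hard: int) -> str:
    """Render a Compose override fragment that caps nofile for one service."""
    if soft > hard:
        raise ValueError("soft limit must not exceed hard limit")
    return (
        "services:\n"
        f"  {service}:\n"
        "    ulimits:\n"
        "      nofile:\n"
        f"        soft: {soft}\n"
        f"        hard: {hard}\n"
    )


# Service name and values are assumptions; adjust them to your deployment.
print(compose_ulimits("worker", 10240, 10240), end="")
```

The printed fragment can be merged into the existing compose file or kept in a separate override file. As the first point notes, it does nothing for Kubernetes deployments.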
I can confirm this fixed my issue for docker running on Oracle Linux 9.4
Just deployed a fresh compose install on current Arch Linux using 2024.6.3. Celery was pinning one core at 100%. Setting ulimits for nofile resolved the issue.
I'm not experiencing this in k8s running on Talos.
Thanks! This fixed the issue for me as well.
Fix worked for me as well
With 2025.8 we no longer use Celery, so this shouldn't be required anymore.