root: fix 100% CPU for worker container (#7025)

Open DKingAlpha opened this issue 2 years ago • 18 comments

Some Linux users (Arch Linux, for example) run Docker with the default service file, which sets the NOFILE limit to infinity. This causes Celery to hang for hours to days at 100% CPU while it closes all file descriptors by enumerating from the NOFILE limit down to 3.

This commit overrides the ulimit for the container without touching the user's Docker service configuration.

For details see #7025
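
To give a feel for the scale, here is a rough sketch that only counts how many close attempts such a loop would make. It assumes the cleanup tries one close per descriptor number between 3 and the limit, as described above; the function name close_attempts is made up for illustration and this is not Celery's actual code.

```python
import resource


def close_attempts(nofile_limit: int, first_fd: int = 3) -> int:
    """Count the close() calls made by a loop that tries every fd number
    from first_fd up to nofile_limit - 1 (hypothetical cleanup loop)."""
    return max(nofile_limit - first_fd, 0)


if __name__ == "__main__":
    # Soft limit of the current process (what `ulimit -Sn` reports).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"current soft limit: {soft}")

    # Limits that appear in this thread: 1024, 1048576 and 0x3ffffff8.
    for limit in (1024, 1048576, 0x3FFFFFF8):
        print(f"limit {limit}: {close_attempts(limit)} close attempts")
```

The work grows linearly with the limit, which is why a very large NOFILE value hurts even when only a handful of descriptors are actually open.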

DKingAlpha avatar Dec 03 '23 17:12 DKingAlpha

Deploy Preview for authentik-storybook failed.

Name Link
Latest commit 3b0a1ac931d51dffd3b864bd8697a45cde2f6fcd
Latest deploy log https://app.netlify.com/sites/authentik-storybook/deploys/656cc0d92b027400084688ae

netlify[bot] avatar Dec 03 '23 17:12 netlify[bot]

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 92.64%. Comparing base (2bc4506) to head (3b0a1ac). :warning: Report is 6102 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7762      +/-   ##
==========================================
+ Coverage   92.62%   92.64%   +0.02%     
==========================================
  Files         588      588              
  Lines       29141    29141              
==========================================
+ Hits        26991    26997       +6     
+ Misses       2150     2144       -6     
Flag          Coverage Δ
e2e           50.72% <ø> (+0.02%) :arrow_up:
integration   25.94% <ø> (ø)
unit          89.71% <ø> (ø)

codecov[bot] avatar Dec 03 '23 18:12 codecov[bot]

Tried to ping someone, waiting for feedback.

Meanwhile, you can easily reproduce this by setting the nofile ulimit to 0x3ffffff8 (1073741816 in decimal). That should more or less prove it.
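
A small sketch for checking the numbers from inside a container: it converts the hex value quoted above to decimal and reads the limits of the current process with Python's standard resource module. The names REPRO_LIMIT and describe_nofile are illustrative only.

```python
import resource

# The hex value quoted above, converted to decimal.
REPRO_LIMIT = int("0x3ffffff8", 16)


def describe_nofile() -> dict:
    """Return the soft and hard RLIMIT_NOFILE of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "soft": soft,
        "hard": hard,
        # True when the process runs with the limit used for the repro.
        "matches_repro": soft == REPRO_LIMIT,
    }


if __name__ == "__main__":
    print(REPRO_LIMIT)
    print(describe_nofile())
```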

DKingAlpha avatar Dec 03 '23 18:12 DKingAlpha

For reference:

Here's what I have at home:

$ ulimit -Sn
1048576
# ulimit -Hn
1048576

And we have the same in production at authentik

rissson avatar Dec 04 '23 04:12 rissson

Fixes the issue for me.

OS: clear-linux
# ulimit -Sn
1024
# ulimit -Hn
524288

Does anyone know how I can change this for my authentik worker on Unraid? It has been running at high CPU usage for days now. Help is appreciated.

Reporting back (Unraid, solved): In hindsight I did three things, and I'm not sure which one solved it. 1) In the Unraid template I added "--ulimit nofile=10240:10240" as a flag in the Extra Parameters field (advanced view). 2) I redeployed both the worker and authentik, removing the containers and images. 3) I added AUTHENTIK_REDIS__DB:1 as a variable to the Unraid template for both the worker and authentik. Now everything seems normal.

mobiledude avatar Jan 03 '24 07:01 mobiledude

Do you know why setting ulimit to a larger number fixes the issue? Is it an issue in Celery or Authentik?

cenkalti avatar Jan 03 '24 22:01 cenkalti

With #7810, #8440 and #7813 this shouldn't be an issue anymore. Could you check this again with 2024.2.2? @cenkalti @mobiledude @Leptopoda @DKingAlpha

BeryJu avatar Mar 15 '24 16:03 BeryJu

Thanks for coming back. I can confirm that I no longer need the workaround on 2024.02.2

Leptopoda avatar Mar 15 '24 23:03 Leptopoda

I still get high CPU usage with the latest 2024.2.2. Adding ulimit back to compose.yml fixed the issue for me.

The live Python profiler py-spy is incompatible with recent Python 3.12, so I will find another way to identify the issue when I have time.

DKingAlpha avatar Mar 18 '24 17:03 DKingAlpha

I still experience high CPU usage with the latest 2024.2.2 version. However, I was able to resolve the issue by adding ulimit back to the 'compose.yml' file.

cenkalti avatar Mar 18 '24 18:03 cenkalti

I can confirm that it fixed my setup too, so IMHO it's worth merging :tada:

MyIgel avatar Mar 20 '24 10:03 MyIgel

New authentik user here. I tried resetting Redis and setting ulimits in docker-compose, but unfortunately the CPU still spiked to 100%. After some more troubleshooting I increased the RAM allocation of the VM (while leaving the custom ulimits in place), and suddenly it all started to work, with no CPU spikes.

SpiderD555 avatar Mar 26 '24 12:03 SpiderD555

Same here with 2024.4.2, ulimits in the compose fixed the issue.

Janhouse avatar May 13 '24 00:05 Janhouse

Adding ulimits back to compose fixed my issue on 2024.4.2.

arthurlockman avatar Jun 01 '24 06:06 arthurlockman

For context, the reason why we haven't merged this PR:

  • Configuring ulimit values for containers is possible with Compose but not with Kubernetes, so it would only solve half the problem.
  • It's also more of a band-aid than a full solution. This ulimit adjustment shouldn't be required; changing the values seems to merely prevent a bug (in our code, in our usage of Celery, or in Celery itself) from occurring, rather than fixing the root cause.
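
For illustration only, and not something authentik is stated to do: a process can lower its own soft NOFILE limit at startup with Python's standard resource module, which does not depend on how the container runtime is configured. The helper name cap_soft_nofile is hypothetical, and the default of 10240 is simply the value reported earlier in this thread.

```python
import resource


def cap_soft_nofile(cap: int = 10240) -> tuple[int, int]:
    """Lower the soft RLIMIT_NOFILE to `cap` if it is currently higher.

    Hypothetical helper. The hard limit is left untouched, so no extra
    privileges are needed. Returns the (soft, hard) limits afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft > cap:
        # Lowering the soft limit is always allowed for the calling process.
        resource.setrlimit(resource.RLIMIT_NOFILE, (cap, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(cap_soft_nofile())
```

Like the Compose setting, this would only sidestep the slow cleanup rather than address its root cause.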

BeryJu avatar Jun 01 '24 08:06 BeryJu

I can confirm this fixed my issue for docker running on Oracle Linux 9.4

ForsakenRei avatar Jun 05 '24 13:06 ForsakenRei

Just deployed a fresh Compose install on current Arch Linux using 2024.6.3. Celery was taking one core to 100%. Setting ulimits for nofile resolved the issue.

I'm not experiencing this in k8s running on Talos.

cubic3d avatar Aug 08 '24 12:08 cubic3d

Thanks! This fixed the issue for me as well.

justin8 avatar Oct 09 '24 10:10 justin8

Fix worked for me as well

maxnoe avatar Jan 25 '25 14:01 maxnoe

With 2025.8 we no longer use Celery, so this shouldn't be required anymore.

BeryJu avatar Oct 10 '25 11:10 BeryJu