django-DefectDojo
uWSGI OOMKilled on Kubernetes
Bug description
Deploying DefectDojo to a Kubernetes cluster causes the uWSGI container to consume a large amount of memory, resulting in the node killing the pod. This is due to the effectively unbounded number of file descriptors available on the node. See https://github.com/unbit/uwsgi/issues/2299 for a description of the underlying uWSGI issue.
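As a quick sanity check, the file descriptor limit the container actually inherits can be read from /proc inside the pod. The pod and container names below are the ones from the logs in this report; adjust them for your own release:

```sh
# Read the fd limit of the uwsgi process (PID 1) inside the container.
# Pod/container names match this report; adjust for your own deployment.
kubectl exec defect-dojo-defectdojo-django -c uwsgi -- grep 'open files' /proc/1/limits
```

On the affected nodes this is expected to show a limit on the order of 1073741816, matching what uWSGI reports in the logs below.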
Steps to reproduce
- Deploy the Helm chart to a Kubernetes cluster with nodes running Flatcar Container Linux by Kinvolk 3602.2.1 (Oklo).
- Watch the pod get deployed and, after <15 seconds, get killed by the node due to OOM.
Expected behavior
Expected the pod to start up and not get OOMKilled by the node.
I locally built my own container, adding the --max-fd argument to docker/entrypoint-uwsgi.sh, and used that image in my cluster; this resolved the issue.
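For context, a sketch of that local workaround is shown below. The non-flag uwsgi options are reconstructed from the log output further down, not copied from the upstream docker/entrypoint-uwsgi.sh; --max-fd itself is a real uWSGI flag:

```sh
#!/bin/sh
# Sketch of the local workaround: cap uWSGI's fd limit explicitly instead of
# inheriting the node's very high default. Options other than --max-fd are
# illustrative placeholders based on the logged ports/sockets, not the actual
# upstream entrypoint script.
exec uwsgi \
  --http 0.0.0.0:8081 \
  --socket /run/defectdojo/uwsgi.sock \
  --max-fd 1048576 \
  "$@"
```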
Deployment method (select with an X)
- [ ] Docker Compose
- [X] Kubernetes
- [ ] GoDojo
Environment information
- Kubernetes nodes running:
Kernel Version: 5.15.136-flatcar
OS Image: Flatcar Container Linux by Kinvolk 3602.2.1 (Oklo)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.21
Kubelet Version: v1.28.3
Kube-Proxy Version: v1.28.3
- DefectDojo version: 2.30.4
Logs
Logs from the defectdojo-django pod:
$ k logs defect-dojo-defectdojo-django
Defaulted container "uwsgi" out of: uwsgi, nginx
[13/Feb/2024 08:50:57] INFO [dojo.models:4295] enabling audit logging
/usr/local/lib/python3.11/site-packages/coreapi/codecs/download.py:5: DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13
import cgi
System check identified no issues (0 silenced).
*** Starting uWSGI 2.0.23 (64bit) on [Tue Feb 13 08:50:58 2024] ***
compiled with version: 10.2.1 20210110 on 29 January 2024 15:50:06
os: Linux-5.15.136-flatcar #1 SMP Mon Oct 23 16:44:45 -00 2023
nodename: defect-dojo-defectdojo-django
machine: x86_64
clock source: unix
detected number of CPU cores: 4
current working directory: /app
detected binary path: /usr/local/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1073741816
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uWSGI http bound on 0.0.0.0:8081 fd 3
spawned uWSGI http 1 (pid: 13)
uwsgi socket 0 bound to UNIX address /run/defectdojo/uwsgi.sock fd 6
Python version: 3.11.4 (main, Aug 16 2023, 05:31:52) [GCC 10.2.1 20210110]
Python main interpreter initialized at 0x7fb82cac7558
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 405672 bytes (396 KB) for 15 cores
*** Operational MODE: preforking+threaded ***
Note that uWSGI logs "detected max file descriptor number: 1073741816", which causes the container to use a lot of memory.
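To get a feel for the scale, even a tiny per-descriptor bookkeeping cost multiplied by roughly 10^9 descriptors is huge. The 16 bytes per fd used below is a purely hypothetical figure for illustration, not uWSGI's actual per-fd cost:

```sh
# Hypothetical back-of-the-envelope comparison (16 bytes/fd is an assumption):
echo "$((1073741816 * 16 / 1024 / 1024)) MiB"  # Flatcar node limit -> ~16383 MiB (~16 GiB)
echo "$((1048576 * 16 / 1024 / 1024)) MiB"     # kind node limit    -> 16 MiB
```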
Running the same deployment locally on my kind cluster, I get:
Defaulted container "uwsgi" out of: uwsgi, nginx
[16/Feb/2024 08:57:04] INFO [dojo.models:4295] enabling audit logging
System check identified no issues (0 silenced).
*** Starting uWSGI 2.0.23 (64bit) on [Fri Feb 16 08:57:05 2024] ***
compiled with version: 11.2.1 20220219 on 05 February 2024 16:57:27
os: Linux-6.5.11-linuxkit #1 SMP PREEMPT Wed Dec 6 17:08:31 UTC 2023
nodename: defect-dojo-defectdojo-django-7774dcb687-gn5wn
machine: aarch64
clock source: unix
detected number of CPU cores: 10
current working directory: /app
detected binary path: /usr/local/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uWSGI http bound on 0.0.0.0:8081 fd 3
spawned uWSGI http 1 (pid: 17)
uwsgi socket 0 bound to UNIX address /run/defectdojo/uwsgi.sock fd 6
Python version: 3.11.3 (main, May 3 2023, 08:27:37) [GCC 11.2.1 20220219]
Python main interpreter initialized at 0xffffa64d55c0
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 183136 bytes (178 KB) for 4 cores
*** Operational MODE: preforking+threaded ***
[16/Feb/2024 08:57:05] INFO [dojo.models:4295] enabling audit logging
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0xffffa64d55c0 pid: 1 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI worker 1 (pid: 1, cores: 2)
spawned uWSGI worker 2 (pid: 18, cores: 2)
Here we see "detected max file descriptor number: 1048576", which is much lower and does not result in an OOMKilled event.
Suggestion
Add the option to include the --max-fd argument with a configurable value in the docker/entrypoint-uwsgi.sh script, so that it can be set to a lower value, e.g. 1048576.
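A minimal sketch of what that could look like, assuming an environment variable whose name and default are hypothetical placeholders (the actual knob would be whatever the PR ends up adding):

```sh
#!/bin/sh
# Hypothetical sketch: make the fd cap configurable. DD_UWSGI_MAX_FD is a
# placeholder variable name, and the non-flag options are reconstructed from
# the logs above, not the real entrypoint.
MAX_FD="${DD_UWSGI_MAX_FD:-1048576}"

exec uwsgi \
  --http 0.0.0.0:8081 \
  --socket /run/defectdojo/uwsgi.sock \
  --max-fd "$MAX_FD" \
  "$@"
```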
Can you please open it as a PR? It looks like you already know the solution :)
Sure! Just wanted to make sure you were willing to accept it 👍
I'm not a moderator (just a regular member of the community), but from what I see, deeper discussion usually happens under an open PR. Based on your description, I suppose your fix is quite small (easy to implement), so feel free to do it this way.
Thanks for the insights, I will open a PR 🚀
@hoeg To add on to what @kiblik said - for Helm in particular, we're trying to keep it at a 'generic framework for deploying DefectDojo' level - not opinionated too much in any particular direction.
I know we've pushed back on very specific k8s/Helm changes that pushed the Helm chart towards only working on a specific vendor's cloud or a specific tech choice (like HA vs non-HA DB).
So, please keep this in mind when creating that PR. We're a project with a very broad community who deploy DefectDojo on everything from a laptop running Kali Linux to auto-scaling k8s and we try to keep a balance between those deployment choices in what we accept into the main repo.
For corner cases or very vendor-specific things, we'd prefer the ability to opt in to that choice while keeping the current default.
Anyway, that's how we try to balance a specific community member need vs the broader community. HTH.
After removing the CPU limit on the uwsgi container and increasing the memory limit to 3Gi, I have no more OOMKills.
Now my memory usage does not go higher than 1Gi.
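For reference, that kind of resource override could be applied at install time roughly like this. The value paths are assumptions about the chart layout; check the chart's values.yaml for the exact keys:

```sh
# Assumed values paths; verify against helm/defectdojo/values.yaml before use.
helm upgrade --install defect-dojo ./helm/defectdojo \
  --set "django.uwsgi.resources.limits.memory=3Gi" \
  --set "django.uwsgi.resources.limits.cpu=null"   # setting a value to null removes the CPU limit
```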
Seems like this OOMKill issue is a k8s config issue. Closing this.
For future readers of this thread: the best place to get advice on running DefectDojo is the OWASP Slack, which has a broad and active community. Info on Slack is at https://github.com/DefectDojo/django-DefectDojo?tab=readme-ov-file#community-getting-involved-and-updates
Increasing the memory limit, or even removing the limit, does not help in our case since it exhausts all the memory of the node. This happens even when the pod gets an entire node to itself. We are running a relatively standard setup using Flatcar images on EC2 nodes. Thus, it would be great if we could control the number of file descriptors used by uWSGI, as proposed in https://github.com/DefectDojo/django-DefectDojo/pull/9564
@tmablunar You're welcome to open a new PR based on the one you referenced/linked :point_up:
For reference, the new PR mentioned above is here: https://github.com/DefectDojo/django-DefectDojo/pull/10384