Task processor pod restarts and memory leak (probably caused by many unhealthy threads)

Open ssichynskyi opened this issue 1 year ago • 1 comments

How are you running Flagsmith

  • [ ] Self Hosted with Docker
  • [X] Self Hosted with Kubernetes
  • [ ] SaaS at flagsmith.com
  • [ ] Some other way (add details in description below)

Describe the bug

This issue is quite complex and may need to be split into several issues, but I am posting it as one for now. It is also possible that something is wrong with my config.

Setup

  • v.0.22.3 (latest so far)
  • Flagsmith is hosted on k8s using the Helm chart. The YAML file is attached with sensitive data removed and renamed to .txt to meet the attachment requirements: flagsmith.txt
  • External DB is used.
...
2024-01-26 10:59:37.415	task_processor.thread_monitoring WARNING  Writing unhealthy threads: ['Thread-1']
2024-01-26 10:58:55.408	task_processor.thread_monitoring WARNING  Writing unhealthy threads: ['Thread-4']
2024-01-26 10:39:07.089	task_processor.thread_monitoring WARNING  Writing unhealthy threads: ['Thread-2', 'Thread-3', 'Thread-4', 'Thread-5']
2024-01-26 10:38:22.081	task_processor.thread_monitoring WARNING  Writing unhealthy threads: ['Thread-3']
2024-01-26 10:36:06.942	task_processor.thread_monitoring WARNING  Writing unhealthy threads: ['Thread-4']
2024-01-26 10:35:22.934	task_processor.thread_monitoring WARNING  Writing unhealthy threads: ['Thread-1']
...

Symptoms

  1. Very low stability of the pod running 'task-processor'. It seems to restart more often when it is in use. The average restart rate is ~3-4 per hour (see the screenshot from GCP). Restarts are done by k8s and are caused by failed health checks.
  2. When the pod does not restart, its memory consumption looks like a severe memory leak (see the screenshot from GCP).
  3. There are no meaningful logs that would help identify which usage (if any) is causing the problem.

Recovery

After disabling the task processor, the Flagsmith instance works fine. There are no visible changes in either the functionality or the memory consumption of the 2 other pods. So it looks like the task processor only causes problems and provides no benefit.

        taskProcessor:
          enabled: true

Steps To Reproduce

  1. Set up Flagsmith in k8s with the task processor enabled
  2. Use normally (some API calls coming from mobile devices)

Expected behavior

  • No large numbers of unhealthy threads. If they are caused by certain calls, clear error messages in the logs
  • No task processor pod restarts
  • No memory leaks inside the task processor pod

Screenshots

pod-restarts memory-consumption

ssichynskyi avatar Jan 26 '24 12:01 ssichynskyi

Dear Flagsmith team, do you have any update/idea about the root cause of this issue?

ssichynskyi avatar Mar 26 '24 10:03 ssichynskyi

As discussed with @khvn26, we could not reproduce this when pods are correctly configured with limits and requests. We will be adding default values for limits and requests in a future release. See issue here for context.
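
For anyone hitting the same symptoms before those defaults ship, a minimal sketch of what explicit requests and limits might look like in the Helm values is below. The `resources` key under `taskProcessor` is an assumption about the chart layout and is not confirmed in this thread, and the CPU and memory numbers are placeholders rather than recommended values. Only the `requests`/`limits` shape itself is standard Kubernetes.

```yaml
# Hypothetical values fragment: the `resources` key location is an assumption
# about the chart, and all numbers below are illustrative placeholders.
taskProcessor:
  enabled: true
  resources:
    requests:
      cpu: 100m        # amount the scheduler reserves for the pod
      memory: 256Mi
    limits:
      cpu: 500m        # CPU is throttled above this
      memory: 512Mi    # the container is OOM-killed above this
```

With a memory limit set, a runaway process is killed and restarted at a known threshold instead of growing until the node is under pressure, which also makes a real leak easier to tell apart from ordinary unbounded usage.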

matthewelwell avatar Jun 11 '24 15:06 matthewelwell