Task processor pod restarts and memory leak (probably caused by many unhealthy threads)
How are you running Flagsmith
- [ ] Self Hosted with Docker
- [X] Self Hosted with Kubernetes
- [ ] SaaS at flagsmith.com
- [ ] Some other way (add details in description below)
Describe the bug
This issue is quite complex and could perhaps be split into several separate issues, but I am posting it as one for now. Could something be wrong with the config?
Setup
- v.0.22.3 (latest so far)
- Flagsmith is hosted on k8s using the Helm chart. The YAML file was renamed to .txt to meet the attachment requirements, with sensitive data removed: flagsmith.txt
- External DB is used.
...
2024-01-26 10:59:37.415 task_processor.thread_monitoring WARNING Writing unhealthy threads: ['Thread-1']
2024-01-26 10:58:55.408 task_processor.thread_monitoring WARNING Writing unhealthy threads: ['Thread-4']
2024-01-26 10:39:07.089 task_processor.thread_monitoring WARNING Writing unhealthy threads: ['Thread-2', 'Thread-3', 'Thread-4', 'Thread-5']
2024-01-26 10:38:22.081 task_processor.thread_monitoring WARNING Writing unhealthy threads: ['Thread-3']
2024-01-26 10:36:06.942 task_processor.thread_monitoring WARNING Writing unhealthy threads: ['Thread-4']
2024-01-26 10:35:22.934 task_processor.thread_monitoring WARNING Writing unhealthy threads: ['Thread-1']
...
Symptoms
- Very low stability of the pod on which 'task-processor' is running. It seems to restart more often when it is in use. The average restart rate is ~3-4 per hour (see the screenshot from GCP). Restarts are triggered by k8s and are caused by failed health checks.
- When the pod does not restart, it shows what looks like a severe memory leak (see the screenshot from GCP).
- There are no meaningful logs that make it possible to understand which usage, if any, is causing this problem.
Recovery
After disabling the task processor, the Flagsmith instance works fine. There are no visible changes in either functionality or memory consumption of the 2 other pods. So it looks like the task processor only causes problems and does not help in any way.
taskProcessor:
  enabled: true
Steps To Reproduce
- Set up Flagsmith in k8s with the task processor enabled
- Use normally (some API calls coming from mobile devices)
Expected behavior
- No large numbers of unhealthy threads. If they are caused by certain calls, there should be clear error messages in the logs
- No task processor pod restarts
- No memory leaks inside the task processor pod
Screenshots
Dear Flagsmith team, do you have any update/idea about the root cause of this issue?
As discussed with @khvn26, we could not reproduce this when pods are correctly configured with limits and requests. We will be adding default values for limits and requests in a future release. See issue here for context.
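For anyone hitting this before defaults land in the chart, here is a minimal sketch of what setting requests and limits for the task processor might look like in the Helm values. It assumes the chart accepts a standard Kubernetes `resources` block under `taskProcessor`, which is not confirmed in this thread, so check the chart's values file first. The CPU and memory numbers are placeholders, not recommendations.

```yaml
taskProcessor:
  enabled: true
  # Assumption: the chart passes this block through to the container spec.
  # The requests/limits layout itself is standard Kubernetes.
  resources:
    requests:
      cpu: 250m       # placeholder value
      memory: 512Mi   # placeholder value
    limits:
      cpu: 500m       # placeholder value
      memory: 1Gi     # placeholder value
```

With a memory limit in place, a pod that keeps growing is terminated at the limit and should show an OOMKilled reason in its status, which may help tell a real leak apart from restarts caused by failed health checks.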