[newrelic-logging] Default resource limits cause out of memory errors
Description
An issue about this has been opened before, and the reporter was instructed to make sure they had upgraded their chart so that the memory limit config on the input was present.
https://github.com/newrelic/helm-charts/blob/ab2d1bab9f09d94ea6ca56fed807dd20eae5444e/charts/newrelic-logging/values.yaml#L104
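For reference, that input-level memory limit is the Mem_Buf_Limit setting on the chart's tail input; a rough sketch of what it looks like, based on the linked values.yaml (exact values may differ between chart versions):

```
[INPUT]
    Name              tail
    Tag               kube.*
    Path              ${PATH}
    DB                ${FB_DB}
    # per-input memory cap referenced above (assumed default from the chart)
    Mem_Buf_Limit     7MB
    Skip_Long_Lines   On
    Refresh_Interval  10
```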
We have been struggling with OOM errors and restarts on our pods despite this config being present and despite raising the pod's memory allowance. We have about 50 pods per node.
The Helm config used for this was:
```yaml
newrelic-logging:
  enabled: true
  fluentBit:
    criEnabled: true
  lowDataMode: false
  resources:
    limits:
      memory: 256Mi
  tolerations:
    - effect: NoSchedule
      key: role
      operator: Exists
```
| Date | Message |
|---|---|
| 2024-10-08 05:11:23 | Memory cgroup out of memory: Killed process 1360652 (flb-pipeline) total-vm:1307336kB, anon-rss:259736kB, file-rss:19648kB, shmem-rss:0kB, UID:0 pgtables:1104kB oom_score_adj:996 |
| 2024-10-08 05:11:23 | Memory cgroup out of memory: Killed process 1400772 (fluent-bit) total-vm:1311176kB, anon-rss:259508kB, file-rss:19084kB, shmem-rss:0kB, UID:0 pgtables:1028kB oom_score_adj:996 |
| 2024-10-08 05:11:23 | Memory cgroup out of memory: Killed process 1400790 (flb-pipeline) total-vm:1311176kB, anon-rss:259652kB, file-rss:19468kB, shmem-rss:0kB, UID:0 pgtables:1028kB oom_score_adj:996 |
| 2024-10-08 05:11:23 | Memory cgroup out of memory: Killed process 1360626 (fluent-bit) total-vm:1307336kB, anon-rss:259624kB, file-rss:19264kB, shmem-rss:0kB, UID:0 pgtables:1104kB oom_score_adj:996 |
| 2024-10-08 05:11:23 | Memory cgroup out of memory: Killed process 1201131 (flb-pipeline) total-vm:1483464kB, anon-rss:259504kB, file-rss:19828kB, shmem-rss:0kB, UID:0 pgtables:1324kB oom_score_adj:996 |
| 2024-10-08 05:11:23 | Memory cgroup out of memory: Killed process 1201113 (fluent-bit) total-vm:1483464kB, anon-rss:259392kB, file-rss:19444kB, shmem-rss:0kB, UID:0 pgtables:1324kB oom_score_adj:996 |
| 2024-10-08 05:11:23 | Memory cgroup out of memory: Killed process 1266468 (flb-pipeline) total-vm:1487560kB, anon-rss:259188kB, file-rss:19628kB, shmem-rss:0kB, UID:0 pgtables:1344kB oom_score_adj:996 |
| 2024-10-08 05:11:23 | Memory cgroup out of memory: Killed process 1324063 (fluent-bit) total-vm:1487560kB, anon-rss:259368kB, file-rss:19368kB, shmem-rss:0kB, UID:0 pgtables:1348kB oom_score_adj:996 |
| 2024-10-08 05:11:23 | Memory cgroup out of memory: Killed process 1324081 (flb-pipeline) total-vm:1487560kB, anon-rss:259476kB, file-rss:19752kB, shmem-rss:0kB, UID:0 pgtables:1348kB oom_score_adj:996 |
| 2024-10-08 05:11:23 | Memory cgroup out of memory: Killed process 1266420 (fluent-bit) total-vm:1487560kB, anon-rss:259084kB, file-rss:19244kB, shmem-rss:0kB, UID:0 pgtables:1344kB oom_score_adj:996 |
Versions
- Helm: v3.14.4
- Kubernetes (AKS): 1.29.2
- Chart: nri-bundle-5.0.81
- Fluent Bit: newrelic/newrelic-fluentbit-output:2.0.0
What happened?
The Fluent Bit pods were repeatedly killed for using more memory than their limit, which is set very low. Their CPU was never highly utilised, which does not suggest that the memory growth was due to throttling or not being able to keep up.
What you expected to happen?
Fluent Bit should have little to no restarts, and it should never reach 1.5 GB of memory used per container.
How to reproduce it?
Using the same versions as listed above and the same Helm values.yaml, deploy an AKS cluster with 50 production workloads per node (2 vCPU, 8 GB) and observe whether there are memory issues.
https://new-relic.atlassian.net/browse/NR-323574
@hero-david Did you have any luck resolving this? We're seeing the same problem with AKS
No, we have simply upped our VM SKU to 16 GB (required for some of our workloads moving forward anyway).
@hero-david Thanks for submitting this issue.
We are currently benchmarking Fluent Bit memory usage for the latest version of our chart using resources similar to those you described above (2 vCPU, 8 GB). We have been able to reproduce this OOM behaviour for a large number of input files (100+), and we are working on recommendations for this scenario. We have also noted this upstream issue, which describes similar behaviour.
Given that we have a wide range of customer use cases to support, including cases with extreme resource limitations, it's unlikely that we would change our default resource configs for this use case.
However, we do want to provide recommendations for customers like you who have more log files to ingest from a cluster.
We will follow up here when we have results of our benchmark.
Hi @hero-david
We have analysed this issue in detail and here are our findings:
- As the number of files to be tailed increases, Fluent Bit allocates separate resources for each file, and the pod's resource usage keeps increasing.
- Once a particular file has been ingested, Fluent Bit still keeps resources allocated for that file even though there is no further ingest.
- The 128Mi of memory allocated to the pod in our default Helm chart config is not sufficient to handle ingest from 100+ files at once.
- To ingest a larger number of files, we recommend enabling filesystem buffering. This approach is also suggested in Fluent Bit's official documentation to mitigate data loss. Note that filesystem buffering is not directly supported through our default Helm chart configuration; to enable these changes, update the ConfigMap in your Kubernetes cluster accordingly. For more details, refer to Fluent Bit's Buffering and Storage configuration documentation.
- By enabling filesystem buffering, we were able to ingest a large number of files at once using the default memory settings.
- A sample configuration for this follows:
```yaml
newrelic-logging:
  enabled: true
  fluentBit:
    config:
      service: |
        [SERVICE]
            Flush         1
            Log_Level     ${LOG_LEVEL}
            Daemon        off
            Parsers_File  parsers.conf
            HTTP_Server   On
            HTTP_Listen   0.0.0.0
            HTTP_Port     2020
            storage.path  /var/log/flb-storage/
            storage.sync  full
      inputs: |
        [INPUT]
            Name              tail
            Alias             pod-logs-tailer
            Tag               kube.*
            Path              ${PATH}
            multiline.parser  ${LOG_PARSER}
            DB                ${FB_DB}
            Mem_Buf_Limit     7MB
            Skip_Long_Lines   On
            Refresh_Interval  10
            storage.type      filesystem
            storage.max_chunks_up 128
            storage.pause_on_chunks_overlimit On
```
Note: While enabling filesystem buffering is recommended to handle heavy ingest, it will cause extra resource usage on the node, since chunks will be stored under the storage.path mentioned above and a bounded number of chunks are still kept in memory. You need to tune the storage.max_chunks_up parameter based on your node's memory availability.
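As a rough sizing sketch (assuming chunks of roughly 2 MB each, as described in the Fluent Bit buffering documentation), storage.max_chunks_up 128 allows up to about 256 MB of chunk data to be held in memory, which on its own exceeds a 128Mi pod limit. A hypothetical tuning for a tighter memory limit could look like this (only the storage-related lines change):

```
[INPUT]
    Name              tail
    Tag               kube.*
    Path              ${PATH}
    storage.type      filesystem
    # ~32 chunks x ~2 MB per chunk ≈ 64 MB held in memory; adjust to your pod limit
    storage.max_chunks_up 32
    storage.pause_on_chunks_overlimit On
```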