Grafana Agent Operator - Logs: Too many open files
Not sure how to reproduce the bug. It happened the first time I tried Grafana Agent Operator, with a PodLogs configured to retrieve logs from all pods in all namespaces of the K8s cluster.
With a more narrowly scoped PodLogs, the bug happens on 1 of my 6 clusters.
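For reference, the cluster-wide PodLogs I tried first looked roughly like the sketch below (not my exact manifest; the name is illustrative, and namespaceSelector.any should match every namespace if I read the CRD correctly):
apiVersion: monitoring.grafana.com/v1alpha1
kind: PodLogs
metadata:
  labels:
    instance: primary
  name: all-namespaces   # illustrative name, not the real one
  namespace: monitoring
spec:
  namespaceSelector:
    any: true            # select pods from every namespace
  selector:
    matchLabels: {}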
Below is the log of the config-reloader container of the grafana-agent-logs pod:
add config file /var/lib/grafana-agent/config-in/agent.yml to watcher: create watcher: too many open files
As a result, the grafana-agent container of the grafana-agent-logs pod keeps crashing (CrashLoopBackOff) because of:
error loading config file /var/lib/grafana-agent/config/agent.yml: error reading config file open /var/lib/grafana-agent/config/agent.yml: no such file or directory
Below are my PodLogs:
apiVersion: monitoring.grafana.com/v1alpha1
kind: PodLogs
metadata:
  labels:
    instance: primary
  name: system
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - kube-system
      - external-secrets
  selector:
    matchLabels: {}
---
apiVersion: monitoring.grafana.com/v1alpha1
kind: PodLogs
metadata:
  labels:
    instance: primary
  name: split
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - split
      - ingress-nginx
  selector:
    matchLabels: {}
After 11 hours of CrashLoopBackOff, the pod is now running...
Log of the config-reloader container of the grafana-agent-logs pod:
msg="started watching config file and directories for changes" cfg=/var/lib/grafana-agent/config-in/agent.yml out=/var/lib/grafana-agent/config/agent.yml dirs=
Hi Julien! 👋 This looks like an error coming from fsnotify.
Where are you running your cluster (is it a managed service from a cloud provider, an on-prem installation, or just a local cluster on your laptop)? Could you check the relevant system-imposed limits on open files and see whether they could be the cause?
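Something along these lines, run directly on the node, would show the relevant values (a sketch; it assumes a Linux node with sysctl and procfs available, and that the agent process matches the name grafana-agent):
# Current inotify limits on the node
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
# Per-process open file limit of the running agent (process name is an assumption)
cat /proc/$(pgrep -f grafana-agent | head -n1)/limits | grep -i 'open files'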
Hello! All my clusters are single-node, K3s-based (v1.23.5), running on cloud VMs (AWS, GCP, Scaleway). The problem occurred on a GCP e2-medium instance.
Hey, apologies for taking so long to get back to you; the notification got lost in all the noise.
I'm not sure what the cause is here, but I'd look into a possible K3s issue and the default Linux parameters across the different cloud providers and distros. For example, could it be similar to the issue reported here?
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!
A possible workaround for this problem is described here: https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files
I'll leave the issue open for now as it might be a bug in the operator and needs more investigation.
I'm going to close this as won't fix, since this doesn't appear to be a code problem and more of an environment problem (e.g., increase fs.inotify limits).
If you go through that workaround and you're still running into issues, please open a new issue so we can track it; updates in closed issues may get missed.
I have the same issue.
kubectl logs -n fluent-bit loki-a2444106-logs-2rbmq
2024/01/18 23:20:21 error loading config file /var/lib/grafana-agent/config/agent.yml: error reading config file open /var/lib/grafana-agent/config/agent.yml: no such file or directory
Fixed with:
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
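Note that values set with sysctl this way do not survive a reboot; to persist them you could write them to a sysctl.d drop-in, for example (the filename is just an illustration):
# Persist the inotify limits across reboots (filename is illustrative)
cat <<EOF | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
EOF
sudo sysctl --system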