
Grafana Agent Operator - Logs: Too many open files

Open jallaix opened this issue 2 years ago • 6 comments

I'm not sure how to reproduce the bug. It happened the first time I tried the Grafana Agent Operator, with a PodLogs resource configured to retrieve logs from all pods in all namespaces of the K8s cluster.

With PodLogs scoped more narrowly, the bug happens on 1 of my 6 clusters.

Below is the log of the config-reloader container of the grafana-agent-logs pod:

add config file /var/lib/grafana-agent/config-in/agent.yml to watcher: create watcher: too many open files

As a result, the grafana-agent container of the grafana-agent-logs pod keeps crashing (CrashLoopBackOff), since the failed config-reloader never writes out the rendered config file the agent reads:

error loading config file /var/lib/grafana-agent/config/agent.yml: error reading config file open /var/lib/grafana-agent/config/agent.yml: no such file or directory

Below are my PodLogs:

apiVersion: monitoring.grafana.com/v1alpha1
kind: PodLogs
metadata:
  labels:
    instance: primary
  name: system
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
    - kube-system
    - external-secrets
  selector:
    matchLabels: {}
---
apiVersion: monitoring.grafana.com/v1alpha1
kind: PodLogs
metadata:
  labels:
    instance: primary
  name: split
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
    - split
    - ingress-nginx
  selector:
    matchLabels: {}
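
An empty matchLabels selector matches every pod in the listed namespaces, so every matched pod's log files may be tailed and watched, adding to the node's inotify usage. A rough, generic-Linux way to see how many inotify instances are currently open on a node (run as root to see all processes; not specific to the Agent) is:

find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | wc -l   # counts open inotify instances across all processes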

jallaix avatar Jul 01 '22 08:07 jallaix

After 11 hours of CrashLoopBackOff, the pod is now running...

Log of the config-reloader container of the grafana-agent-logs pod:

msg="started watching config file and directories for changes" cfg=/var/lib/grafana-agent/config-in/agent.yml out=/var/lib/grafana-agent/config/agent.yml dirs=

jallaix avatar Jul 01 '22 08:07 jallaix

Hi Julien! 👋 This looks like an error coming from fsnotify.

Where are you running your cluster (is it a managed service from a cloud provider, an on-prem installation, or just a local cluster on your laptop)? Could you check the relevant system-imposed limits on open files and see if they could be the cause?
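
On a Linux node, the relevant limits can be read with sysctl (these keys are standard, though defaults vary by distro):

sysctl fs.inotify.max_user_instances   # one inotify instance per fsnotify watcher; often defaults to 128
sysctl fs.inotify.max_user_watches     # one watch per watched file or directory
ulimit -n                              # per-process open-file limit, for completeness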

tpaschalis avatar Jul 04 '22 13:07 tpaschalis

Hello! All my clusters are single-node K3s-based (v1.23.5), running on cloud VMs (AWS, GCP, Scaleway). My problem occurred on a GCP e2-medium instance.

jallaix avatar Jul 04 '22 19:07 jallaix

Hey, apologies for taking so long to get back to you; the notification got lost in all the noise.

I'm not sure what's going on here, but I'd look into a possible K3s issue and the differences in default Linux parameters between cloud providers and distros. For example, could it be similar to the issue reported here?

tpaschalis avatar Jul 12 '22 15:07 tpaschalis

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!

github-actions[bot] avatar Aug 20 '22 00:08 github-actions[bot]

A possible workaround to fix this problem would be this: https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files

I'll leave the issue open for now as it might be a bug in the operator and needs more investigation.
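
For reference, the linked page suggests raising the inotify sysctls along these lines (values taken from that page; tune them to your environment):

sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512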

marctc avatar Sep 06 '22 08:09 marctc

I'm going to close this as won't-fix, since this doesn't appear to be a code problem but rather an environment problem (e.g., the fs.inotify limits need to be increased).

If you go through that workaround and you're still running into issues, please open a new issue so we can track it; updates in closed issues may get missed.

rfratto avatar Nov 03 '22 14:11 rfratto

I have the same issue.

kubectl logs -n fluent-bit loki-a2444106-logs-2rbmq

2024/01/18 23:20:21 error loading config file /var/lib/grafana-agent/config/agent.yml: error reading config file open /var/lib/grafana-agent/config/agent.yml: no such file or directory

Fixed with:

sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
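
Note that values set with sysctl this way don't survive a reboot. To persist them, one option (a sketch assuming a distro that reads /etc/sysctl.d; the file name below is arbitrary) is:

cat <<'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
# Raise inotify limits so fsnotify watchers (e.g. the config-reloader's) can start
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
EOF
sudo sysctl --system   # reload settings from all sysctl configuration files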

omidraha avatar Jan 18 '24 23:01 omidraha