
Alerting: Large number of connections from Grafana to the external Prometheus Alertmanager when sending alert notifications; connections are never cleaned up

Open Raghul-Vasu opened this issue 11 months ago • 10 comments

What happened?

I have Grafana alert rules enabled with Unified Alerting and an external Prometheus Alertmanager configured as a contact point.

Every alert notification opens a new connection to Alertmanager. After 3-4 days I have 10k+ connections to Alertmanager in the ESTABLISHED state, and the OS runs into open-file problems.

Also, the Grafana systemd service has a max open-files limit configured: LimitNOFILE=10000

My system shows the following:

# netstat -lntpa | grep 9093 | wc -l
19979

Because the open files (active connections) crossed the limit, subsequent connections to Alertmanager failed. Grafana then made multiple attempts to send notifications, all of which failed. Eventually Grafana went down and Alertmanager fell behind.

Grafana logs:

logger=ngalert.notifier.prometheus-alertmanager t=2024-03-20T12:21:37.758679171-05:00 level=warn msg="failed to send to Alertmanager" error="Post \"http://admin:9093/api/v1/alerts\": dial tcp 172.23.0.1:9093: socket: too many open files" alertmanager=cp_1 url=http://admin:9093/api/v1/alerts
logger=ngalert.notifier.prometheus-alertmanager t=2024-03-20T12:21:37.758747299-05:00 level=warn msg="all attempts to send to Alertmanager failed" alertmanager=cp_1
logger=alertmanager org=1 t=2024-03-20T12:21:37.758796361-05:00 level=error component=alertmanager orgID=1 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cp_1/prometheus-alertmanager[0]: notify retry canceled due to unrecoverable error after 1 attempts: failed to send alert to Alertmanager: Post \"http://admin:9093/api/v1/alerts\": dial tcp 172.23.0.1:9093: socket: too many open files"

logger=provisioning.dashboard type=file name=Dashboards t=2024-03-20T12:21:44.865037298-05:00 level=error msg="failed to search for dashboards" error="open /var/lib/grafana/dashboards/cray-EX: too many open files"
logger=provisioning.dashboard type=file name=Dashboards t=2024-03-20T12:21:54.86600488-05:00 level=error msg="failed to search for dashboards" error="open /var/lib/grafana/dashboards/cray-EX: too many open files"

What did you expect to happen?

Connections from Grafana to Alertmanager should normally be closed (or reused) after a while, but Grafana 9.x never closes them; it opens a new connection every time it sends an alert notification.

I believe Grafana 7.x, with the legacy alerting framework, creates one connection and reuses it for alert notifications.

Did this work before?

I am not sure about this.

How do we reproduce it?

  1. Add an external Prometheus Alertmanager data source in datasource.yml or via the UI, with handleGrafanaManagedAlerts enabled:

     name: Alertmanager
     type: alertmanager
     access: proxy
     url: http://admin:9093
     jsonData:
       handleGrafanaManagedAlerts: true
       implementation: prometheus

  2. Add an Alertmanager contact point (a provisioning sketch follows this list).
  3. Create sample alert rules and let them fire so alerts are sent to Alertmanager,
  4. or use test notifications to send a notification.
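
A rough sketch of how the contact point in step 2 could be file-provisioned, assuming Grafana's alerting provisioning format; the name, uid, and URL here are placeholders and the exact schema may differ between Grafana versions:

  apiVersion: 1
  contactPoints:
    - orgId: 1
      name: cp_1
      receivers:
        - uid: cp_1_alertmanager
          type: prometheus-alertmanager
          settings:
            url: http://admin:9093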

Every time we click "send test notification", Grafana opens a new connection to Alertmanager.
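
The growth is easy to watch while repeating the test notification; for example (assuming ss and watch are available, and 9093 is the Alertmanager port):

  # count sockets to Alertmanager every 2 seconds; the number climbs with each test notification
  watch -n 2 'ss -tn | grep 9093 | wc -l'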

Is the bug inside a dashboard panel?

No

Environment (with versions)?

Grafana: 9.5.5
OS: Linux SLES 15 SP5
Browser: Chrome, Safari, Firefox (any)

Grafana platform?

A package manager (APT, YUM, BREW, etc.)

Datasource(s)?

Prometheus Alertmanager

Raghul-Vasu avatar Mar 20 '24 18:03 Raghul-Vasu

Ran into the same issue this morning, also less than a week after adding the external Alertmanager config. Grafana isn't properly closing the sockets.

Grafana: 10.2.5
OS: Rocky Linux 9.3 (Blue Onyx)

# lsof -p <pid> | grep -i copycat | wc -l
9988
notify retry canceled due to unrecoverable error after 1 attempts: failed to send alert to Alertmanager: Post \"http://***:9093/api/v1/alerts\": dial tcp ***:9093: socket: too many open files"
server.go:3214: http: Accept error: accept tcp [::]:3000: accept4: too many open files; retrying in 1s

Obviously raising the file descriptor limit will help in the short term, but Grafana will eventually run out of sockets again since it isn't closing them properly.
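
For the short-term mitigation, a minimal sketch of raising the limit with a systemd drop-in; the file path and the 65536 value are examples, not a recommendation:

  # /etc/systemd/system/grafana-server.service.d/override.conf
  [Service]
  LimitNOFILE=65536

Followed by systemctl daemon-reload and a restart of grafana-server to apply it.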

The alerting policy and Alertmanager config are set up via Terraform:

resource "grafana_notification_policy" "default_notification" {
  group_by      = ["..."]
  contact_point = grafana_contact_point.alertmanagers.name
  group_wait    = "15s"
  group_interval  = "1m"
  repeat_interval = "1m"
}

The socket count is increasing by about 4 every 1-2 minutes (roughly 3,000-6,000 per day), so the 10,000 descriptor limit is exhausted in anywhere from 2-4 days.

spencecopper avatar Mar 20 '24 19:03 spencecopper

A PCAP shows Grafana sends the alert, gets its response from Alertmanager, then sends TCP keep-alives every 15 seconds indefinitely, never closing the connection. Grafana then opens another socket for the next request to /api/v1/alerts, never reusing the connection.

Prometheus -> Alertmanager, by contrast, uses a single socket for all alerts and keeps reusing it for any subsequent alerts.
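
The 15-second keep-alives and the one-socket-per-request behavior are consistent with what happens in Go when an HTTP transport is rebuilt for every notification instead of being shared. Purely as an illustration of that pattern (this is not Grafana's actual notifier code), a minimal sketch:

  package main

  import (
  	"bytes"
  	"net/http"
  	"time"
  )

  // leakyPost builds a new Transport (and therefore a new connection pool) on
  // every call. The idle connection it leaves behind is kept alive by TCP
  // keep-alives (Go's default dialer currently uses a 15s period) but is never
  // reused and never closed, so each notification adds one ESTABLISHED socket.
  func leakyPost(url string, body []byte) error {
  	client := &http.Client{
  		Transport: &http.Transport{}, // fresh pool per request
  		Timeout:   30 * time.Second,
  	}
  	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
  	if err != nil {
  		return err
  	}
  	return resp.Body.Close()
  }

  // sharedClient is the pattern Prometheus-style senders follow: one client,
  // one connection pool, and the same socket reused for every subsequent alert.
  var sharedClient = &http.Client{Timeout: 30 * time.Second}

  func reusedPost(url string, body []byte) error {
  	resp, err := sharedClient.Post(url, "application/json", bytes.NewReader(body))
  	if err != nil {
  		return err
  	}
  	return resp.Body.Close()
  }

  func main() {
  	// hypothetical Alertmanager endpoint, matching the one in this issue
  	_ = leakyPost("http://admin:9093/api/v1/alerts", []byte(`[]`))
  	_ = reusedPost("http://admin:9093/api/v1/alerts", []byte(`[]`))
  }

If the notifier is being rebuilt like this on every send, either reusing a single client or calling CloseIdleConnections() on the old transport would stop the socket growth.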

spencecopper avatar Mar 20 '24 22:03 spencecopper

I am running into the same issue on Grafana v10.4.0, and it makes it impossible to RDP into the Windows server until Grafana has been restarted. That is bad considering "Legacy alerting will be removed in Grafana v11.0.0, and it is recommended that we upgrade to Grafana Alerting as soon as possible."

kago-dk avatar Mar 21 '24 02:03 kago-dk

> A PCAP shows Grafana sends the alert, gets its response from Alertmanager, then sends TCP keep-alives every 15 seconds indefinitely, never closing the connection. Grafana then opens another socket for the next request to /api/v1/alerts, never reusing the connection.
>
> Prometheus -> Alertmanager, by contrast, uses a single socket for all alerts and keeps reusing it for any subsequent alerts.

I am also using opensearch-alerting to send alerts to Alertmanager, and it uses only one socket for all alerts and keeps reusing it. Only Grafana Alerting has this issue.

Raghul-Vasu avatar Mar 21 '24 11:03 Raghul-Vasu

> I am running into the same issue on Grafana v10.4.0, and it makes it impossible to RDP into the Windows server until Grafana has been restarted. That is bad considering "Legacy alerting will be removed in Grafana v11.0.0, and it is recommended that we upgrade to Grafana Alerting as soon as possible."

I agree with you. I've decided to add a cron job that restarts the grafana-server service every midnight to clean up the stale socket connections and keep the connection count within the limit (the worst kind of workaround, I know).
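
For reference, that workaround amounts to a single cron entry; a minimal sketch assuming systemd manages grafana-server (the file path is just an example):

  # /etc/cron.d/grafana-nightly-restart
  0 0 * * * root systemctl restart grafana-server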

Raghul-Vasu avatar Mar 21 '24 11:03 Raghul-Vasu

I also encountered the same problem: Grafana alerting generates a large number of "too many open files" errors, causing the Grafana process to become unresponsive.

I use Grafana to send alerts to my own Alertmanager. After running normally for 3 days, the Grafana service becomes unhealthy. Restarting the Grafana service recovers it, and checking the relevant logs shows socket: too many open files.

Grafana Version 10.3.3

zxl181212 avatar Apr 15 '24 11:04 zxl181212

We recommend using external alertmanager instead of the alertmanager integration.

Does this solve your problem?

KaplanSarah avatar May 02 '24 15:05 KaplanSarah

@spencecopper reported using an external Alertmanager in his comment, and it shows the same problem that @Raghul-Vasu reported. Based on reading the issue, I would say it would not solve the problem.

jhansonhpe avatar May 02 '24 15:05 jhansonhpe

> We recommend using external alertmanager instead of the alertmanager integration.
>
> Does this solve your problem?

@KaplanSarah We use an external Alertmanager only and still see this issue.

Both internal and external have this issue.

Raghul-Vasu avatar May 03 '24 12:05 Raghul-Vasu

Any plans to fix it?

kago-dk avatar Jun 28 '24 17:06 kago-dk