
Load explodes after every reload/restart of Icinga 2

Open pgress opened this issue 8 years ago • 49 comments

We have a 2-master cluster which gets very high load whenever the core is reloaded. The two nodes have 8 vCPUs and 16 GB RAM each, so raw power shouldn't be a problem at all. We currently have about 700 hosts with about 7000 services.

We already debugged this problem a bit and found out that no check is executed until 5 to 6 minutes have passed. After this delay all checks start together, which results in the high load. When we use a single node instead of a cluster we don't have the problem; the checks start immediately after the reload.

  • Version used (icinga2 --version): r2.7.0-1
  • Operating System and version: Debian GNU/Linux 8.9 (jessie)
  • Enabled features (icinga2 feature list): api checker command graphite ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About): cube | 1.0.0 graphite | 0.0.0.5 monitoring | 2.4.1
  • Config validation (icinga2 daemon -C): No problems
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
host001:/root # icinga2 object list --type Endpoint
Object 'host001.localdomain' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 1:0-1:47
  * __name = "host001.localdomain"
  * host = "10.0.0.133"
    % = modified in '/etc/icinga2/zones.conf', lines 2:3-2:22
  * log_duration = 86400
  * name = "host001.localdomain"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 0
    * first_line = 1
    * last_column = 47
    * last_line = 1
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "host001.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 1:0-1:47
  * type = "Endpoint"
  * zone = ""

Object 'host002.localdomain' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 5:1-5:48
  * __name = "host002.localdomain"
  * host = "10.0.0.134"
    % = modified in '/etc/icinga2/zones.conf', lines 6:3-6:22
  * log_duration = 86400
  * name = "host002.localdomain"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 5
    * last_column = 48
    * last_line = 5
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "host002.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 5:1-5:48
  * type = "Endpoint"
  * zone = ""
host001:/root # icinga2 object list --type Zone
Object 'global_zone' of type 'Zone':
  % declared in '/etc/icinga2/zones.conf', lines 13:1-13:25
  * __name = "global_zone"
  * endpoints = null
  * global = true
    % = modified in '/etc/icinga2/zones.conf', lines 14:3-14:15
  * name = "global_zone"
  * package = "_etc"
  * parent = ""
  * source_location
    * first_column = 1
    * first_line = 13
    * last_column = 25
    * last_line = 13
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "global_zone" ]
    % = modified in '/etc/icinga2/zones.conf', lines 13:1-13:25
  * type = "Zone"
  * zone = ""

Object 'master' of type 'Zone':
  % declared in '/etc/icinga2/zones.conf', lines 9:1-9:20
  * __name = "master"
  * endpoints = [ "host001.localdomain", "host002.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 10:3-10:84
  * global = false
  * name = "master"
  * package = "_etc"
  * parent = ""
  * source_location
    * first_column = 1
    * first_line = 9
    * last_column = 20
    * last_line = 9
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "master" ]
    % = modified in '/etc/icinga2/zones.conf', lines 9:1-9:20
  * type = "Zone"
  * zone = ""

pgress avatar Aug 07 '17 08:08 pgress

Do you have a specific performance analysis, including graphs of work queues, enabled features, checks, etc.? It is hard to tell what exactly could cause this without more insight.

https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#analyze-your-environment
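
If graphing isn't set up yet, a quick snapshot of the feature work queues can also be pulled from the REST API, roughly like this (API credentials and host are placeholders, assuming the api feature is enabled and an ApiUser exists):

host001:/root # curl -k -s -u root:icinga 'https://localhost:5665/v1/status' | python -m json.tool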

dnsmichi avatar Aug 08 '17 11:08 dnsmichi

Here is a snapshot of some of our graphs: https://snapshot.raintank.io/dashboard/snapshot/XX5gtmp2yf4nnIXJ1oIpWt1FA263Um2S?orgId=2 It shows the correlation between load and uptime. Additionally, all perfdata from the icinga check is in the third panel. When I watched it live, I could see that all CPU cores were maxed out. Used memory was growing, but not running full. There was only minor I/O on the disk. I've uploaded a part of the icinga2 log where you can see that nothing happened for several minutes. icinga2-log.txt

pgress avatar Aug 08 '17 11:08 pgress

Thanks for the graphs. The last one uses the icinga check, which provides additional metrics about work queues in 2.7 - do you happen to have some stats/graphs on that too?
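
For reference, those metrics come from the built-in icinga CheckCommand; attaching it to a master host looks roughly like this (host name taken from the endpoint list above, the rest is a sketch rather than the reporter's actual config):

object Service "icinga" {
  host_name = "host001.localdomain"
  check_command = "icinga"
}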

Logs look fine, nothing spectacular. This change in CPU load could come from the recent work queue additions for all features, e.g. graphite.

(for future reference - I modified the URL with render/ and added the screenshot here)

[Screenshot: Grafana snapshot XX5gtmp2yf4nnIXJ1oIpWt1FA263Um2S]

dnsmichi avatar Aug 09 '17 16:08 dnsmichi

Does this still happen with 2.8? There were several improvements for the cluster in this release.

Cheers, Michael

dnsmichi avatar Jan 08 '18 13:01 dnsmichi

In my setup yes.

Single Master:

  • CPU: Intel Xeon E3-1230 v5
  • RAM: 32GB DDR4 ECC
  • Storage: 2x 300GB SAS (Root FS) + 1 250GB SSD for Graphite

Our instance contains 512 hosts and 5368 services. 5 of them are Icinga 2 clients; the remaining hosts are checked via NRPE.

  • Version used (icinga2 --version): r2.8.0-1
  • Operating System and version: Debian GNU/Linux 8.10 (jessie)
  • Enabled features (icinga2 feature list): api checker command graphite ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About): businessprocess | 2.1.0 director | 1.4.2 fileshipper | 0.9.3 graphite | 0.0.0.5 monitoring | 2.5.0 vsphere | 1.1.0
  • Config validation (icinga2 daemon -C): No Problems

[Screenshot: Grafana - icinga2 load, 2018-01-17 10:37]

uffsalot avatar Jan 17 '18 09:01 uffsalot

It happens in our setup, too. A year ago it was so bad that the master cluster could not catch up anymore, so we had to expand our cluster with some satellites. Most of our checks are SNMP-based, e.g. check_nwc_health. We have 7959 hosts and 16156 services. I tried to flatten the peaks by limiting concurrent_checks = 256, but this had no effect on version 2.8.0.
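
For reference, that limit is normally set via the MaxConcurrentChecks constant in constants.conf, or on older setups via the checker feature's concurrent_checks attribute; a minimal sketch using the value from above, with file paths assumed to be the package defaults:

// /etc/icinga2/constants.conf
const MaxConcurrentChecks = 256

// /etc/icinga2/features-available/checker.conf
object CheckerComponent "checker" {
  concurrent_checks = 256
}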

Simkimdm avatar Jan 18 '18 08:01 Simkimdm

How many of these services actually invoke check_nwc_health? How's the average execution time and latency for these checks?

dnsmichi avatar Apr 05 '18 14:04 dnsmichi

@dnsmichi: this is a real issue, the root cause is our scheduling code in Icinga 2. Have a look at how Icinga 1.x tried to avoid such problems. It was far from perfect, and the 1.x scheduling code is a mess - but its basics were well thought out.

This issue is a combination of checks being rescheduled at restart (that's a bug, it shouldn't happen) and a rescheduling/splaying logic where every checkable (and not a central scheduler) decides on its own when to run the next check.

There is a related issue in our Icinga Support issue tracker. You'll find it together with a debug log, hints on how to filter the log, and a link pointing to the place in our code responsible for the "checks are not being executed for a long time" effect as explained by @pgress. You might want to talk to @gunnarbeutner, he should be aware of this issue - we discussed it quite some time ago.

The "reschedule everything when starting" issue should be easy to fix. Most of the heavy spikes shown by @uffsalot would then no longer appear. As you can see he has an environment with not too many monitored objects and not the greatest and latest but absolutely sufficient hardware. Especially given that most of his checks are NRPE-based, he should never experience what his graphs are showing.

Cheers, Thomas

NB: Sooner or later we should consider implementing scheduler logic that takes the number of active checks, their intervals and their average execution times into account. It should try to distribute new checks fairly while respecting and preserving the current schedule of existing ones.

Thomas-Gelf avatar Apr 05 '18 22:04 Thomas-Gelf

We were also seeing the same thing when we tried to upgrade to Icinga 2 at Miraheze, using 1 core and 1 GB of RAM.

The CPU usage shot up after checks started running, causing OOM errors and high CPU load.

[Screenshot, 2018-04-26 01:11]

paladox avatar Apr 26 '18 00:04 paladox

There are some changes for this in current git master and the snapshot packages, which will go into 2.9. This is scheduled for June.

dnsmichi avatar May 09 '18 13:05 dnsmichi

@dnsmichi oh to reduce load?

Which changes? :)

paladox avatar May 09 '18 13:05 paladox

To influence check scheduling upon reload. Snapshot packages should be available for testing already.

dnsmichi avatar May 09 '18 13:05 dnsmichi

I have the same problem in a customer's setup. I'll try to get some tests/feedback from them as well.

widhalmt avatar May 29 '18 15:05 widhalmt

I'm guessing this https://github.com/Icinga/icinga2/commit/1a9c1591c0c13603b1dee6cfb514e6ec7c309450 is the fix.

paladox avatar Jun 01 '18 16:06 paladox

1a9c159 is not about this issue, although it might redistribute the load of early checks.

NB: Sooner or later we should consider implementing scheduler logic that takes the number of active checks, their intervals and their average execution times into account. It should try to distribute new checks fairly while respecting and preserving the current schedule of existing ones.

Here we have the problem that we do not know much about the checks and can't make many guesses based on that information. A high execution time does not mean high load, a long check interval does not necessarily mean the check is heavy, and the number of active checks tells us nothing concrete either. The only thing that comes to my mind is better randomization of checks with similar check intervals.

Crunsher avatar Jun 04 '18 12:06 Crunsher

I'd suggest testing the snapshot packages and reporting back whether the problem persists or not.

dnsmichi avatar Jul 03 '18 18:07 dnsmichi

Anycast ping: is anyone still experiencing this issue with a recent Icinga 2 version?

Crunsher avatar Sep 04 '18 12:09 Crunsher

Might also be related to a setup where the master actively connects to all clients, see #6517.

dnsmichi avatar Sep 14 '18 17:09 dnsmichi

We have experienced less load since using 2.9. We made sure only one check is run at one time.

(This is with nagios-nrpe-server (check_nrpe); we don't use the Icinga 2 client.)

paladox avatar Sep 14 '18 17:09 paladox

I can also see this with the very latest version, 2.9.2. After a config reload, the satellites get a very high load. [Screenshot]

MarcusCaepio avatar Sep 27 '18 08:09 MarcusCaepio

Any chance you'll try the snapshot packages?

dnsmichi avatar Sep 27 '18 08:09 dnsmichi

Unfortunately not right at the moment, as I don't have an identical dev cluster right now. But if I can help with any further info (total checks, plugins, etc.), I would love to do it :)

MarcusCaepio avatar Sep 27 '18 09:09 MarcusCaepio

I believe that the load is caused by the reconnect timer, or many incoming connections with many separate threads being spawned. A full analysis is available in #6517.

dnsmichi avatar Oct 10 '18 08:10 dnsmichi

Still present in 2.10. Master on reload: [Screenshot]

Satellites on reload: [Screenshots]

MarcusCaepio avatar Oct 15 '18 09:10 MarcusCaepio

OK, then @Thomas-Gelf was right about the scheduler. I was just guessing from the recent changes, and it is good to know that the possible areas have been narrowed down with a recent version, thanks.

dnsmichi avatar Oct 15 '18 09:10 dnsmichi

Will this issue get a priority for the next release?

MarcusCaepio avatar Oct 30 '18 15:10 MarcusCaepio

ref/NC/601009

lippserd avatar Jul 01 '19 12:07 lippserd

I can confirm this in r2.11.2-1.

Master 1: [Screenshots]

Master 2 (no idea why the first reload had no impact on master 2): [Screenshot]

It stacks up if I reload in very short intervals.

Some satellites (pretty sure they haven't had a config change): [Screenshot]

NeverUsedID avatar Jan 03 '20 14:01 NeverUsedID

In our setup we are experiencing the same issue. From our observations, the load goes up to double the number of CPU cores we have (for 8 cores, it goes to 16). Not sure if this is the case for everyone.

Is there any ETA for when this will get fixed? If there is any temporary way to avoid this issue, please let me know.

samitks avatar Jan 17 '20 16:01 samitks

There's currently no ETA since all resources are bound to the JSON-RPC crash debugging, which is holding up the 2.12 and IcingaDB releases.

dnsmichi avatar Jan 17 '20 18:01 dnsmichi