simplemonitor
False Positive
Hi, for a few months I've been using SimpleMonitor to monitor 150+ hosts from a Windows Server. It's a very basic setup: a ping every minute, plus Pushover notifications and an HTML status page:
[HostName]
type=ping
host=172.x.y.z
tolerance=5
It works fine, but I've noticed that when a host is down for a long time, another one is often reported as going up and down every 10 to 15 minutes, even though no packets were actually lost (I checked by pinging directly from the command line). The false positive seems to affect the host defined immediately before the one that is really down in the configuration file. For example:
#Host reported flapping even if UP
[Host-A]
type=ping
host=172.x.y.z
tolerance=5
#Host DOWN for a long time
[Host-B]
type=ping
host=172.x.y.z
tolerance=5
If I comment out the Host-B configuration, the problem disappears. My Python knowledge is very limited, so I didn't go through the code to find where the problem could be.
Thanks
Interesting; could you let me know what version you're using (and which Python version)?
Is it always the host above the failed one which flaps? Any feel for roughly how long "Host-B" would need to be down for the problem to manifest?
Hi, I'm using Python 3.9.6 on Windows Server 2016 standard. SimpleMonitor is 1.11.0
Is it always the host above the failed one which flaps? I think so.
Any feel for roughly how long "Host-B" would need to be down for the problem to manifest? It looks random. The up/down timing also varies: it could be 15 minutes, then 2 minutes.
Thanks for the info, I'll have a go at reproducing it. Hope the workaround of disabling/removing the long-term down host is ok for you for now.
Hi, today I noticed another issue, probably related to the same bug. About 5 of the 150+ monitored hosts report a ping time of 0.000ms, which is impossible because they are hundreds of km away. As an example, here are the logs for hosts at the same location:
2022-02-03 15:01:38+01:00 FR-Saint-Denis-VRRP: ok (0.000s) (Ping time 15.584ms)
2022-02-03 15:01:38+01:00 FR-Saint-Denis-LAN1: ok (0.000s) (Ping time 0.000ms)
2022-02-03 15:01:38+01:00 FR-Saint-Denis-LAN2: ok (0.000s) (Ping time 15.616ms)
2022-02-03 15:01:38+01:00 FR-Saint-Denis-L3: ok (0.000s) (Ping time 15.626ms)
2022-02-03 15:01:38+01:00 FR-Saint-Denis-LB1: ok (0.000s) (Ping time 0.000ms)
2022-02-03 15:01:38+01:00 FR-Saint-Denis-LB2: ok (0.000s) (Ping time 15.621ms)
I suspect it's a bug in ping3, maybe because I'm pinging 150+ hosts every minute and the time between pings is too short.
How do you manage that?
Agreed, that is odd. Not sure what's going on there, but if it's legit I want that network :)
Is it always those hosts?
Could you maybe try changing them to the host monitor? This is the original one for pinging hosts and works by actually running ping rather than being implemented in Python.
I've replaced ping with host and everything seems to work as expected, even when a host has been down for an hour. One thing to report: no round-trip details appear on the HTML page or in the logs. I haven't set ping_regexp or time_regexp (so the automatic defaults apply).
FYI the output of the ping command on the server is:
C:\>ping -n 1 -w 1000 172.20.51.1
Pinging 172.20.51.1 with 32 bytes of data:
Reply from 172.20.51.1: bytes=32 time=26ms TTL=56
Ping statistics for 172.20.51.1:
Packets: Sent = 1, Received = 1, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 26ms, Maximum = 26ms, Average = 26ms
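For anyone who wants to check what a success pattern and a time pattern would need to match in this output, here is a minimal Python sketch. The two patterns and the `parse_ping` helper are hypothetical and written only for illustration; they are not SimpleMonitor's built-in defaults for ping_regexp or time_regexp.

```python
import re

# Sample output copied from the ping command above.
PING_OUTPUT = """\
Pinging 172.20.51.1 with 32 bytes of data:
Reply from 172.20.51.1: bytes=32 time=26ms TTL=56
Ping statistics for 172.20.51.1:
Packets: Sent = 1, Received = 1, Lost = 0 (0% loss),
"""

# Hypothetical patterns for illustration only (assumptions, not the
# defaults shipped with SimpleMonitor).
SUCCESS_RE = re.compile(r"Received = [1-9]")
TIME_RE = re.compile(r"time[=<](\d+)ms")  # Windows prints "time<1ms" for fast replies


def parse_ping(output):
    """Return (reachable, round_trip_ms); round_trip_ms is None if not found."""
    reachable = SUCCESS_RE.search(output) is not None
    match = TIME_RE.search(output)
    rtt = int(match.group(1)) if match else None
    return reachable, rtt


print(parse_ping(PING_OUTPUT))
```

If a pattern like `TIME_RE` finds nothing in the real output, the round-trip time would come back empty, which is one possible reason for details missing from the status page.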
Glad that's fixed the weird behaviour for those hosts. It should include the ping time in the detail field; I'll take a look to see if I can see why it isn't.
It turns out there is no difference between using ping and host. The temporary solution is to disable multithreading with -j 1.
Thanks for the update. I'm also seeing this with a couple of my monitors recently (I have some kit unplugged so it's definitely not going to be up, despite what SimpleMonitor is occasionally reporting ;)
Interesting to know disabling multithreading helps, I'll have a look upstream at the library I'm using for it to see if there's any fix.
That didn't take long to track down; the library has an issue with multithreading: https://github.com/kyan001/ping3/issues/26
I wonder if I can support both (multithreading and correct pings) by keeping all the ping monitors on one thread 🤔
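One way this idea could look, as a rough sketch rather than SimpleMonitor's actual code: instead of a dedicated thread, a shared lock ensures that only one ping runs at a time while other monitor types keep using the thread pool. Everything here (`fake_ping`, `run_check`, `PING_LOCK`, the host names) is hypothetical, and `fake_ping` stands in for the real ping call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# One lock shared by every ping-type check, so pings never overlap.
PING_LOCK = threading.Lock()

active_pings = 0      # pings currently in flight (demo bookkeeping only)
max_active_pings = 0  # highest ping concurrency observed


def fake_ping(host):
    """Hypothetical stand-in for a real ping; records how many run at once."""
    global active_pings, max_active_pings
    active_pings += 1
    max_active_pings = max(max_active_pings, active_pings)
    time.sleep(0.01)  # pretend to wait for the reply
    active_pings -= 1
    return True


def run_check(kind, host):
    """Run one check; ping checks are serialized, others are not."""
    if kind == "ping":
        with PING_LOCK:
            return fake_ping(host)
    time.sleep(0.01)  # other monitor types still run in parallel
    return True


checks = [("ping", f"host-{i}") for i in range(5)] + [("other", "svc")] * 5
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda c: run_check(*c), checks))

print(all(results), max_active_pings)
```

The trade-off is that 150+ ping monitors would then run strictly one after another, so a slow or timed-out ping delays the rest of the pings, though not the other monitor types.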