Pi.Alert [Feature Request] scan “retry” until mark as device down

Is there an option to set how many requests are sent before a device is marked as down?

In previous fork there is this option:

Scan Cycle: Select the scan cycle: 0, 1', 15'
Some devices do not respond to all ARP packets; in these cases it is better to use a 15' cycle.
For Apple devices I recommend using 15' cycle

Now there is only 0 and 1'. Has the option been moved somewhere, or is there a different implementation for this feature?

thanks

Aug 25 '24 02:08 huuscript

+1, I would like this feature as well, some of my devices are unreliable and I get many false positives (down alerts) with them.

Thanks!

Aug 26 '24 13:08 stanelie

I can understand the request, but I would also like to point out the problems that are not seen here:

More retries mean significantly more time for the scan. What does an additional longer scan mean, as in the original project?

There are certainly several scans running at the same time (short scan, long scan)
WIFI-IoT devices or other “weak” clients that are connected to WIFI will have significant connection problems or even disconnections during the scans. This is not an assumption, but has been tested with Raspberry Zero and 2b via WIFI. A scan can therefore even be the cause if a device goes down.
If these scans run in parallel, they may slow each other down and therefore run in parallel for even longer and in increasing numbers
The database can be blocked if several scans try to write their results
The front end can become unusable if the arp scans build up because the database is no longer available for access

Who then has to deal with bug reports and troubleshooting messages? No, there will be no additional cycle. What I might still agree to would be a self-configurable “retry” parameter.

Aug 26 '24 20:08 leiweibau

Ah yes, I agree. Now I understand why you removed the 15' scan. We need something like a setting that allows more “retries” before a device is reported as down. I think retries are more useful than the scan interval anyway, since no matter which interval we set, there is always a chance of pinging right when the device is unresponsive. So let's change the heading to “retry” instead of interval. Thanks for considering it.

Aug 26 '24 21:08 huuscript

In my case, there is no need for more scans running in parallel. All that is needed is a counter of failed attempts per device. Each time a scan runs and the device does not respond, the counter is incremented. Whenever the counter reaches the "allow x failed attempts" setting of the devices, a notification is sent. I accept that if the device really fails, the notification will be delayed by x times the scan interval.

Aug 27 '24 12:08 stanelie

I have the impression that there is a lack of clarity about how it works.

Once again for explanation: The arp-scan tool sends an ARP request to each IP of the local network, to which the corresponding device behind the IP responds with its MAC address. If the request times out, the request is sent again (retry). If there is no response after the defined number of retries, there is no host behind the IP as far as arp-scan is concerned. At the end, arp-scan displays a list of the MAC addresses and IPs that have responded. This list is compared with the database. All devices found by arp-scan are online, the others are offline.
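The comparison step described above can be sketched in Python as a simple set operation. This is only an illustration and not Pi.Alert's actual code: the function name `classify_devices` and the sample MAC addresses are made up for the example.

```python
def classify_devices(known_macs, responded_macs):
    """Split the known devices into online/offline based on one scan result.

    known_macs:     MAC addresses stored in the database (hypothetical input)
    responded_macs: MAC addresses that answered during the scan
    """
    # Normalize case so "AA:BB" and "aa:bb" are treated as the same device.
    known = {mac.lower() for mac in known_macs}
    responded = {mac.lower() for mac in responded_macs}

    online = known & responded   # known devices that answered
    offline = known - responded  # known devices that stayed silent
    return online, offline


known = ["AA:BB:CC:00:00:01", "AA:BB:CC:00:00:02", "AA:BB:CC:00:00:03"]
responded = ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:03"]
online, offline = classify_devices(known, responded)
print(sorted(online))
print(sorted(offline))
```

Note that the scan result carries no per-device retry count in this model: a device either appears in the list of responders or it does not.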

Whenever the counter reaches the "allow x failed attempts" setting of the devices, a notification is sent.

This request makes no sense in view of how arp-scan works. Firstly, the tool does not indicate how many retries it has used per device and secondly, you end up with the message for the corresponding event for the device.

Aug 27 '24 13:08 leiweibau

Hum... Maybe I was not clear.

Let's say I've set a device to allow for one failure before saying the device is offline. Regardless of the number of retries arp-scan makes during a single scan cycle, if the device fails to respond during this scan cycle, note that it has failed once but leave its status as "online", and do not send the alert yet. Then, at the next arp-scan cycle, if it fails again, and only then, send an alert. If it is back online, reset the "failed attempts" counter and leave it as "online".

Does this make sense?

Aug 27 '24 14:08 stanelie

@huuscript Do you and @stanelie mean the same thing, or are you talking about different things?

Aug 27 '24 18:08 leiweibau

Yes, same thing. No need to change the scan cycle, just when a device is marked as down. Let's say we have 2 scenarios:

1 1 1 1 1 o 1 1 1 1 o 1 1 1 o 1 o 1 1 1 1 o 1
1 1 1 1 1 1 1 1 1 1 o o o o o o o o o o o o o
                        ^

The first scenario would produce 5 offline notifications and the second scenario would produce one. If we could adjust the number of “failed attempts” (let's say 3, marked with ^ in the example above), then the first scenario would never show as “offline”, and the second scenario would show as offline at the third failed attempt.
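To make the two scenarios concrete, here is a small Python sketch that counts how many down notifications each scan history would trigger for a given number of allowed failed attempts. The function name `count_down_alerts` is hypothetical, and the split of the sequence into two rows of equal length is an assumption about how the example was originally laid out.

```python
def count_down_alerts(history, failed_attempts):
    """Count down notifications for a scan history.

    history:         string of '1' (responded) and 'o' (no response),
                     separated by spaces
    failed_attempts: consecutive misses required before the device
                     is reported as down
    """
    alerts = 0
    misses = 0
    for result in history.split():
        if result == "1":
            misses = 0  # device answered, reset the counter
        else:
            misses += 1
            # Alert exactly once, at the moment the threshold is reached.
            if misses == failed_attempts:
                alerts += 1
    return alerts


# Assumed split of the example into two scenarios.
scenario_1 = "1 1 1 1 1 o 1 1 1 1 o 1 1 1 o 1 o 1 1 1 1 o 1"
scenario_2 = "1 1 1 1 1 1 1 1 1 1 o o o o o o o o o o o o o"

for threshold in (1, 3):
    print(threshold,
          count_down_alerts(scenario_1, threshold),
          count_down_alerts(scenario_2, threshold))
```

With a threshold of 1 (today's behavior) the first scenario raises 5 alerts and the second raises 1. With a threshold of 3 the first scenario raises none and the second still raises 1, at the third missed scan.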

Aug 27 '24 22:08 huuscript

Not sure what the best approach is but I'd also like to see some (configurable) behavior for offline notifications.

My Shellys in particular are unreliable, so I turned off the down notification for them. Usually, they work just fine. But a few days ago I realized one of them had been offline for a few weeks (it is rarely used).

In my case I'd like to turn on the down notification, but I want to be notified only if the device is down for e.g. 3 scans in a row. What usually happens is that a device is down and then up again at the next scan, and this can happen a dozen times per day.

So an additional counter in the DB might help with that? Don't know ...

Sep 09 '24 09:09 upD8R

Even though it may seem like I've forgotten about the topic, I'm still looking into it, but I'm still struggling with the approach. 😔

Nov 27 '24 09:11 leiweibau

I have found 2 possible ways of handling the situation.

Pi.Alert records every online or offline status of an affected device, but suppresses the notification until a threshold value is reached. This can be done globally (easy) or per device (hard).
Pi.Alert simply ignores a down event as long as the number of consecutive events remains below a threshold value. Until then, the device remains online as far as Pi.Alert is concerned. This can also be done globally (easy) or per device (hard).

At the moment I am thinking about the 2nd variant, implemented per device (hard). With this approach, I don't have to play around in the notification workflow.

Feb 06 '25 19:02 leiweibau

Thank you for looking into this. I think it's important to have this per device, as some devices need more urgent attention and are generally very stable (servers, NASes), while others are less urgent but do go down intermittently (IoT devices, smart appliances).

Feb 06 '25 23:02 huuscript

Brief interim report:

I call this feature “Scan Validation”. If “Scan Validation” is set to “1”, the process is as follows:

The 1st scan determines that the device is no longer online, but leaves the status as “online”. However, this “online” has a small marker in the form of “online*”.
With the 2nd scan in a row in which the device is offline, the device is marked as “offline” or “down”.
The default value for “Scan Validation” is “0”.
If you change the value to 2, for example, it takes a total of 3 scans (the initial scan plus the 2 validations) until the device is recognized as offline, provided that the device does not come back online in between.
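The behavior described above can be modeled with a small per-device counter. The following Python sketch only illustrates the described logic and is not taken from the Pi.Alert source: the `Device` class, its field names, and the `apply_scan` function are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    scan_validation: int = 0  # extra scans to wait; 0 = mark down immediately
    missed_scans: int = 0     # consecutive scans without a response
    status: str = "online"


def apply_scan(device, responded):
    """Update the device status after one scan and return the new status."""
    if responded:
        # Any response resets the validation phase.
        device.missed_scans = 0
        device.status = "online"
    else:
        device.missed_scans += 1
        if device.missed_scans > device.scan_validation:
            device.status = "offline"
        else:
            # Still within the validation phase: keep it online, but marked.
            device.status = "online*"
    return device.status


plug = Device("plug", scan_validation=2)
results = [True, False, False, False, True]
statuses = [apply_scan(plug, responded) for responded in results]
print(statuses)
```

With “Scan Validation” set to 2, the three missed scans in a row yield “online*”, “online*” and then “offline”, and the final successful scan returns the device to “online”. With the default of 0, the first missed scan marks the device as offline right away.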

The pictures use the German language file, but everything should be self-explanatory.

Feb 08 '25 20:02 leiweibau

Looks pretty cool, thanks for your support!

Feb 08 '25 20:02 upD8R

Yes, looking good! Thank you for your time.

Feb 08 '25 21:02 huuscript

Looks great!

Feb 11 '25 09:02 hspindel

I'm still testing the feature a bit, but everything looks pretty good at the moment. Due to the functionality that "shaky" devices are still marked as "online" during the validation phase, the diagram for the activity is smoothed out considerably. With the next release, I will initially only make the feature available for the main scan and extend it to the ICMP monitor with the following release.

I'm sorry that the implementation took so long, but the concern that I would have to change the notification workflow really put me off. The current approach is relatively easy to handle and less complicated.

Feb 11 '25 11:02 leiweibau

Update released with https://github.com/leiweibau/Pi.Alert/commit/b4ece9e550ba1fe62d137389ffbba5ce5ef84e8f

Feb 18 '25 15:02 leiweibau