Add support for Juniper CHASSIS and SYSTEM alerts

Open lunkwill42 opened this issue 2 years ago • 12 comments

Juniper devices have a concept of alerts, classified into Chassis alerts and system alerts.

Their SNMP MIBs support fetching a number of current alerts, but no details of what the alerts actually are.

The CNaaS team wants NAV to be able to report that alerts have been flagged, but some design work is needed to figure out how this should work in NAV.

Also needed:

  • [ ] #2432

lunkwill42 avatar Mar 11 '22 09:03 lunkwill42

@lunkwill42 I cannot find the word "alert" in the juniper mibs, do you mean notification, trap or alarm? If not, where is this described/documented?

hmpf avatar Mar 15 '22 13:03 hmpf

@hmpf I believe you want the JUNIPER-ALARM-MIB. It simply enumerates the number of present "red" alarms and "yellow" alarms in a device. There's a copy of it here: https://github.com/pgmillon/observium/blob/master/mibs/juniper/JUNIPER-ALARM-MIB

You may want to talk to Håvard E for some guidance, as Zino actually employs this MIB (and potentially others)...

lunkwill42 avatar Mar 16 '22 12:03 lunkwill42

Seems like step one is adding that mib to NAV's own library of mibs then :) Is there a howto for that?

hmpf avatar Mar 16 '22 13:03 hmpf

According to that MIB there are yellow alarms and red alarms, a count for each, whether the status is on, off or other, and a timestamp for when the status last changed. It might be relevant to mark whether these alarms are enabled or not as well (via jnxAlarmRelayMode: other, passOn, cutOff).

hmpf avatar Mar 16 '22 14:03 hmpf

> Seems like step one is adding that mib to NAV's own library of mibs then :) Is there a howto for that?

Closest thing we have is this: https://nav.readthedocs.io/en/latest/hacking/adding-environment-probe-support.html?highlight=smidump#dumping-the-mib

lunkwill42 avatar Mar 17 '22 07:03 lunkwill42

I've asked Håvard E. about what zino does about this.

The JUNIPER-ALARM-MIB only exists on equipment that has a physical craft interface installed, supplying an actual LED to show the colors, and buttons to turn monitoring on or off for various subsystems. Older equipment generally has this interface, but at least the MX204 and MX10003 do not. There are still counters for these alarms on such devices though, they're just not officially accessible via SNMP.

A workaround is to have a script on the equipment that periodically reads the values (show system alarms) and makes them available via a different MIB, the Utility MIB (mib-jnx-util).

gw> show system alarms | display xml 
<rpc-reply xmlns:junos="http://xml.juniper.net/junos/19.4R0/junos">
    <alarm-information xmlns="http://xml.juniper.net/junos/19.4R0/junos-alarm">
        <alarm-summary>
            <no-active-alarms/>
        </alarm-summary>
    </alarm-information>
    <cli>
        <banner>{master}</banner>
    </cli>
</rpc-reply>

{master}
gw>
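
For a script that consumes this output, a minimal sketch of detecting the "no active alarms" case could look like the following. It assumes the input is only the XML document (the trailing `{master}` and `gw>` prompt lines stripped), and it matches elements by local name because the namespace URI embeds the Junos version. The function names are made up for illustration; how the reply looks when alarms are active is not shown in this thread, so that case is deliberately left out.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def has_no_active_alarms(xml_text: str) -> bool:
    """Return True if the reply contains a <no-active-alarms/> element.

    Elements are matched on local name only, since the namespace URI
    contains the Junos version (19.4R0 in the sample) and will vary.
    """
    root = ET.fromstring(xml_text)
    return any(local_name(el.tag) == "no-active-alarms" for el in root.iter())


# The reply from the sample above, without the CLI prompt lines.
SAMPLE = """<rpc-reply xmlns:junos="http://xml.juniper.net/junos/19.4R0/junos">
    <alarm-information xmlns="http://xml.juniper.net/junos/19.4R0/junos-alarm">
        <alarm-summary>
            <no-active-alarms/>
        </alarm-summary>
    </alarm-information>
    <cli>
        <banner>{master}</banner>
    </cli>
</rpc-reply>"""

print(has_no_active_alarms(SAMPLE))  # True
```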

hmpf avatar Mar 17 '22 12:03 hmpf

When using the mib-jnx-util, zino uses the OIDs jnxUtilUintValue.82.101.100.65.108.97.114.109 for the red alarm counter and jnxUtilUintValue.89.101.108.108.111.119.65.108.97.114.109 for the yellow alarm counter.
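
The numeric suffixes in those OIDs are just the ASCII codes of the instance names, `RedAlarm` and `YellowAlarm`. A small sketch to convert both ways (the helper names are made up for illustration and are not part of NAV or Zino):

```python
def string_to_index(name: str) -> str:
    """Encode a name as a dotted sequence of ASCII codes."""
    return ".".join(str(ord(char)) for char in name)


def index_to_string(suffix: str) -> str:
    """Decode a dotted sequence of ASCII codes back into a name."""
    return "".join(chr(int(part)) for part in suffix.split("."))


for name in ("RedAlarm", "YellowAlarm"):
    print(f"jnxUtilUintValue.{string_to_index(name)}")

print(index_to_string("82.101.100.65.108.97.114.109"))  # RedAlarm
```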

hmpf avatar Mar 17 '22 12:03 hmpf

The steps so far seem to be:

  • [x] Figure out how NAV should deal with this type of value/alarm
  • [x] Add the JUNIPER-ALARM-MIB Docs
  • [ ] Use the JUNIPER-ALARM-MIB if it is available on the equipment polled
  • [ ] Figure out some way to get the alarms from equipment that doesn't support that MIB, whether via mib-jnx-util or something else.

hmpf avatar Mar 17 '22 12:03 hmpf

#2368 adds the necessary documentation for some of the command line utilities that are useful for testing SNMP OID compatibility with NAV...

lunkwill42 avatar Mar 24 '22 14:03 lunkwill42

I've decided on making a new ipdevpoll plugin just for these weirdos. It does not store the values, it just dumps them into the eventengine if they are not zero.

hmpf avatar Apr 04 '22 10:04 hmpf

When converting a MIB to Python with smidump and the -k flag, also raise the error level above 3 with the -l flag:

smidump -k -l 5 -f python  ./A.mib > A.py

If it complains "failed to locate MIB module foo", get the missing MIBs and preload them with the -p flag:

smidump -k -l 5 -f python  -p ./foo.mib ./A.mib > A.py

hmpf avatar Apr 06 '22 08:04 hmpf

After a design discussion with @knutvi, we have more or less settled on a way forward.

For the CNaaS team, it's important to know at any given time what the current number of yellow or red alerts in a Juniper chassis is.

I offered up some interpretations of this, and after some discussion, we decided that the cleanest implementation in a NAV context would be this design:

  1. When ipdevpoll detects that Juniper device D has >0 yellow alerts, it should post a start-state event to this effect, and include the alert count as a variable in the event varmap (the list of arbitrary attributes that can be attached to an event). Likewise, when it sees that Juniper device D has =0 yellow alerts, it should post a corresponding end-state event. This is close to what #2388 is currently homing in on.
  2. How to actually respond to these events is up to the eventengine, and a plugin P to handle these events needs to be written.
  3. When P receives a start event that is not deemed to be a duplicate, it should post a corresponding alert, and copy the alert count variable into the generated AlertHistory record.
  4. When P receives a start event for D that is deemed a duplicate, it should do some further verification: If the seemingly duplicate event has a different alert count than the existing AlertHistory entry, it should:
    1. Close the existing AlertHistory entry, but suppress a regular end-alert from being sent (or the end-user may end up receiving two notifications about the transition, rather than just one)
    2. Post a new pair of Alert and AlertHistory records with the new alert count.
  5. When P receives an end-event that is deemed to match an open AlertHistory record, that record should be closed.

This means that every change in the alert count that does not transition to a count of 0 should cause an entirely new AlertHistory state for D to be created, while any old ones are resolved. Any change in the alert count to 0 should resolve any existing corresponding AlertHistory states.
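
To make the state transitions above concrete, here is a rough, self-contained simulation of the design. It deliberately does not use NAV's real event or AlertHistory APIs: `HistoryEntry`, `AlarmCountHandler`, `event_for_count` and the notification tuples are all hypothetical stand-ins, and only one alarm color per device is modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class HistoryEntry:
    """Hypothetical stand-in for an AlertHistory record."""
    device: str
    count: int
    is_open: bool = True


def event_for_count(count: int) -> str:
    """Step 1 (ipdevpoll side): start-state if count > 0, else end-state."""
    return "start" if count > 0 else "end"


@dataclass
class AlarmCountHandler:
    """Hypothetical stand-in for eventengine plugin P (steps 2 to 5)."""
    history: List[HistoryEntry] = field(default_factory=list)
    notifications: List[Tuple[str, str, int]] = field(default_factory=list)

    def _open_entry(self, device: str) -> Optional[HistoryEntry]:
        for entry in self.history:
            if entry.device == device and entry.is_open:
                return entry
        return None

    def _post_alert(self, device: str, count: int) -> None:
        # Steps 3 and 4.2: new history entry plus a start notification.
        self.history.append(HistoryEntry(device, count))
        self.notifications.append(("start", device, count))

    def handle(self, device: str, count: int) -> None:
        existing = self._open_entry(device)
        if event_for_count(count) == "start":
            if existing is None:
                self._post_alert(device, count)
            elif existing.count != count:
                # Step 4.1: close silently, no end notification is sent.
                existing.is_open = False
                self._post_alert(device, count)
            # Same count as the open entry: a true duplicate, ignore it.
        elif existing is not None:
            # Step 5: close the matching entry and notify about the end.
            existing.is_open = False
            self.notifications.append(("end", device, 0))


handler = AlarmCountHandler()
for polled_count in (2, 2, 3, 0):
    handler.handle("D", polled_count)

print(handler.notifications)
print([(entry.count, entry.is_open) for entry in handler.history])
```

In this run the sequence 2, 2, 3, 0 yields three notifications (start with count 2, start with count 3, end) and two closed history entries, so the user is notified once per transition rather than twice.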

lunkwill42 avatar Jun 17 '22 13:06 lunkwill42