[➕ Feature]: Physical Topology
Is your feature request related to a problem? Please describe.
Our primary use case for Keep is device alert correlation on a modestly sized physical network, mostly routers, switches, etc. We employ multiple types of monitoring tools, both event-based (e.g. syslog/Graylog) and polling-based (Nagios, SNMP, etc.). I would like Keep to use knowledge from NetBox to enrich alerts with metadata, represent relations in topology views, and provide context to AI for advanced correlation. Reducing alert noise from multiple monitoring tools during larger incidents (e.g. rack power loss), specifically for polling alerts, has been difficult. For example, a Nagios config might have generated "monitor services" for a device and each of its ports/interfaces, each checked every 5 minutes. If a rack loses power (rare, but it has happened), this might trigger hundreds of Nagios alerts, one for every device and monitored interface in the rack. Correlating these into a single incident would be ideal.
Describe the solution you'd like
I would like Keep to use knowledge from NetBox to enrich alerts with metadata, represent relations in topology views, apply topology correlation, and provide context to AI for advanced correlation. The primary goal is topology correlation.
We can currently enrich alerts with workflows to obtain information from NetBox. This is very useful.
Topology Correlation, according to the docs, "helps correlate alerts based on your infrastructure’s topology" and "It automatically analyzes incoming alerts and their relationship to your infrastructure topology, creating incidents when multiple related services or components of an application are affected". I'd like topology correlation to create incidents when multiple related NetBox DCIM objects are affected AND provide node graphs in the topology viewer for alerts/incidents. For example, if multiple alerts (Nagios, Graylog, etc.) are received for a specific device and several of its interfaces in a short timeframe, correlate them into a single incident. Or, if many device and device-interface alerts are received that correspond to a shared rack, correlate those. Likewise for devices/interfaces under a shared site/location.
Having the NetBox plugin (or a more advanced one) feed the Topology Processor would be highly valuable for correlating events, especially polling-based alerts. Additionally, representing this visually in a topology viewer would be highly insightful for engineers/operators. For example: rack-x: [device-a...device-n]; device-a-interface-x -> line-axby -> device-b-interface-y
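To make the rack/line example above concrete, here is a minimal Python sketch of the data a topology viewer or correlator might consume. It is only an illustration: the `Link` class, the `build_topology` function, and the dictionary shapes are hypothetical and are not part of Keep or of the NetBox API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One line/cable between two device interfaces (hypothetical shape)."""
    line: str
    a_device: str
    a_interface: str
    b_device: str
    b_interface: str


def build_topology(racks, links):
    """Return (device -> rack, device -> set of directly linked devices)."""
    device_rack = {dev: rack for rack, devs in racks.items() for dev in devs}
    neighbors = defaultdict(set)
    for link in links:
        # Links are treated as undirected: each end sees the other.
        neighbors[link.a_device].add(link.b_device)
        neighbors[link.b_device].add(link.a_device)
    return device_rack, dict(neighbors)


# Example data mirroring the notation used in this request.
racks = {"rack-x": ["device-a", "device-b"]}
links = [Link("line-axby", "device-a", "interface-x", "device-b", "interface-y")]

device_rack, neighbors = build_topology(racks, links)
```

With a structure like this, an alert on `device-a` could be related both to its rack (`device_rack["device-a"]`) and to the devices on the other end of its links (`neighbors["device-a"]`).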
I could generate CSV or YAML to represent this, but our infrastructure and NetBox data change frequently, so a live connection would be preferred. I don't think Keep's "Service Topology" is the best fit for this. Perhaps an additional "Infrastructure Topology", or the ability to create multiple categorized topologies, would work better. NetBox has extensive APIs, so a more advanced Keep-NetBox provider seems appropriate, but I'd assume it would require some additional evolution of the Topology modules within Keep.
If AI could leverage the alert info/history and NetBox topology data, perhaps it could further enhance the topology view (upstream/downstream devices/links) and the assessment of incident severity/impact.
Describe alternatives you've considered
For the noise reduction, enrich all device/interface related alerts with NetBox data for site, location, rack, device-type, device-role, device, device-interface, and line where applicable. Then try to create manual correlation rule(s) that are time-limited (e.g. 5 min) and group by site, location, rack, and device successively.
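As a rough illustration of this alternative, the sketch below groups already-enriched alerts within a 5-minute window and labels each group with the most specific scope its alerts share. It is not Keep's rule engine: the alert dictionaries, the field names (`site`, `location`, `rack`, `device`, `time`), and the functions `common_scope` and `correlate` are all assumptions made for the example.

```python
from datetime import datetime, timedelta

LEVELS = ("device", "rack", "location", "site")  # most to least specific
WINDOW = timedelta(minutes=5)


def common_scope(alerts):
    """Return the most specific (level, value) shared by every alert, or None."""
    for level in LEVELS:
        values = {a.get(level) for a in alerts}
        if len(values) == 1 and None not in values:
            return level, values.pop()
    return None


def correlate(alerts):
    """Greedy grouping: an alert joins the first incident whose first alert
    is at most WINDOW older and with which it still shares some scope."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        for incident in incidents:
            in_window = alert["time"] - incident[0]["time"] <= WINDOW
            if in_window and common_scope(incident + [alert]) is not None:
                incident.append(alert)
                break
        else:
            incidents.append([alert])
    return [(common_scope(incident), incident) for incident in incidents]


# Example: three alerts in rack-x within minutes, one much later.
t0 = datetime(2024, 1, 1, 12, 0)
base = {"site": "site-1", "location": "room-1", "rack": "rack-x"}
alerts = [
    {**base, "device": "device-a", "time": t0},
    {**base, "device": "device-a", "time": t0 + timedelta(minutes=1)},
    {**base, "device": "device-b", "time": t0 + timedelta(minutes=2)},
    {**base, "device": "device-a", "time": t0 + timedelta(minutes=30)},
]
result = correlate(alerts)
```

Here the first three alerts collapse into one rack-level incident and the late alert starts a separate device-level one. The greedy approach will widen an incident all the way to site level if unrelated alerts arrive together, which is one reason a topology-aware correlator would be preferable to hand-written rules.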
Additional context
+1
+1, would be very helpful.