Fault Management (Analysis and Handling)
Basic Information (context)
Any failure or error impacting a system/chassis or a sub-system is regarded as a fault. Faults are broadly classified into SW (Software) and HW (Hardware) faults:
- SW faults are those that can occur during SW processing of a workflow at the process/sub-system level or at the system level
- HW faults are those that can occur during SW or HW processing of a workflow at the HW (board) level, e.g. in a HW component or device

They may occur at any of the following stages of the system's functioning:
- system configuration, bring-up
- feature enablement/configuration
- during steady state
- feature disablement/unconfiguration
- while going-down (config reload, reboot etc.)
Present State
In SONiC, a fault is represented as an Event or an Alarm. SONiC has an Event Framework HLD that lets an event detector publish its events to the eventD redisDB. However, there is no Fault Manager/Handler that can take the needed, platform-specified action(s) to recover the system from the generated fault.
Need for this feature
This feature aims to add a generic FM (Fault Management) infrastructure which can do the following:
- Abstract the platform/HWSKU nuances from an open source NOS (i.e. SONiC) by publishing a platform-specific 'Fault-Action Policy table'
- Fetch events (alarms/faults) from eventD (based on the published YANG/schema)
- Analyze them (in a generic way) against the above-mentioned Policy Table
- Take action based on the lookup/match in the Policy Table. The action can be either generic or platform-specific
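The fetch/analyze/act flow in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the design from the HLD: the policy table layout, the event fields (`source`, `event_id`, `severity`), the action names, and the `log-only` fallback are all hypothetical, and events are passed in as plain dictionaries instead of being read from eventD.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical policy table: (event source, event id) -> action name.
# A real platform would supply this per HWSKU; these entries are made up.
POLICY_TABLE: Dict[Tuple[str, str], str] = {
    ("pmon", "fan-failure"): "log-and-alarm",
    ("syncd", "asic-parity-error"): "restart-service",
}

# Assumed fallback when no policy entry matches.
DEFAULT_ACTION = "log-only"


@dataclass
class Fault:
    source: str
    event_id: str
    severity: str


def lookup_action(fault: Fault,
                  table: Dict[Tuple[str, str], str] = POLICY_TABLE) -> str:
    """Analyze a fault against the policy table and return an action name."""
    return table.get((fault.source, fault.event_id), DEFAULT_ACTION)


def handle_faults(events: Iterable[dict],
                  handlers: Dict[str, Callable[[Fault], None]]) -> List[str]:
    """Look up each event in the policy table and run the matching handler.

    Returns the list of action names chosen, in event order.
    """
    chosen = []
    for ev in events:
        fault = Fault(ev["source"], ev["event_id"], ev.get("severity", "unknown"))
        action = lookup_action(fault)
        handler = handlers.get(action)
        if handler is not None:  # actions without a registered handler are skipped
            handler(fault)
        chosen.append(action)
    return chosen
```

Keeping the table as plain data and the handlers as a separate mapping is one way to let the generic code stay unchanged while a platform swaps in its own table and its own platform-specific actions.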
Benefits
The platform-supplied 'Fault-Action Policy table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover from the fault. It can either go with the recommended action (provided by the fault source/detector) or override it with a system-level one.
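As a rough sketch of that choice between the detector's recommended action and a system-level override, assuming a hypothetical override map keyed by event id (the names `PLATFORM_OVERRIDES` and `resolve_action`, the action strings, and the `log-only` default are illustrative, not taken from the HLD):

```python
from typing import Dict, Optional

# Hypothetical platform overrides keyed by event id. An absent key means
# the platform has no system-level opinion for that fault.
PLATFORM_OVERRIDES: Dict[str, str] = {
    "asic-parity-error": "reboot-linecard",
}


def resolve_action(event_id: str,
                   recommended: Optional[str],
                   overrides: Dict[str, str] = PLATFORM_OVERRIDES,
                   default: str = "log-only") -> str:
    """Prefer the platform override, then the detector's recommendation."""
    if event_id in overrides:
        return overrides[event_id]
    if recommended:
        return recommended
    return default
```

For example, `resolve_action("asic-parity-error", "restart-service")` would pick the platform's override, while an event with no override entry falls back to whatever the detector recommended.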
HLD (md) PR: https://github.com/sonic-net/SONiC/pull/1527
Dell registered as the reviewer.
Community review recording https://zoom.us/rec/share/G4YPod_DoyMGGc8RG-A6jakAEOwR4INXe8pfG5IXrDKZS5ozbyghJyXASgEthkZq.8EmVrFbqX2t3qan3
@venkatmahalingam can you please let me know the github id of other reviewers from Dell? Thanks.
@bhaveshdell @rathnasabapathyv @prvattem