SONiC FM (Fault Manager) HLD
Code PRs (corresponding to this FM HLD PR):
- Fault Manager daemon (faultmgrd), sonic-platform-daemons repo: https://github.com/sonic-net/sonic-platform-daemons/pull/421
- Reboot, sonic-utilities repo: https://github.com/sonic-net/sonic-utilities/pull/3154
Basic Information (context)
Any failure or error impacting a system/chassis or a sub-system is regarded as a fault. Faults are broadly classified into SW (Software) and HW (Hardware) faults:
- SW faults are those that can occur during SW processing of a workflow at the process, sub-system, or system level
- HW faults are those that can occur during SW or HW processing of a workflow at the HW (board) level, e.g. in a HW component or device
Faults may occur at any of the following stages of the system's functioning:
- system configuration, bring-up
- feature enablement/configuration
- during steady state
- feature disablement/unconfiguration
- while going down (config reload, reboot, etc.)
Present State
In SONiC, a fault is represented via an event or an alarm. SONiC has the Event Framework HLD, which helps an event detector publish its events to the eventD redisDB. However, there is no Fault Manager/Handler that can take the needed/platform-specified action(s) to recover the system from the generated fault.
Need for this feature
This feature aims to add a generic FM (Fault Management) infrastructure that can do the following:
- Abstract the platform/HWSKU nuances from an open source NOS (i.e. SONiC) by publishing a platform-specific 'Fault-Action Policy table'
- Fetch events (alarms/faults) from eventD (based on the published YANG/schema)
- Analyze them (in a generic way) against the above-mentioned Policy Table
- Take action based on the lookup/match in the Policy Table. The action can be either generic or platform-specific
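The fetch/analyze/act flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the faultmgrd implementation: the table layout, the event fields (`source`, `tag`), the action names, and the handlers are all hypothetical, and the real schema is whatever the platform publishes.

```python
from typing import Callable, Dict, Tuple

# Hypothetical policy table: (event source, event tag) -> action name.
# A real table would be platform-supplied; these entries are illustrative only.
POLICY_TABLE: Dict[Tuple[str, str], str] = {
    ("example-platform", "over-temperature"): "reboot",
    ("example-platform", "fan-speed-low"): "syslog",
}

DEFAULT_ACTION = "syslog"  # assumed fallback when no policy entry matches


def lookup_action(event: dict) -> str:
    """Return the action the policy table names for this event."""
    key = (event.get("source", ""), event.get("tag", ""))
    return POLICY_TABLE.get(key, DEFAULT_ACTION)


def handle_event(event: dict, handlers: Dict[str, Callable[[dict], str]]) -> str:
    """Look up the action for an event and run the matching handler."""
    action = lookup_action(event)
    return handlers[action](event)


# Placeholder handlers that only describe what they would do.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "syslog": lambda ev: f"logged {ev['tag']}",
    "reboot": lambda ev: f"reboot requested for {ev['tag']}",
}
```

Keeping the table as plain data and the handlers as a separate mapping mirrors the split described above: the platform owns the policy, while the generic infrastructure owns the lookup and dispatch.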
Benefits
The platform-supplied 'Fault-Action Policy table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover from the fault. It can either go with the recommended action (provided by the fault source/detector) or override it with the system-level one.
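The choice between the detector's recommendation and a system-level override could look roughly like this. It is a sketch only: the `PLATFORM_OVERRIDES` mapping, the fault identifiers, and the action names are hypothetical, and the HLD does not specify this precedence rule in code form.

```python
from typing import Dict, Optional

# Hypothetical platform overrides: fault id -> system-level action.
PLATFORM_OVERRIDES: Dict[str, str] = {
    "fan-failure": "shutdown",
}


def resolve_action(fault_id: str, recommended: Optional[str]) -> Optional[str]:
    """Prefer the platform's system-level action when one is defined;
    otherwise keep the detector's recommendation (which may be None)."""
    return PLATFORM_OVERRIDES.get(fault_id, recommended)
```

For example, `resolve_action("fan-failure", "syslog")` yields the platform's override, while a fault with no override entry keeps whatever the detector recommended.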
Perhaps you can add special handling to avoid endless reboots and shutdowns. Is there an option to temporarily disable this feature or the action?
The 202405 release fork date is coming. Can you please accelerate the code PR review and merge the PR by the end of 5/30? Thanks.
@shyam77git can you please update the PR description with the list of the code PRs? Also, in order to approve the code PRs, we need a test plan review in the SONiC test group. Was that done? Can you share the test plan PR as well?
@liat-grozovik will help follow up with the reviewers. If there is no update, this will be deferred to a future release.
HLD is not approved, move to backlog