SONiC FM (Fault Manager) HLD

Open shyam77git opened this issue 1 year ago • 5 comments

Code PRs (corresponding to this FM HLD PR):

  • Fault Manager daemon (faultmgrd), sonic-platform-daemons repo: https://github.com/sonic-net/sonic-platform-daemons/pull/421
  • Reboot, sonic-utilities repo: https://github.com/sonic-net/sonic-utilities/pull/3154

Basic Information (context)

Any failure or error impacting a system/chassis or a sub-system is regarded as a fault. Faults are broadly classified into SW (software) and HW (hardware) faults:

  • SW faults are the ones that can occur during SW processing of a workflow at process/sub-system or a system level
  • HW faults are those that can occur during SW or HW processing of a workflow at the HW (board) level, e.g. in a HW component or device

Faults may occur at any of the following stages of the system's functioning:
  • system configuration, bring-up
  • feature enablement/configuration
  • during steady state
  • feature disablement/unconfiguration
  • while going-down (config reload, reboot etc.)

Present State

In SONiC, a fault is represented via an event or an alarm. SONiC has an Event Framework HLD, which lets an event detector publish its events to the eventD redisDB. However, there is no Fault Manager/Handler that can take the needed, platform-specified action(s) to recover the system from the generated fault.

Need for this feature

This feature aims to add a generic FM (Fault Management) infrastructure that can do the following:

  • Abstract the platform/HWSKU nuances from an open source NOS (i.e. SONiC) by publishing a platform-specific 'Fault-Action Policy table'
  • Fetch events (alarms/faults) from eventD (based on the published YANG/schema)
  • Analyze them (in a generic way) against the above-mentioned Policy Table
  • Take action based on the lookup/match in the Policy Table. The action can be either generic or platform-specific.
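
As a rough illustration of the analyze and take-action steps above, the following is a minimal sketch of a policy table lookup. It is not the faultmgrd implementation: the `FaultEvent` fields, the table keys, the fault names, and the action names are all hypothetical and chosen only to show the shape of the lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    """Hypothetical in-memory form of a fault fetched from the event store."""
    source: str    # hypothetical: component that reported the fault
    fault_id: str  # hypothetical: fault identifier
    severity: str  # hypothetical: severity label


# Hypothetical platform-supplied table: (fault_id, severity) -> action name.
POLICY_TABLE = {
    ("TEMPERATURE_EXCEEDED", "critical"): "reboot",
    ("TEMPERATURE_EXCEEDED", "major"): "syslog",
    ("ASIC_ERROR", "critical"): "shutdown",
}

# Assumed fallback when no policy entry matches the fault.
DEFAULT_ACTION = "syslog"


def lookup_action(event, table=POLICY_TABLE):
    """Return the action for a fault, or the default if nothing matches."""
    return table.get((event.fault_id, event.severity), DEFAULT_ACTION)


if __name__ == "__main__":
    event = FaultEvent("sensor0", "TEMPERATURE_EXCEEDED", "critical")
    print(lookup_action(event))
```

Keying the table on the fault identifier and severity keeps the lookup generic, while the table contents stay platform specific.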

Benefits

The platform-supplied 'Fault-Action Policy table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover from the fault. It can either go with the recommended action (provided by the fault source/detector) or override it with the system-level one.
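
The choice between the detector's recommended action and a system-level override could look like the sketch below. The function name and the convention that `None` means "no policy entry" are assumptions made for illustration, not part of the HLD.

```python
from typing import Optional


def resolve_action(recommended: Optional[str],
                   policy_action: Optional[str]) -> Optional[str]:
    """Pick the action to run for a fault.

    recommended:   action suggested by the fault source/detector (may be None).
    policy_action: action from the platform policy table, or None when the
                   table has no entry for this fault (assumed convention).
    """
    # The system-level policy wins when present; otherwise fall back to
    # whatever the detector recommended.
    return policy_action if policy_action is not None else recommended


if __name__ == "__main__":
    print(resolve_action("syslog", "reboot"))
    print(resolve_action("syslog", None))
```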

shyam77git avatar Nov 29 '23 18:11 shyam77git

Perhaps you can add special handling to avoid endless reboots and shutdowns. Is there an option to disable this feature temporarily, or could such an option be added to the action?

shiraez avatar Jan 24 '24 12:01 shiraez

The 202405 release fork date is coming. Can you please accelerate the code PR review and merge the PR by end of 5/30? Thanks.

zhangyanzhao avatar May 06 '24 00:05 zhangyanzhao

@shyam77git can you please update the PR description with the list of the code PRs? Also, in order to approve the code PRs we need a test plan review in the SONiC test group. Was that done? Can you share the test plan PR as well?

liat-grozovik avatar May 15 '24 16:05 liat-grozovik

@liat-grozovik will help to follow up with the reviewers. If there is no update, this will be deferred to a future release.

zhangyanzhao avatar May 22 '24 16:05 zhangyanzhao

HLD is not approved, move to backlog

zhangyanzhao avatar Jun 04 '24 17:06 zhangyanzhao