sonic-platform-daemons
SONiC FM (Fault Mgmt) infrastructure - Base version
HLD PR: https://github.com/sonic-net/SONiC/pull/1527
Additional relevant code PR: https://github.com/sonic-net/sonic-utilities/pull/3154
Summary
This adds a generic FM infrastructure to SONiC for fault analysis and handling. It broadly comprises the following three entities:
Description
- Faults (Events) publisher daemon (faultpubd) is spawned as a micro-service on the Linux host. It keeps running and provides the following w.r.t. system events (faults, alarms):
- Overall workflow is generic (common to all events)
- Formulates certain fault events
- Subscribes to the event-framework to receive events
- Runs an event_receive worker-thread which periodically looks for live incoming events on the ZMQ channel
- On reception of an event, parses it against sonic-events*.yang and publishes (adds) it as a new EVENT_TABLE entry to redisDB
- This event can then be consumed from redisDB by services/apps (such as faultmgrd) for event handling
Note: faultpubd functionality is part of a separate PR; it is needed until the following prerequisite is satisfied. Prerequisite: Certain code PRs (esp. those committing sonic-event.yang and publishing events to redisDB) are yet to be committed. Refer to https://github.com/sonic-net/SONiC/pull/1409
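The faultpubd receive-and-publish flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: a `queue.Queue` stands in for the ZMQ channel, a plain dict stands in for redisDB, and `REQUIRED_FIELDS`, `validate_event`, `event_receive_worker` and the `EVENT_TABLE|<seq>` key layout are hypothetical names. The real field set is defined by sonic-events*.yang.

```python
import queue
import threading

# Assumed field names for illustration; the real schema is sonic-events*.yang.
REQUIRED_FIELDS = ("type-id", "severity", "text")


def validate_event(event):
    """Stand-in for YANG validation: check that the assumed fields exist."""
    return all(field in event for field in REQUIRED_FIELDS)


def event_receive_worker(channel, event_table, stop, poll_interval=0.1):
    """Poll the channel and add each valid event as a new table entry.

    channel     -- queue.Queue standing in for the ZMQ event channel
    event_table -- dict standing in for the redisDB EVENT_TABLE
    stop        -- threading.Event used to request shutdown
    """
    seq = 0
    # Keep draining after a stop request so queued events are not lost.
    while not stop.is_set() or not channel.empty():
        try:
            event = channel.get(timeout=poll_interval)
        except queue.Empty:
            continue
        if not validate_event(event):
            continue  # drop events that do not match the assumed schema
        seq += 1
        event_table["EVENT_TABLE|{}".format(seq)] = dict(event)


def run_demo():
    """Publish one valid and one malformed event, then return the table."""
    channel = queue.Queue()
    event_table = {}
    stop = threading.Event()
    worker = threading.Thread(
        target=event_receive_worker, args=(channel, event_table, stop)
    )
    worker.start()
    channel.put({"type-id": "TEMPERATURE_EXCEEDED", "severity": "MAJOR",
                 "text": "temp exceeded"})
    channel.put({"severity": "MINOR"})  # malformed: missing fields, dropped
    stop.set()
    worker.join()
    return event_table
```

Calling `run_demo()` leaves a single entry in the stand-in table, since the malformed event fails validation. A consumer such as faultmgrd would read entries from the real EVENT_TABLE instead of this dict.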
- Faults manager daemon (faultmgrd) runs as a micro-service on the Linux host and performs the following tasks:
- registers and listens to redisDB
- receives events and parses them against schema (sonic-event.yang)
- performs a lookup for fault type & severity in the fault_policy.json file to determine the fault action
- The fault_policy.json file comprises generic and platform-specific F-A (Fault-Action) blocks, i.e. for a particular fault type & severity, which action(s) are needed to recover the system from the fault. It abstracts platform/HWSKU fault handling from the open source NOS (i.e. SONiC).
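The fault type & severity lookup could look something like the sketch below. The JSON layout shown here is an assumption made for illustration only (the actual fault_policy.json schema is defined in the HLD), as are the section names `generic` and `platform`, the action strings, and the helper names `build_index` and `lookup_actions`. The sketch also assumes that a platform-specific block overrides a generic block for the same fault type & severity, and that the action recommended by the fault source is used only when the policy has no matching block.

```python
import json

# Hypothetical policy content; the real fault_policy.json schema may differ.
POLICY_JSON = """
{
  "generic": [
    {"type": "TEMPERATURE_EXCEEDED", "severity": "MAJOR",
     "actions": ["syslog"]}
  ],
  "platform": [
    {"type": "TEMPERATURE_EXCEEDED", "severity": "MAJOR",
     "actions": ["syslog", "reload"]}
  ]
}
"""


def build_index(policy):
    """Map (fault type, severity) to a list of actions.

    Sections are processed in order, so platform blocks overwrite generic
    blocks with the same key (an assumption of this sketch).
    """
    index = {}
    for section in ("generic", "platform"):
        for block in policy.get(section, []):
            index[(block["type"], block["severity"])] = block["actions"]
    return index


def lookup_actions(index, fault_type, severity, recommended=None):
    """Return the policy actions, else fall back to the recommended action."""
    actions = index.get((fault_type, severity))
    if actions is not None:
        return actions
    return [recommended] if recommended else []


policy_index = build_index(json.loads(POLICY_JSON))
```

With this sample policy, `lookup_actions(policy_index, "TEMPERATURE_EXCEEDED", "MAJOR")` yields the platform-specific actions, while an unknown fault type falls back to whatever action the fault source recommended, if any.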
Motivation and Context
The objective of producing the FM HLD and this code PR is two-fold:
a) Not every SONiC NOS deployment has an External Controller to take action when a fault occurs. In that case, SONiC (with its underlying platform) is expected to take the required action to recover the system/chassis from the fault.
b) The platform-supplied 'Fault-Action Policy table' has a holistic/system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover the system from the fault. It can either go with the recommended action (provided by the FDR - fault source/detector) or override it with the system-level one.
The Fault Manager module would serve the purpose of taking the necessary action(s) to log and handle the faults. It's a new (infra) feature and planned for 202405. Not planned for any double commit.
How Has This Been Tested?
Please refer to attached logs where:
- Fault producer (faultpubd) injects and produces faults (as events)
- Fault Manager (faultmgrd) receives/reads (from DB), parses/analyzes them against fault-action policy table to determine needed actions (to recover from the fault)
Additional Information (Optional)
The following UT scenarios were validated:
- UT scenario A: Injected temperature sensor fault (temp exceeded) by instrumenting thermalctld, published it via chassis redisDB EVENT_TABLE entry, received and processed it (based on fault_action policy table).
- UT scenario B: Injected events-monit (resource) 'monit periodic test', published it via chassis redisDB EVENT_TABLE entry, received and processed it (based on fault_action policy table).
UT logs (Fault injection, publishing, storing in redisDB; processing from redisDB to take needed action(s)): FM-thermal-fault-injection.txt FM-UT-logs.txt FM-redisDB-faults-events.txt
@shyam77git can you update the link to HLD in the PR description?
@Junchao-Mellanox @keboliu can you review?
@shyam77git can you update the link to HLD in the PR description?
Added HLD PR link at the top of the description section.
@shyam77git
Faults manager daemon (faultmgrd) runs as a micro-service at Linux host and performs following tasks:
registers and listens to redisDB
receives events and parses them against schema (sonic-event.yang)
perform lookup for fault type & severity in fault_policy.json file to determine fault action
I suggest having this change in sonic-buildimage rather than in this repo, as the Fault manager is intended to run on the host as a host service.