[PSU & system health] Support PSU power threshold checking
Support PSU power threshold checking
- Two platform APIs are introduced to represent the warning and critical thresholds of a PSU's power
- In the main loop
  - Whenever a PSU becomes good, PSU daemon tries calling platform APIs to fetch both thresholds. If `None` is returned or `NotImplemented` is thrown by either API, the PSU power threshold checking will not be performed for the PSU
  - the PSU daemon compares the power with thresholds, exposing status to the database, and logging messages.
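The threshold-fetching step described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `StubPsu` and `fetch_power_thresholds` are hypothetical names, the two getter names are the ones mentioned later in this thread, and the sketch assumes the "not implemented" case surfaces as Python's built-in `NotImplementedError`.

```python
class StubPsu:
    """Hypothetical stand-in for a platform PSU object (not the real platform API)."""

    def __init__(self, warning=None, critical=None, implemented=True):
        self._warning = warning
        self._critical = critical
        self._implemented = implemented

    def get_psu_power_warning_threshold(self):
        # Assumption: an unsupported platform raises NotImplementedError.
        if not self._implemented:
            raise NotImplementedError
        return self._warning

    def get_psu_power_critical_threshold(self):
        if not self._implemented:
            raise NotImplementedError
        return self._critical


def fetch_power_thresholds(psu):
    """Return (warning, critical), or None when threshold checking must be skipped."""
    try:
        warning = psu.get_psu_power_warning_threshold()
        critical = psu.get_psu_power_critical_threshold()
    except NotImplementedError:
        # Either API is not implemented: skip checking for this PSU.
        return None
    if warning is None or critical is None:
        # Either API returned None: skip checking for this PSU.
        return None
    return warning, critical
```

A caller would run this whenever a PSU becomes good and only compare the runtime power against the thresholds when the result is not `None`.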
Signed-off-by: Stephen Sun [email protected]
@prgeor kindly reminder to review and provide your feedback
Conflicts resolved.
@stephenxs please update the https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md with the new command line.
Will add them in https://github.com/sonic-net/sonic-utilities/pull/2326 once they are finalized.
Regarding the comment raised from the review meeting, please compare chassis PSU designs w.r.t PSU budgets and capabilities:
The PSU power budget check is called periodically from the main loop for chassis-based systems. The flow is:
- Calculate the sum of the maximum supplied power of all PSUs by calling `psu.get_maximum_supplied_power()` for each PSU and putting the result together
- Calculate the sum of power consumption of all power consumers, including modules, and fan drawers, by calling `power_consumer.get_maximum_consumed_power()` and putting the result together
- Compare both values and set LED/raise alarm accordingly
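As a rough sketch of the three steps above: the stub classes and `power_budget_exceeded` are hypothetical, only the two getter names come from this comment, and treating "consumed strictly greater than supplied" as the alarm condition is an assumption.

```python
class StubBudgetPsu:
    """Hypothetical PSU stub exposing only the getter named in the flow."""

    def __init__(self, max_supplied):
        self._max_supplied = max_supplied

    def get_maximum_supplied_power(self):
        return self._max_supplied


class StubPowerConsumer:
    """Hypothetical power consumer stub (module, fan drawer, ...)."""

    def __init__(self, max_consumed):
        self._max_consumed = max_consumed

    def get_maximum_consumed_power(self):
        return self._max_consumed


def power_budget_exceeded(psus, consumers):
    """Return True when the static consumption exceeds the total supply."""
    # Step 1: sum the maximum supplied power of all PSUs.
    total_supplied = sum(psu.get_maximum_supplied_power() for psu in psus)
    # Step 2: sum the maximum consumed power of all power consumers.
    total_consumed = sum(c.get_maximum_consumed_power() for c in consumers)
    # Step 3: compare; the caller would set the LED / raise the alarm on True.
    return total_consumed > total_supplied
```

Both sums use static maximum values, which is the key difference from the per-PSU runtime threshold check.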
The PSU power threshold checking is called periodically from the function _update_single_psu_data for each PSU. It calls the platform APIs introduced in this design: get_psu_power_warning_threshold and get_psu_power_critical_threshold.
Both functionalities are introduced for PSU power handling, but the flows, values operated on, and platform APIs differ between them.
- For `PSU power budget`, the power consumption is a static value. This is because we are able to control the power of each component in the chassis-based system. But for `PSU power checking`, we want to compare the runtime power with the `power threshold`. This is because
  - It is the motivation to compare the runtime power consumption.
  - We are not able to get the static power of the submodules in a pizza box system - there is no platform API for that and it is not supported by hardware either.
- As a result, the platform APIs are different.
Currently, both flows are clear and easy to understand. If we merged both functionalities together, we would still need different flows to handle pizza-box and chassis systems. This would make the code complicated and difficult to understand and maintain. So we should keep both flows independent of each other. @lguohan FYI.
Reply to the community review comments:
- Q: How is this PSU design different from chassis mgmt? Can chassis mgmt design be leveraged? A: replied here
- Q: Community suggested comparing chassis PSU designs w.r.t. PSU budgets and capabilities. A: same as above
- Q: Does the design introduce any new platform APIs? If yes, please add a separate section to describe them. A: There is a section in the HLD for the new platform APIs.
- Q: What is the difference between the max power threshold and the critical power threshold? How have these values been determined and set up? A: The platform vendor provides the values. If the platform vendor doesn't support this function, the PSU daemon skips all the logic that uses it.
- Q: Community suggested using a hysteresis graph to represent the critical and warning PSU thresholds. A: Added to the HLD.
- Q: Provide a CLI command for users to overwrite hysteresis PSU thresholds to PSU daemon? A: After internal discussion, we do not suggest doing so, because platform vendors understand the hardware better and therefore know which values are best. If we allowed users to set them, a value that is either too high or too low would cause issues.
@jeffhtgt any further comments following the above discussion? @prgeor any comments from your side?
as this feature is for 202211, could you guys help to close the HLD and the PRs review?
@liat-grozovik I have no further comments, thank you.
Thanks. Can you then approve the PR from your POV?
@shyam77git can you review from chassis point of view?
@prgeor the PR is approved and build passed, are you ok to merge? Thanks.
@stephenxs it's not clear how the user is notified. If via syslog, how many times is the syslog message raised?
Via syslog. The message is printed only once each time the alarm is raised or cleared. P.S. This is not specific to the feature itself; it is how the PSU daemon works.
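To illustrate the "logged once per transition" behavior together with the hysteresis mentioned earlier in the thread, here is a minimal sketch. `PowerAlarm` is a hypothetical class, messages are collected in a list rather than sent to syslog, and the exact rule (raise at or above the critical threshold, clear only below the warning threshold) is an assumption; the HLD has the authoritative definition.

```python
class PowerAlarm:
    """Hypothetical hysteresis alarm that records a message only on state changes."""

    def __init__(self, warning, critical):
        self.warning = warning
        self.critical = critical
        self.raised = False
        self.messages = []  # stand-in for syslog output

    def update(self, power):
        """Feed one runtime power sample and return the current alarm state."""
        if not self.raised and power >= self.critical:
            # Assumed rule: raise once the power reaches the critical threshold.
            self.raised = True
            self.messages.append(f"PSU power alarm raised: {power} W")
        elif self.raised and power < self.warning:
            # Assumed rule: clear only after dropping below the warning threshold.
            self.raised = False
            self.messages.append(f"PSU power alarm cleared: {power} W")
        # No message is recorded while the state stays unchanged.
        return self.raised


alarm = PowerAlarm(warning=90.0, critical=100.0)
for sample in [80.0, 95.0, 101.0, 105.0, 95.0, 92.0, 85.0, 80.0]:
    alarm.update(sample)
```

With these samples the alarm is raised at 101.0 and cleared at 85.0, so exactly two messages are recorded even though the power stays above the warning threshold for several polls in between.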