[PSU & system health] Support PSU power threshold checking

Open stephenxs opened this issue 3 years ago • 11 comments

PRs

PR title
[PSU] Add warning/critical thresholds for PSU power
[PSU daemon] Support PSU power threshold checking
[psushow & psuutil] Support PSU power threshold checking
[system health daemon] Support PSU power threshold checking
[Mellanox] Support PSU power threshold checking
[Mellanox] Support PSU power threshold checking

Support PSU power threshold checking

  • Two platform APIs are introduced to represent the warning and critical thresholds of a PSU's power
  • In the main loop
    • Whenever a PSU becomes good, the PSU daemon tries calling the platform APIs to fetch both thresholds. If either API returns None or raises NotImplementedError, PSU power threshold checking is not performed for that PSU.
    • Otherwise, the PSU daemon compares the PSU's power with the thresholds, exposes the status to the database, and logs messages.
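The fallback rule above can be sketched as follows. This is only an illustration, not the PSU daemon's code: the two getter names are taken from a later comment in this thread, `fetch_power_thresholds`, `FakePsu`, and `LegacyPsu` are hypothetical names, and `NotImplementedError` is assumed to be the Python exception meant by "NotImplemented".

```python
def fetch_power_thresholds(psu):
    """Return (warning, critical), or None if checking should be skipped."""
    try:
        warning = psu.get_psu_power_warning_threshold()
        critical = psu.get_psu_power_critical_threshold()
    except NotImplementedError:
        # The platform does not implement the APIs: skip the check.
        return None
    if warning is None or critical is None:
        # The platform implements the APIs but has no value for this PSU.
        return None
    return warning, critical


class FakePsu:
    """Hypothetical stand-in for a platform PSU object."""

    def __init__(self, warning, critical):
        self._warning = warning
        self._critical = critical

    def get_psu_power_warning_threshold(self):
        return self._warning

    def get_psu_power_critical_threshold(self):
        return self._critical


class LegacyPsu:
    """Hypothetical PSU on a platform that lacks the new APIs."""

    def get_psu_power_warning_threshold(self):
        raise NotImplementedError

    def get_psu_power_critical_threshold(self):
        raise NotImplementedError
```

With these stand-ins, a PSU that reports both thresholds yields a tuple, while a missing value or an unimplemented API yields None, so the caller can skip the comparison for that PSU.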

Signed-off-by: Stephen Sun [email protected]

stephenxs avatar Aug 18 '22 09:08 stephenxs

@prgeor kindly reminder to review and provide your feedback.

liat-grozovik avatar Sep 28 '22 07:09 liat-grozovik

Conflicts resolved.

stephenxs avatar Oct 04 '22 01:10 stephenxs

@stephenxs please update the https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md with the new command line.

zhangyanzhao avatar Oct 04 '22 15:10 zhangyanzhao

@stephenxs please update the https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md with the new command line.

Will add them in https://github.com/sonic-net/sonic-utilities/pull/2326 once they are finalized.

stephenxs avatar Oct 04 '22 16:10 stephenxs

Regarding the comment raised in the review meeting ("please compare chassis PSU designs w.r.t. PSU budgets and capabilities"): the PSU power budget check is called periodically from the main loop on chassis-based systems. The flow is:

  1. Calculate the total maximum supplied power of all PSUs by calling psu.get_maximum_supplied_power() for each PSU and summing the results
  2. Calculate the total power consumption of all power consumers, including modules and fan drawers, by calling power_consumer.get_maximum_consumed_power() for each consumer and summing the results
  3. Compare both values and set LED/raise alarm accordingly
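A minimal sketch of this budget flow, for comparison with the per-PSU threshold check. Only `get_maximum_supplied_power()` and `get_maximum_consumed_power()` come from the description above; `power_budget`, `StubPsu`, `StubConsumer`, and the rule that the budget is exceeded when consumption is greater than supply are assumptions made for illustration.

```python
def power_budget(psus, consumers):
    """Return (budget_ok, total_supplied, total_consumed)."""
    # Step 1: sum the maximum supplied power of all PSUs.
    supplied = sum(psu.get_maximum_supplied_power() for psu in psus)
    # Step 2: sum the maximum consumed power of all consumers
    # (modules, fan drawers, ...).
    consumed = sum(c.get_maximum_consumed_power() for c in consumers)
    # Step 3: compare; the caller would set the LED or raise an alarm.
    return supplied >= consumed, supplied, consumed


class StubPsu:
    """Hypothetical PSU with a fixed maximum supplied power (watts)."""

    def __init__(self, watts):
        self._watts = watts

    def get_maximum_supplied_power(self):
        return self._watts


class StubConsumer:
    """Hypothetical module or fan drawer with a fixed maximum draw (watts)."""

    def __init__(self, watts):
        self._watts = watts

    def get_maximum_consumed_power(self):
        return self._watts


ok, supplied, consumed = power_budget(
    [StubPsu(1200.0), StubPsu(1200.0)],
    [StubConsumer(900.0), StubConsumer(900.0), StubConsumer(300.0)],
)
```

Note that both sums use static maximum values, not runtime readings.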

The PSU power threshold checking is called periodically from the function _update_single_psu_data for each PSU. It calls the platform APIs introduced in this design: get_psu_power_warning_threshold and get_psu_power_critical_threshold.

Both functionalities are introduced for PSU power handling, but the flows, values operated on, and platform APIs differ between them.

  1. For the PSU power budget, the power consumption is a static value, because we are able to control the power of each component in a chassis-based system. For PSU power threshold checking, we want to compare the runtime power with the power threshold, because:
    • Comparing the runtime power consumption is the motivation for the feature.
    • We cannot get the static power of the submodules in a pizza-box system: there is no platform API for that, and the hardware does not support it either.
  2. As a result, the platform APIs are different.

Currently, both flows are clear and easy to understand. If we merged the two functionalities, we would still need different flows to handle pizza-box and chassis systems, which would make the code complicated and difficult to understand and maintain. So we should keep the two flows independent of each other. @lguohan FYI.

stephenxs avatar Oct 07 '22 02:10 stephenxs

Reply to the community review comments:

  • Q: How is this PSU design different from chassis mgmt? Can the chassis mgmt design be leveraged? A: Replied here.
  • Q: The community suggests comparing chassis PSU designs w.r.t. PSU budgets and capabilities. A: Same as above.
  • Q: Does the design introduce any new platform APIs? If yes, please add a separate section to describe them. A: There is a section in the HLD for the new platform APIs.
  • Q: What is the difference between the max power threshold and the critical power threshold? How have these values been determined and set up? A: The platform vendor provides the values. If the platform vendor does not support this function, the PSU daemon skips all the logic that uses it.
  • Q: The community suggested using a hysteresis graph to represent the critical and warning PSU thresholds. A: Added to the HLD.
  • Q: Should a CLI command be provided for users to override the hysteresis PSU thresholds in the PSU daemon? A: After internal discussion, we do not suggest doing so. Platform vendors understand the hardware better and therefore know which values are best. If users were allowed to set the thresholds, a value that is either too high or too low would cause issues.
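To illustrate what a hysteresis between the two thresholds could look like, here is a rough sketch. The exact rule is defined in the HLD, not in this thread, so the behaviour below is an assumption: the alarm is raised when power rises above the critical threshold and cleared only when power falls below the warning threshold. `PowerAlarm` and the wattage values are hypothetical.

```python
class PowerAlarm:
    """Hypothetical two-threshold alarm with hysteresis."""

    def __init__(self, warning, critical):
        self.warning = warning
        self.critical = critical
        self.active = False

    def update(self, power):
        """Return "raised" or "cleared" on a state change, else None."""
        if not self.active and power > self.critical:
            self.active = True
            return "raised"
        if self.active and power < self.warning:
            self.active = False
            return "cleared"
        # Between the thresholds the previous state is kept.
        return None


alarm = PowerAlarm(warning=900.0, critical=1000.0)
readings = (850.0, 950.0, 1010.0, 1020.0, 950.0, 890.0)
events = [alarm.update(power) for power in readings]
```

Because `update` reports only transitions, a caller that logs each non-None event would emit one message when the alarm is raised and one when it is cleared, even if the power stays above the critical threshold for many readings.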

stephenxs avatar Oct 14 '22 02:10 stephenxs

@jeffhtgt any further comments following the above discussion? @prgeor any comments from your side?

As this feature is for 202211, could you help to close the review of the HLD and the PRs?

liat-grozovik avatar Oct 18 '22 09:10 liat-grozovik

@liat-grozovik I have no further comments, thank you.

jeffhtgt avatar Oct 18 '22 15:10 jeffhtgt

@liat-grozovik I have no further comments, thank you.

Thanks. Can you then approve the PR from your POV?

liat-grozovik avatar Oct 19 '22 12:10 liat-grozovik

@shyam77git can you review from chassis point of view?

prgeor avatar Nov 10 '22 22:11 prgeor

@prgeor the PR is approved and build passed, are you ok to merge? Thanks.

zhangyanzhao avatar Nov 12 '22 02:11 zhangyanzhao

@stephenxs it's not clear how the user is notified. If by syslog, how many times is the syslog message raised?

By syslog. The message is printed only once when the alarm is raised and once when it is cleared. PS: this is not specific to the feature itself; it is how the PSU daemon works.

stephenxs avatar Nov 18 '22 00:11 stephenxs