sonic-buildimage

BCM J2C+ ASIC internal thermal sensor phantom temperature spike

Open snider-nokia opened this issue 11 months ago • 7 comments

Description

This problem has been seen on production J2C+ silicon and is tracked at BCM CSP #CS00012333604.

BCM has informally provided an ASIC thermal sensor patch for SDK 6.5.24 (SAI_7.1.0_GA), and this patch alleviates the issue.

                ---------------------

We currently use the bcmsh "show pvt" command to periodically read the J2C+ internal thermal sensors. We do this approximately every 10 seconds and then parse the command output for thermal monitoring purposes. Note, however, that switching to the native SAI operation for reading these thermal sensor values would have no impact on the issue described below (the same situation exists when using the native SAI method).

On occasion we see one of several J2C+ internal thermal sensors spiking high spontaneously.

We have seen this issue on multiple line cards (all using BCM88852_A1 devices) and have seen the issue on multiple sensors (including 1--FAB1, 2--FAB2, 3--FAB3, and 6--PRM). After the spiking sample iteration, the same sensor always returns what looks like a normal value at the next invocation of the command.
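The signature described above (one sample jumps high, and the very next sample looks normal again) can be picked out of a logged series of readings. The following is only a minimal sketch for offline log analysis: the function name `find_isolated_spikes` and the 10 °C jump threshold are assumptions made for illustration, not values taken from this report or from BCM.

```python
from typing import List, Sequence


def find_isolated_spikes(samples: Sequence[float], jump_c: float = 10.0) -> List[int]:
    """Return indices of samples that sit well above BOTH neighbours.

    `jump_c` is a hypothetical threshold (degrees C); tune it to the
    spike magnitudes actually observed on the platform.
    """
    spikes = []
    # The first and last samples have only one neighbour, so skip them.
    for i in range(1, len(samples) - 1):
        prev, cur, nxt = samples[i - 1], samples[i], samples[i + 1]
        # A phantom spike rises sharply and falls back on the next sample.
        if cur - prev > jump_c and cur - nxt > jump_c:
            spikes.append(i)
    return spikes
```

For example, a series such as `[48.0, 49.0, 95.0, 49.5, 48.5]` flags only index 2, whereas a genuine sustained rise is not flagged because the following sample stays high.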

The problem, when it occurs, results in a very short duration (false) thermal concern which causes our thermal algorithm to temporarily increase chassis cooling. The situation then normalizes quickly after the impacted sensor returns to a normal value at the next sample iteration (~10 seconds later). Impact is minimal and the biggest issue is likely the potential for operator concern over a (false) thermal subsystem log message.

The issue is very sporadic: across all LCs in a fully populated chassis, we have seen it roughly once per week, on a single LC (that is, about one occurrence per chassis per week). Of course, we cannot guarantee the frequency of occurrence in the field.

This issue was reported to BCM via CSP CS00012333604.

There is a J2 (not J2C+) erratum, EID#8032, that precisely describes the above issue. However, BCM initially indicated that the problem had never been seen or reported on J2C+, and the referenced erratum never made it into the J2C+ errata document.

Nevertheless, BCM has now indicated in the CSP CS00012333604 conversation that the problem could also occur on J2C+. Indeed, we are quite sure that we have repeatedly witnessed this issue on J2C+.

Despite not having previously seen the issue on J2C+, BCM implemented a fix for this issue on J2C+ in SDK 6.5.25.

The SONiC 202205 image currently uses SDK 6.5.24, however, and moving to SDK 6.5.25 would impact SAI. BCM provided us a patch (for SDK 6.5.24), and we have verified that the issue no longer occurs when running with this patch.

The question to MSFT is whether they want BCM to formally release the above patch (for use with 202205 and SDK 6.5.24) or would rather wait for the organic move to SDK >= 6.5.25 (and the associated SAI).

                ---------------------

Assignee(s) TBD...

snider-nokia avatar Mar 07 '24 16:03 snider-nokia

@snider-nokia Please update and close this issue as discussed.

prabhataravind avatar Mar 13 '24 15:03 prabhataravind

MSFT has elected to sit tight with the current situation (under 202205 and SDK 6.5.24). Associated BCM CSP CS00012333604 has been closed.

We can reopen the BCM ticket if the problem starts occurring on deployed systems (running SDK 6.5.24) and if MSFT then determines it necessary to request the patch for formal release.

snider-nokia avatar Mar 13 '24 15:03 snider-nokia

BCM has now been requested to formally release the patch to fix this issue under SDK 6.5.24 and has agreed to do so. The relevant conversation is excerpted below.

image

snider-nokia avatar Mar 21 '24 18:03 snider-nokia

BCM has made the patch available; details are in the snapshot below. The patch-related text is also excerpted after the snapshot in case that is helpful:

image

Here is the information:

commit e16e1fb825a005591104c252d2c189ea237b0466
Author: sonicbld [email protected]
Date: Tue Mar 26 13:07:47 2024 -0700

updated sai release version to 7.1.77.4

Update git submodules

* Update sdk-src/hsdk_6.5.24_SAI_7.1.0_GA from branch 'hsdk_6.5.24_SAI_7.1.0_GA'
  to 183ee14b9930a0956e400b645d35137de0935d6f
  -  [SAI_BRANCH rel_ocp_sai_7_1] Backport JIRA SDK-272527 to rel_ocp_sai_7_1

    JIRA# SDK-272527

    Issue Summary: JR2 PVTMON temp readings has rare outlier values

    Root Cause: SW fix for PVT

    Fix Description: SW actions to reduce the outlier value by reading 5 times, and check if value difference > 5 °C, then reject the reading, and reading again.
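The fix description above is terse, so here is a rough sketch of one way to read it: take 5 readings, and if they differ by more than 5 °C, discard the whole batch and read again. This is not BCM's SDK code. The function name `read_filtered`, the `read_sensor` callback, the retry cap `max_attempts`, and returning the batch mean are all assumptions made for illustration.

```python
from typing import Callable


def read_filtered(
    read_sensor: Callable[[], float],
    samples: int = 5,           # "reading 5 times" per the fix description
    max_spread_c: float = 5.0,  # "value difference > 5 °C" per the fix description
    max_attempts: int = 3,      # assumption: the description gives no retry limit
) -> float:
    """Return a temperature from a batch of readings that agree with each other."""
    for _ in range(max_attempts):
        batch = [read_sensor() for _ in range(samples)]
        # Accept the batch only if its readings are within max_spread_c of each other.
        if max(batch) - min(batch) <= max_spread_c:
            # Assumption: report the mean of the accepted batch.
            return sum(batch) / len(batch)
        # Otherwise the batch contains an outlier: reject it and read again.
    raise RuntimeError("no consistent batch of readings within max_attempts")
```

With this interpretation, a single phantom spike inside a batch causes that batch to be dropped, so the spike never reaches the thermal algorithm; the cost is up to `samples * max_attempts` raw reads per reported value.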

snider-nokia avatar Mar 26 '24 21:03 snider-nokia

The Brcm SAI patch is available; re-opening this issue to track the SAI patch fix needed.

rlhui avatar Apr 03 '24 17:04 rlhui

The BCM patch was previously believed to be in the DMZ, but it actually was NOT, due to a problem on BCM's end. The problem was corrected yesterday, and the patch is now indeed there.

snider-nokia avatar Apr 10 '24 14:04 snider-nokia

Broadcom had an internal problem with posting this patch to the DMZ. That problem was reconciled on Wednesday, 4/10 and the patch is indeed now available at the DMZ.

snider-nokia avatar Apr 12 '24 14:04 snider-nokia

Brcm SAI 7.1.66.7 posted in 202205 branch has the fix.

rlhui avatar Apr 17 '24 17:04 rlhui