sonic-buildimage
BCM J2C+ ASIC internal thermal sensor phantom temperature spike
Description
This problem has been seen on production J2C+ silicon and is tracked at BCM CSP #CS00012333604.
BCM has informally provided an ASIC thermal sensor patch for SDK 6.5.24 (SAI_7.1.0_GA), and this patch alleviates the issue.
---------------------
We currently use the bcmsh "show pvt" command to periodically read the J2C+ internal thermal sensors. We do this approximately every 10 seconds and parse the command output for thermal monitoring purposes. Note, however, that reading these sensor values through the native SAI operation instead would have no impact on the issue described below; the same situation exists with the native SAI method.
On occasion we see one of several J2C+ internal thermal sensors spiking high spontaneously.
We have seen this issue on multiple line cards (all using BCM88852_A1 devices) and have seen the issue on multiple sensors (including 1--FAB1, 2--FAB2, 3--FAB3, and 6--PRM). After the spiking sample iteration, the same sensor always returns what looks like a normal value at the next invocation of the command.
The problem, when it occurs, results in a very short duration (false) thermal concern which causes our thermal algorithm to temporarily increase chassis cooling. The situation then normalizes quickly after the impacted sensor returns to a normal value at the next sample iteration (~10 seconds later). Impact is minimal and the biggest issue is likely the potential for operator concern over a (false) thermal subsystem log message.
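For anyone trying to spot these events in their own polling logs, the signature described above (one high sample, with a normal value on the next read) can be sketched as follows. This is only an illustration: the function name, the 15 °C jump threshold, and the sample values in the usage note are hypothetical and do not come from our thermal algorithm or from BCM.

```python
from typing import List


def find_single_sample_spikes(samples: List[float], jump_c: float = 15.0) -> List[int]:
    """Return indices of samples that exceed BOTH neighbours by more than jump_c.

    samples: consecutive readings (deg C) of one sensor, one per polling interval.
    jump_c:  hypothetical threshold; tune it for the platform in question.
    """
    spikes = []
    for i in range(1, len(samples) - 1):
        prev, cur, nxt = samples[i - 1], samples[i], samples[i + 1]
        # A phantom spike jumps up and is back to normal at the very next sample.
        if cur - prev > jump_c and cur - nxt > jump_c:
            spikes.append(i)
    return spikes
```

With made-up readings such as `[48.0, 49.0, 97.0, 49.0, 50.0]` the function flags index 2, while a sustained rise such as `[48.0, 49.0, 97.0, 98.0, 97.0]` is not flagged, since a real thermal event would not disappear after one sample.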
The issue is very sporadic: we see roughly one occurrence, on one LC, per week across all LCs in a fully populated chassis (that is, about a single occurrence per chassis per week). Of course, we cannot guarantee the frequency of occurrence in the field.
This issue was reported to BCM via CSP CS00012333604.
There is a J2 (not J2C+) erratum, EID#8032, that precisely describes the above issue. However, BCM initially indicated that the problem had never been seen or reported on J2C+, and the referenced erratum never made it into the J2C+ errata document.
Nevertheless, BCM has now indicated during the CSP CS00012333604 conversation that the problem could also occur on J2C+. Indeed, we are quite sure that we have repeatedly witnessed this issue on J2C+.
Despite not having previously seen the issue on J2C+, BCM implemented a fix for this issue on J2C+ in SDK 6.5.25.
The SONiC 202205 image currently uses SDK 6.5.24, however, and moving to SDK 6.5.25 would impact SAI. BCM provided us with a patch (for SDK 6.5.24), and we have verified that the issue no longer occurs when running with this patch.
The question to MSFT is whether it wants BCM to formally release the above patch (for use with 202205 and SDK 6.5.24) or would rather wait for the organic move to SDK >= 6.5.25 (and the associated SAI).
---------------------
Assignee(s) TBD...
@snider-nokia Please update and close this issue as discussed.
MSFT has elected to sit tight with the current situation (under 202205 and SDK 6.5.24). Associated BCM CSP CS00012333604 has been closed.
We can reopen the BCM ticket if the problem starts occurring on deployed systems (running SDK 6.5.24) and if MSFT then determines it necessary to request the patch for formal release.
BCM has now been requested to formally release the patch to fix this issue under SDK 6.5.24 and has agreed to do so. The relevant conversation is excerpted below.
BCM has made the patch available as per the following detailed information. The excerpted patch-related text follows the snapshot, in case that is also helpful:
Here is the information:
commit e16e1fb825a005591104c252d2c189ea237b0466
Author: sonicbld [email protected]
Date: Tue Mar 26 13:07:47 2024 -0700
updated sai release version to 7.1.77.4
Update git submodules
* Update sdk-src/hsdk_6.5.24_SAI_7.1.0_GA from branch 'hsdk_6.5.24_SAI_7.1.0_GA'
to 183ee14b9930a0956e400b645d35137de0935d6f
- [SAI_BRANCH rel_ocp_sai_7_1] Backport JIRA SDK-272527 to rel_ocp_sai_7_1
JIRA# SDK-272527
Issue Summary: JR2 PVTMON temp readings has rare outlier values
Root Cause: SW fix for PVT
Fix Description: SW actions to reduce the outlier value by reading 5 times, and check if value difference > 5 oC, then reject the reading, and reading again.
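One possible reading of the fix description above, sketched in Python for discussion purposes only. The actual SDK change is not shown in this thread, so several details here are assumptions: that "value difference" means the max-to-min spread within the batch of 5 reads, that the median of an accepted batch is what gets reported, and that retries are capped (`MAX_ATTEMPTS`). The `read_raw` callable is a hypothetical stand-in for a single raw PVT sensor read.

```python
from statistics import median
from typing import Callable

READS_PER_BATCH = 5   # from the fix description: "reading 5 times"
MAX_SPREAD_C = 5.0    # from the fix description: difference > 5 deg C is rejected
MAX_ATTEMPTS = 3      # assumption: the fix description gives no retry limit


def read_filtered(read_raw: Callable[[], float]) -> float:
    """Return a temperature taken from a batch of mutually consistent raw reads."""
    for _ in range(MAX_ATTEMPTS):
        batch = [read_raw() for _ in range(READS_PER_BATCH)]
        # Assumed interpretation: reject the whole batch if any two reads
        # differ by more than the threshold, then read again.
        if max(batch) - min(batch) <= MAX_SPREAD_C:
            return median(batch)  # assumption: report the median of the batch
    raise RuntimeError("no consistent batch of readings")
```

Under these assumptions, a batch containing a single outlier is discarded as a whole and the sensor is read again, so an isolated phantom value never reaches the caller.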
The Brcm SAI patch is available; re-opening this to track the SAI patch fix needed.
The BCM patch was previously believed to be in the DMZ but actually was NOT, due to a problem at BCM's end. The problem was corrected yesterday and the patch is now indeed there.
Broadcom had an internal problem with posting this patch to the DMZ. That problem was reconciled on Wednesday, 4/10 and the patch is indeed now available at the DMZ.
Brcm SAI 7.1.66.7 posted in 202205 branch has the fix.