sonic-buildimage
'sfputil firmware run' cmd needs better resilience and synchronization with PMON Xcvrd
Description
When 'sfputil firmware run
Steps to reproduce the issue
- For the simplest case, ensure the targeted interface has a CMIS module present that properly supports the FW upgrade procedure, and that the interface is admin down (the interface can also be set to admin up, which makes the problem even more likely to occur).
- Issue the 'sfputil firmware run' command to induce the target transceiver module to reset and begin running its inactive (or active, when supported) firmware load.
- Wait for the interface to stabilize and then run the command again, repeating until the 'sfputil firmware run' command fails with a traceback.
Describe the results you received
Exactly what is described in paragraph 3 above.
Describe the results you expected
Ideally, transceiver module accesses should not fail after this command is issued. However, there is no guarantee that accesses to the module will complete successfully until the module stabilizes post-reset.
Additional information you deem important
Two detailed annotated samples are provided below to show the progression of events involved here.
In the first sample, the 'sfputil firmware run' command appears to work fine and indicates a successful completion status. Even in this case, though, PMON Xcvrd experiences failures when it simultaneously attempts to access the target module.
In the second sample, the 'sfputil firmware run' command fails with a traceback when it issues a module read operation that doesn't complete successfully and the platform specific code returns value None (as specified by sfp_base.py, and due to the failed access).
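The failure mode in the second sample can be illustrated with a minimal sketch. It assumes only what is stated above: a read_eeprom(offset, num_bytes) callable that returns a bytearray on success and None on a failed access, as sfp_base.py specifies. The names ModuleReadError, read_eeprom_checked and report_fw_status are hypothetical and are not part of sfputil.

```python
from typing import Callable, Optional


class ModuleReadError(Exception):
    """Raised when a module EEPROM read does not complete successfully."""


def read_eeprom_checked(
    read_eeprom: Callable[[int, int], Optional[bytearray]],
    offset: int,
    num_bytes: int,
) -> bytearray:
    """Call the platform read and turn a None result into a clear error.

    Indexing or slicing None is what produces the traceback; checking here
    lets the caller report the failed access instead.
    """
    data = read_eeprom(offset, num_bytes)
    if data is None:
        raise ModuleReadError(
            f"read of {num_bytes} byte(s) at offset {offset} failed"
        )
    return data


def report_fw_status(
    read_eeprom: Callable[[int, int], Optional[bytearray]]
) -> str:
    """Hypothetical CLI-side caller: report a failed read without a traceback."""
    try:
        data = read_eeprom_checked(read_eeprom, 0, 1)
    except ModuleReadError as err:
        return f"Module not accessible: {err}"
    return f"Read OK: 0x{data[0]:02x}"
```

With a check of this kind, a read that fails while the module is resetting would surface as an error message (and could be retried) rather than as an unhandled exception.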
PMON Xcvrd threads should not be attempting to access a module that has this command issued to it until such time as the module is understood to be operating normally again (and is prepared to sink accesses). As the transceiver subsystem architecture stands now, these Xcvrd threads may try to provision/de-provision the module datapath, solicit DDM/DOM data, or interact with the module in other ways.
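One possible shape for such synchronization is sketched below, purely as an illustration. It assumes a shared "port busy" marker with an expiry deadline that the CLI sets before issuing the run command and that Xcvrd tasks consult before touching the module. PortBusyRegistry is a hypothetical name, and the in-process dictionary stands in for whatever cross-process store would really be needed, since sfputil and Xcvrd run as separate processes.

```python
import threading
import time


class PortBusyRegistry:
    """Tracks ports that are temporarily off-limits, each with a deadline.

    Hypothetical helper: the dict below is a stand-in for a store that is
    visible to both the CLI and Xcvrd.
    """

    def __init__(self, clock=time.monotonic):
        self._lock = threading.Lock()
        self._busy_until = {}
        self._clock = clock

    def mark_busy(self, port, hold_seconds):
        """Mark a port as busy for at most hold_seconds from now."""
        with self._lock:
            self._busy_until[port] = self._clock() + hold_seconds

    def clear(self, port):
        """Release a port early, once the module is known to be ready."""
        with self._lock:
            self._busy_until.pop(port, None)

    def is_busy(self, port):
        """Return True while the port is marked busy and not yet expired."""
        with self._lock:
            deadline = self._busy_until.get(port)
            if deadline is None:
                return False
            if self._clock() >= deadline:
                # Expired marker: drop it so a crashed CLI cannot block forever.
                del self._busy_until[port]
                return False
            return True
```

In this sketch, a monitoring task would skip a port while is_busy(port) returns True and resume normal provisioning and DOM polling afterwards. The expiry deadline is a design choice that prevents a stale marker from locking a port out indefinitely.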
Investigation and sample runs were conducted using Acacia ZR module target with the FW versions shown (at interface Ethernet80):
CMIS spec indicates that module behavior during the associated 'Run FW Image' reset is 'vendor and technology dependent', thus there can be no assumption that module can be accessed prior to quiescing post-reset. Spec further indicates that CMD 0041h: 'Firmware Management Features' is (should be) used to query firmware command performance attributes (for example, how long it may take maximally to execute commands).
Consistent with the above, the Acacia ZR/ZR+ documentation that we have here states that 'Before issuing/using FW download [including 0109h: Run Image], the host should issue CMD 0041h to familiarize itself with the features supported, and in particular the max timeout values.'
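The role of the max timeout values can be sketched as a bounded wait for the module to quiesce. The sketch assumes that max_wait_s has already been obtained from the module (for example, from the CMD 0041h reply, whose parsing is not shown here) and that read_eeprom(offset, num_bytes) returns None on a failed access. The function name wait_for_module_ready and the "several consecutive successful reads" criterion are illustrative assumptions, not behavior defined by CMIS or implemented in sfputil.

```python
import time


def wait_for_module_ready(
    read_eeprom,
    max_wait_s,
    poll_interval_s=1.0,
    required_successes=3,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll the module until reads succeed repeatedly or max_wait_s elapses.

    Returns True once required_successes consecutive reads return data,
    and False if the deadline passes first.
    """
    deadline = clock() + max_wait_s
    streak = 0
    while True:
        data = read_eeprom(0, 1)
        # Any failed read resets the streak: the module may still be resetting.
        streak = streak + 1 if data is not None else 0
        if streak >= required_successes:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A caller would issue the run command, call this function with the module-reported maximum duration, and only then resume normal accesses or report a timeout.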
It may be necessary to engage module vendor(s) in order to understand the specific access restrictions in this area (during the associated 'Run Image' reset period), as it would appear that module behavior can/does change with different FW versions and also when contrasting non-hitless with hitless upgrade.
1st annotated sample run:
2nd annotated sample run:
Output of show version
Using 202205 branch...
admin@ixre-egl-board40:~$ show version
SONiC Software Version: SONiC.HEAD.600988-msft-2205-ndk-c854a6a2
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-18-2-amd64
Build commit: c854a6a2
Build date: Mon Dec 18 22:37:27 UTC 2023
Built by: gitlab-runner@sonic-bld2
Platform: x86_64-nokia_ixr7250e_36x400g-r0
HwSKU: Nokia-IXR7250E-36x400G
ASIC: broadcom
ASIC Count: 2
Serial Number: EAG2-02-052
Model Number: N/A
Hardware Revision: 56
Uptime: 18:56:46 up 1 day, 2:49, 1 user, load average: 1.14, 1.32, 1.36
Date: Wed 20 Dec 2023 18:56:46
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-orchagent HEAD.600988-msft-2205-ndk-c854a6a2 90241935b030 406MB
docker-orchagent latest 90241935b030 406MB
docker-fpm-frr HEAD.600988-msft-2205-ndk-c854a6a2 df6176897cc2 418MB
docker-fpm-frr latest df6176897cc2 418MB
docker-teamd HEAD.600988-msft-2205-ndk-c854a6a2 a2d7a8d56c83 389MB
docker-teamd latest a2d7a8d56c83 389MB
docker-macsec latest 5394fbb21224 391MB
docker-syncd-brcm-dnx HEAD.600988-msft-2205-ndk-c854a6a2 6353695a531f 718MB
docker-syncd-brcm-dnx latest 6353695a531f 718MB
docker-gbsyncd-broncos HEAD.600988-msft-2205-ndk-c854a6a2 d9ed266637ba 419MB
docker-gbsyncd-broncos latest d9ed266637ba 419MB
docker-gbsyncd-credo HEAD.600988-msft-2205-ndk-c854a6a2 ca22f0ad248b 392MB
docker-gbsyncd-credo latest ca22f0ad248b 392MB
docker-dhcp-relay latest f4f260277d3f 380MB
docker-snmp HEAD.600988-msft-2205-ndk-c854a6a2 21e34abe3852 422MB
docker-snmp latest 21e34abe3852 422MB
docker-platform-monitor HEAD.600988-msft-2205-ndk-c854a6a2 2a0cbbb6e240 460MB
docker-platform-monitor latest 2a0cbbb6e240 460MB
docker-router-advertiser HEAD.600988-msft-2205-ndk-c854a6a2 6f17ed3df048 372MB
docker-router-advertiser latest 6f17ed3df048 372MB
docker-lldp HEAD.600988-msft-2205-ndk-c854a6a2 39dcda273ae9 381MB
docker-lldp latest 39dcda273ae9 381MB
docker-mux HEAD.600988-msft-2205-ndk-c854a6a2 3c503b426df1 384MB
docker-mux latest 3c503b426df1 384MB
docker-database HEAD.600988-msft-2205-ndk-c854a6a2 c87f15bd9176 372MB
docker-database latest c87f15bd9176 372MB
docker-sonic-telemetry HEAD.600988-msft-2205-ndk-c854a6a2 66ddf943a2fe 453MB
docker-sonic-telemetry latest 66ddf943a2fe 453MB
docker-nat HEAD.600988-msft-2205-ndk-c854a6a2 0ca4b951be64 322MB
docker-nat latest 0ca4b951be64 322MB
docker-sflow HEAD.600988-msft-2205-ndk-c854a6a2 13b652bc661e 320MB
docker-sflow latest 13b652bc661e 320MB
docker-sonic-mgmt-framework HEAD.600988-msft-2205-ndk-c854a6a2 f8f1ca49f557 449MB
docker-sonic-mgmt-framework latest f8f1ca49f557 449MB
Additional comments
- sfputil should not fail with a traceback when platform specific code returns value None (as prescribed by sfp_base.py) from the read_eeprom method because a module read operation failed.
- Some measure of synchronization with Xcvrd is warranted, so that Xcvrd threads do not attempt to access a module that is, in parallel, having the 0109h: Run Image command executed against it (and is thus being reset).