sonic-buildimage icon indicating copy to clipboard operation
sonic-buildimage copied to clipboard

'sfputil firmware run' cmd needs better resilience and synchronization with PMON Xcvrd

Open snider-nokia opened this issue 1 year ago • 10 comments

Description

When 'sfputil firmware run ' command is invoked it causes the target transceiver module to reset, however Xcvrd is unaware that said operation is taking place so its threads may/will continue to attempt to access the module during the window of time prior to the module being restored to normal operational status.

Steps to reproduce the issue

  1. For the simplest case, ensure the targeted interface has a CMIS module present (which properly supports FW upgrade procedure) and that the applicable interface is admin down (interface can also be set to admin up, which is even more likely to cause the problem to occur).
  2. Issue the 'sfputil firmware run ' command to induce the target transceiver module to reset and begin running its inactive (or active, when supported) firmware load.
  3. Wait for interface to stabilize and then run the command again until such time as the 'sfputil firmware run ' command fails with a traceback.

Describe the results you received

Exactly what is described in paragraph 3 above.

Describe the results you expected

Optimally, transceiver module accesses would/should not fail after this command is issued. But, there is no guarantee that accesses to the module will complete successfully until such time as the module stabilizes post-reset.

Additional information you deem important

Two detailed annotated samples are provided below to show the progression of events involved here.

In the first sample, the 'sfputil firmware run' command appears to work fine and indicates a successful completion status. Even in this case, though, PMON Xcvrd is experiencing failures when simultaneously attempting to access the protagonist module.

In the second sample, the 'sfputil firmware run' command fails with a traceback when it issues a module read operation that doesn't complete successfully and the platform specific code returns value None (as specified by sfp_base.py, and due to the failed access).

PMON Xcvrd threads should not be attempting to access a module that has this command issued to it until such time as the module is understood to be operating normally again (and is prepared to sink accesses). As the transceiver subsystem architecture stands now, these Xcvrd threads may try to provision/de-provision the module datapath, solicit DDM/DOM data, or interact with the module in other ways.

Investigation and sample runs were conducted using Acacia ZR module target with the FW versions shown (at interface Ethernet80): image

CMIS spec indicates that module behavior during the associated 'Run FW Image' reset is 'vendor and technology dependent', thus there can be no assumption that module can be accessed prior to quiescing post-reset. Spec further indicates that CMD 0041h: 'Firmware Management Features' is (should be) used to query firmware command performance attributes (for example, how long it may take maximally to execute commands).

image

Consistent with the above, the Acacia ZR/ZR+ documentation that we have here states that 'Before issuing/using FW download [including 0109h: Run Image], the host should issue CMD 0041h to familiarize itself with the features supported, and in particular the max timeout values.'

It may be necessary to engage module vendor(s) in order to understand the specific access restrictions in this area (during the associated 'Run Image' reset period), as it would appear that module behavior can/does change with different FW versions and also when contrasting non-hitless with hitless upgrade.

1st annotated sample run:

image

Sfputil firmware run cmd and xcvrd dysfunction sample 1

2nd annotated sample run:

image

Sfputil firmware run cmd and xcvrd dysfunction sample 2

Output of show version

Using 202205 branch...

admin@ixre-egl-board40:~$ show version

SONiC Software Version: SONiC.HEAD.600988-msft-2205-ndk-c854a6a2
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-18-2-amd64
Build commit: c854a6a2
Build date: Mon Dec 18 22:37:27 UTC 2023
Built by: gitlab-runner@sonic-bld2

Platform: x86_64-nokia_ixr7250e_36x400g-r0
HwSKU: Nokia-IXR7250E-36x400G
ASIC: broadcom
ASIC Count: 2
Serial Number: EAG2-02-052
Model Number: N/A
Hardware Revision: 56
Uptime: 18:56:46 up 1 day,  2:49,  1 user,  load average: 1.14, 1.32, 1.36
Date: Wed 20 Dec 2023 18:56:46

Docker images:
REPOSITORY                    TAG                                  IMAGE ID       SIZE
docker-orchagent              HEAD.600988-msft-2205-ndk-c854a6a2   90241935b030   406MB
docker-orchagent              latest                               90241935b030   406MB
docker-fpm-frr                HEAD.600988-msft-2205-ndk-c854a6a2   df6176897cc2   418MB
docker-fpm-frr                latest                               df6176897cc2   418MB
docker-teamd                  HEAD.600988-msft-2205-ndk-c854a6a2   a2d7a8d56c83   389MB
docker-teamd                  latest                               a2d7a8d56c83   389MB
docker-macsec                 latest                               5394fbb21224   391MB
docker-syncd-brcm-dnx         HEAD.600988-msft-2205-ndk-c854a6a2   6353695a531f   718MB
docker-syncd-brcm-dnx         latest                               6353695a531f   718MB
docker-gbsyncd-broncos        HEAD.600988-msft-2205-ndk-c854a6a2   d9ed266637ba   419MB
docker-gbsyncd-broncos        latest                               d9ed266637ba   419MB
docker-gbsyncd-credo          HEAD.600988-msft-2205-ndk-c854a6a2   ca22f0ad248b   392MB
docker-gbsyncd-credo          latest                               ca22f0ad248b   392MB
docker-dhcp-relay             latest                               f4f260277d3f   380MB
docker-snmp                   HEAD.600988-msft-2205-ndk-c854a6a2   21e34abe3852   422MB
docker-snmp                   latest                               21e34abe3852   422MB
docker-platform-monitor       HEAD.600988-msft-2205-ndk-c854a6a2   2a0cbbb6e240   460MB
docker-platform-monitor       latest                               2a0cbbb6e240   460MB
docker-router-advertiser      HEAD.600988-msft-2205-ndk-c854a6a2   6f17ed3df048   372MB
docker-router-advertiser      latest                               6f17ed3df048   372MB
docker-lldp                   HEAD.600988-msft-2205-ndk-c854a6a2   39dcda273ae9   381MB
docker-lldp                   latest                               39dcda273ae9   381MB
docker-mux                    HEAD.600988-msft-2205-ndk-c854a6a2   3c503b426df1   384MB
docker-mux                    latest                               3c503b426df1   384MB
docker-database               HEAD.600988-msft-2205-ndk-c854a6a2   c87f15bd9176   372MB
docker-database               latest                               c87f15bd9176   372MB
docker-sonic-telemetry        HEAD.600988-msft-2205-ndk-c854a6a2   66ddf943a2fe   453MB
docker-sonic-telemetry        latest                               66ddf943a2fe   453MB
docker-nat                    HEAD.600988-msft-2205-ndk-c854a6a2   0ca4b951be64   322MB
docker-nat                    latest                               0ca4b951be64   322MB
docker-sflow                  HEAD.600988-msft-2205-ndk-c854a6a2   13b652bc661e   320MB
docker-sflow                  latest                               13b652bc661e   320MB
docker-sonic-mgmt-framework   HEAD.600988-msft-2205-ndk-c854a6a2   f8f1ca49f557   449MB
docker-sonic-mgmt-framework   latest                               f8f1ca49f557   449MB

Additional comments

  • sfputil should not fail with a traceback when/if platform specific code returns value None (as prescribed by sfp_base.py) from read_eeprom method when a module read operation fails.

  • There is some measure of synchronization warranted with Xcvrd whereby Xcvrd threads are not attempting to access a module which is in parallel having this 109h: Run Image command executed against it (and is thus being reset).

snider-nokia avatar Dec 20 '23 19:12 snider-nokia