Update for the procedures for insertion/hot swap of Switch Fabric Module(SFM) by using "config chassis modules shutdown/startup" commands

Open JunhongMao opened this issue 10 months ago • 3 comments

Why I did it

On the Nokia SONiC chassis, the procedure for insertion/hot swap of a Switch Fabric Module (SFM) previously used the command below.

sudo nokia_cmd set shutdown-sfm <SFM-Num/Physical-Slot>

This PR, together with the PR linked here, adds the commands below for the equivalent operations: https://github.com/nokia/sonic-platform/pull/6

sudo config chassis modules shutdown/startup <module name>

Work item tracking
  • Microsoft ADO (number only):

How I did it

  1. Add chassis_module_config.py and its systemd service (chassis-module.service). The service starts automatically. Example status output:
sudo systemctl status chassis-module.service
● chassis-module.service - Chassis module up & down operation
     Loaded: loaded (/lib/systemd/system/chassis-module.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Fri 2024-04-05 19:57:25 UTC; 1h 5min ago
   Main PID: 8856 (python3)
      Tasks: 1 (limit: 38314)
     Memory: 16.2M
     CGroup: /system.slice/chassis-module.service
             └─8856 /usr/bin/python3 /usr/local/bin/chassis_module_config.py

Apr 05 19:57:25 ixre-cpm-chassis15 systemd[1]: Started Chassis module up & down operation.
  2. When the CLI command "sudo config chassis modules startup/shutdown" runs, chassis_module_set_admin_state.py is called to perform the related operations.
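
The dispatch step in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR. It assumes that a shut-down module appears in the CHASSIS_MODULE table of CONFIG_DB with admin_status set to "down" and that a missing entry means "up". The set_admin_state callback is a hypothetical stand-in for invoking chassis_module_set_admin_state.py.

```python
from typing import Callable, Dict, Optional


def desired_admin_state(entry: Optional[Dict[str, str]]) -> str:
    """Return 'down' only when the entry explicitly asks for it.

    Assumption: a deleted or empty CHASSIS_MODULE entry means the module
    should be up.
    """
    if entry and entry.get("admin_status") == "down":
        return "down"
    return "up"


def on_chassis_module_change(
    name: str,
    entry: Optional[Dict[str, str]],
    set_admin_state: Callable[[str, str], None],
) -> str:
    """Handle one CHASSIS_MODULE change event for the module `name`.

    `set_admin_state` is a hypothetical hook; in the PR this role is
    played by chassis_module_set_admin_state.py.
    """
    state = desired_admin_state(entry)
    set_admin_state(name, state)
    return state


if __name__ == "__main__":
    # Record the calls instead of touching real hardware.
    calls = []
    record = lambda module, state: calls.append((module, state))
    on_chassis_module_change("FABRIC-CARD3", {"admin_status": "down"}, record)
    on_chassis_module_change("FABRIC-CARD3", None, record)
    print(calls)
```

Keeping the decision logic in a pure function makes it possible to test without a running CONFIG_DB. In a real service the handler would be registered with whatever table subscription mechanism the platform uses.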

How to verify it

The test below was carried out on the FABRIC-CARD3 module on the supervisor card.
1. Shutdown
sudo config chassis modules shutdown FABRIC-CARD3

2. Check the status to confirm that FABRIC-CARD3 is down.
$ show chassis modules status
        Name             Description    Physical-Slot    Oper-Status    Admin-Status       Serial
------------  ----------------------  ---------------  -------------  --------------  -----------
...
FABRIC-CARD3             Unavailable                4          Empty            down          N/A

 
3. Start up the module
sudo config chassis modules startup FABRIC-CARD3

4. Check the status
$ show chassis modules status
        Name             Description    Physical-Slot    Oper-Status    Admin-Status       Serial
------------  ----------------------  ---------------  -------------  --------------  -----------
...
FABRIC-CARD3                    SFM4                4         Online              up  01214400362

5. Test whether the setting persists across a system reboot. For example, shut down the module first;
after saving the config and rebooting, the module should remain shut down.
$ sudo config save
Existing files will be overwritten, continue? [y/N]: y

Then check the status to confirm that FABRIC-CARD3 is down.
$ show chassis modules status
        Name             Description    Physical-Slot    Oper-Status    Admin-Status       Serial
------------  ----------------------  ---------------  -------------  --------------  -----------
...
FABRIC-CARD3             Unavailable                4          Empty            down          N/A
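
The manual checks above could be automated along these lines. This is only a sketch: it parses text shaped like the "show chassis modules status" output shown in this PR, and it assumes the last four columns (slot, oper status, admin status, serial) never contain spaces. The function names are illustrative, not part of any SONiC tool.

```python
from typing import Dict

SAMPLE = """\
        Name             Description    Physical-Slot    Oper-Status    Admin-Status       Serial
------------  ----------------------  ---------------  -------------  --------------  -----------
...
FABRIC-CARD3             Unavailable                4          Empty            down          N/A
"""


def parse_module_status(output: str) -> Dict[str, Dict[str, str]]:
    """Parse the status table into {module name: {column: value}}."""
    rows: Dict[str, Dict[str, str]] = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the "..." elision, the header and the separator.
        if len(fields) < 6 or fields[0] == "Name" or set(fields[0]) == {"-"}:
            continue
        # Parse from the right so a multi-word description still works.
        slot, oper, admin, serial = fields[-4:]
        rows[fields[0]] = {
            "description": " ".join(fields[1:-4]),
            "physical_slot": slot,
            "oper_status": oper,
            "admin_status": admin,
            "serial": serial,
        }
    return rows


def is_admin_down(output: str, module: str) -> bool:
    """True if `module` is listed with Admin-Status 'down'."""
    row = parse_module_status(output).get(module)
    return row is not None and row["admin_status"] == "down"


if __name__ == "__main__":
    print(is_admin_down(SAMPLE, "FABRIC-CARD3"))
```

In a test script the output string would come from running the show command on the supervisor, for example through subprocess, rather than from the embedded sample.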


Which release branch to backport (provide reason below if selected)

  • [ ] 201811
  • [ ] 201911
  • [ ] 202006
  • [ ] 202012
  • [ ] 202106
  • [ ] 202111
  • [x] 202205
  • [x] 202211
  • [x] 202305

Tested branch (Please provide the tested image version)

  • [x] 202205

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

JunhongMao avatar Apr 05 '24 21:04 JunhongMao

@arlakshm @judyjoseph For the SFM module shutdown/startup process, we need to create a chassis_module_config.service that runs chassis_module_config.py to subscribe to and listen on the CHASSIS_MODULE table in CONFIG_DB. In 202205, this service requires, and must start after, updategraph.service. However, updategraph.service has been replaced by config-setup.service in the master branch. We have now created a PR against master with the 202205 backport box checked. Should we still use updategraph.service in the PR and fix it after the 202205 cherry-pick?

mlok-nokia avatar Apr 08 '24 14:04 mlok-nokia

@JunhongMao I understand that this PR and https://github.com/nokia/sonic-platform/pull/6 aim to put the shutdown/startup of the SFM and the swss/syncd processes in the Nokia platform submodule.

Can we make this a bit more generic? For example, when the user issues "sudo config chassis modules shutdown FABRIC-CARD3", the implementation in sonic-utilities could start/stop the swss/syncd systemd services and call the Nokia platform API to power the corresponding card up/down.

This way, the command has a common SONiC implementation with a platform hook that actually powers the SFM up/down.

judyjoseph avatar Apr 09 '24 06:04 judyjoseph

Hi Judy, there are two reasons why we need a service that subscribes to the CHASSIS_MODULE table to shut down/start up an SFM and its related swss/syncd services. First, when a user shuts down an SFM, saves the config, and then reboots the chassis, we need to keep the SFM and its swss/syncd services in the down state, based on the configuration, while the chassis boots and loads the config. Second, the number of swss/syncd instances associated with a particular SFM module can differ between vendors. That is why we think letting the vendor API shut down/start up the SFM card and its related swss/syncd services is more flexible and straightforward.
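
To illustrate the two points, here is a rough sketch of a boot-time pass over the saved configuration. It is not code from the PR. The SFM-to-ASIC mapping is hypothetical and vendor specific (the numbers below are made up), and the swss@N/syncd@N unit naming is an assumption.

```python
from typing import Dict, List

# Hypothetical vendor mapping: ASIC instances hosted on each fabric card.
# Real values depend on the platform; these are placeholders.
SFM_TO_ASICS: Dict[str, List[int]] = {
    "FABRIC-CARD3": [6, 7],
}


def services_to_keep_down(
    chassis_module_table: Dict[str, Dict[str, str]],
    sfm_to_asics: Dict[str, List[int]],
) -> List[str]:
    """List the per-ASIC services that must stay stopped after boot.

    `chassis_module_table` stands for the CHASSIS_MODULE table loaded
    from the saved config; only entries with admin_status 'down' count.
    """
    services: List[str] = []
    for name in sorted(chassis_module_table):
        if chassis_module_table[name].get("admin_status") != "down":
            continue
        for asic in sfm_to_asics.get(name, []):
            # Assumed unit naming for multi-ASIC service instances.
            services.append(f"swss@{asic}.service")
            services.append(f"syncd@{asic}.service")
    return services


if __name__ == "__main__":
    saved = {"FABRIC-CARD3": {"admin_status": "down"}}
    print(services_to_keep_down(saved, SFM_TO_ASICS))
```

Because only the mapping differs between vendors, the same pass can run unchanged on another platform as long as that platform supplies its own table.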

Should we have a call to talk about this? Thanks.

mlok-nokia avatar Apr 09 '24 15:04 mlok-nokia

@JunhongMao and @mlok-nokia, as discussed offline, the PR will be updated with the latest proposal.

arlakshm avatar Apr 19 '24 19:04 arlakshm

This PR (https://github.com/sonic-net/sonic-buildimage/pull/18578) has been replaced by the new PRs below:

https://github.com/nokia/sonic-platform/pull/6
https://github.com/sonic-net/sonic-utilities/pull/3283
https://github.com/sonic-net/sonic-platform-daemons/pull/475

JunhongMao avatar Apr 23 '24 00:04 JunhongMao