Backup STATE_DB PORT_TABLE|Ethernet during warm-reboot

Open mihirpat1 opened this issue 1 year ago • 2 comments

What I did

Currently, the entire PORT_TABLE in STATE_DB is deleted during warm-reboot. As a result, host_tx_ready changes to false after warm-reboot, which causes the link to remain down.

How I did it

The host_tx_ready, NPU_SI_SETTINGS_SYNC_STATUS and CMIS_REINIT_REQUIRED fields from STATE_DB PORT_TABLE* are now backed up during warm-reboot.
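To illustrate the intended effect, here is a minimal Python sketch of the field selection. It is not the code from this PR: the helper names `split_port_entry` and `prune_port_table` are hypothetical, and the table is modelled as a plain dict instead of a live STATE_DB connection. Only the three field names come from the description above.

```python
# Fields named in this PR as the ones to keep across warm-reboot.
PRESERVED_FIELDS = (
    "host_tx_ready",
    "NPU_SI_SETTINGS_SYNC_STATUS",
    "CMIS_REINIT_REQUIRED",
)


def split_port_entry(fields, preserved=PRESERVED_FIELDS):
    """Split one PORT_TABLE hash into (fields to keep, field names to delete)."""
    keep = {name: value for name, value in fields.items() if name in preserved}
    drop = [name for name in fields if name not in preserved]
    return keep, drop


def prune_port_table(table, preserved=PRESERVED_FIELDS):
    """Return a copy of a {key: {field: value}} table with only preserved fields.

    Keys left with no preserved fields are omitted, mirroring how a Redis
    hash disappears once its last field is deleted.
    """
    pruned = {}
    for key, fields in table.items():
        keep, _ = split_port_entry(fields, preserved)
        if keep:
            pruned[key] = keep
    return pruned
```

Fields such as netdev_oper_status are intentionally absent from the preserved set, so they are dropped here and expected to be written again after boot-up.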

How to verify it

Verified that host_tx_ready in STATE_DB PORT_TABLE is retained after warm-reboot and the link remains up. Also, ensured that the keys CMIS_REINIT_REQUIRED and NPU_SI_SETTINGS_SYNC_STATUS are retained after warm-reboot.

Before warm-reboot

root@sonic:/home/admin# redis-cli -n 6 hgetall "PORT_TABLE|Ethernet0"
 1) "state"
 2) "ok"
 3) "netdev_oper_status"
 4) "up"
 5) "admin_status"
 6) "up"
 7) "mtu"
 8) "9100"
 9) "CMIS_REINIT_REQUIRED"
10) "false"
11) "NPU_SI_SETTINGS_SYNC_STATUS"
12) "NPU_SI_SETTINGS_DEFAULT"
13) "supported_speeds"
14) "40000,100000"
15) "supported_fecs"
16) "none,rs"
17) "host_tx_ready"
18) "true"
19) "speed"
20) "100000"
21) "fec"
22) "N/A"
root@sonic:/home/admin# 

After the warm-reboot script backs up PORT_TABLE and deletes unwanted fields

root@sonic:/home/admin# redis-cli -n 6 hgetall "PORT_TABLE|Ethernet0"
1) "CMIS_REINIT_REQUIRED"
2) "false"
3) "NPU_SI_SETTINGS_SYNC_STATUS"
4) "NPU_SI_SETTINGS_DEFAULT"
5) "host_tx_ready"
6) "true"
root@sonic:/home/admin# 

After switch boot-up post warm-reboot

root@sonic:/home/admin# redis-cli -n 6 hgetall "PORT_TABLE|Ethernet0"
 1) "state"
 2) "ok"
 3) "netdev_oper_status"
 4) "up"
 5) "admin_status"
 6) "up"
 7) "mtu"
 8) "9100"
 9) "supported_speeds"
10) "40000,100000"
11) "supported_fecs"
12) "none,rs"
13) "CMIS_REINIT_REQUIRED"
14) "false"
15) "NPU_SI_SETTINGS_SYNC_STATUS"
16) "NPU_SI_SETTINGS_DEFAULT"
17) "host_tx_ready"
18) "true"
19) "speed"
20) "100000"
21) "fec"
22) "N/A"
root@sonic:/home/admin# 
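The manual comparison above could be automated along these lines. This is a rough sketch, assuming the numbered `redis-cli hgetall` output format shown in the captures and values without embedded quotes; `parse_hgetall` and `changed_fields` are hypothetical helper names, not part of sonic-utilities.

```python
import re

# Matches one numbered reply line such as ` 1) "state"`.
_REPLY_LINE = re.compile(r'^\s*\d+\)\s+"(.*)"\s*$')


def parse_hgetall(output):
    """Turn numbered `redis-cli hgetall` output into a {field: value} dict."""
    items = []
    for line in output.splitlines():
        match = _REPLY_LINE.match(line)
        if match:  # shell prompts and blank lines are skipped
            items.append(match.group(1))
    # hgetall replies alternate field, value, field, value, ...
    return dict(zip(items[0::2], items[1::2]))


def changed_fields(before, after, fields):
    """Return the names in `fields` whose value differs between two snapshots."""
    return [name for name in fields if before.get(name) != after.get(name)]
```

Feeding the "before" and "after boot-up" captures through `parse_hgetall` and calling `changed_fields` with the three preserved field names should give an empty list when the fix works as described.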

mihirpat1 avatar Jan 05 '24 03:01 mihirpat1

Just curious, what was the reason to not back up the entire table? Is it because some of the fields (e.g. netdev_oper_status ) should be re-populated after warm-reboot?

longhuan-cisco avatar Feb 02 '24 02:02 longhuan-cisco

@longhuan-cisco - Yes, you are correct. Hence, we decided to preserve selected fields which xcvrd/OA cares about and delete other fields from STATE_DB.

mihirpat1 avatar Feb 02 '24 02:02 mihirpat1

As discussed, I tested the change from this PR: host_tx_ready is retained properly after warm-reboot and the link stays up (especially for CMIS modules).

@mihirpat1 @prgeor Could you please continue with the remaining work on this PR?

root@t0-dut:/home/cisco# show reboot-cause history
Name                 Cause        Time                             User    Comment
-------------------  -----------  -------------------------------  ------  ---------
2024_05_22_07_52_46  warm-reboot  Wed May 22 07:49:46 UTC 2024     cisco   N/A
...

May 22 07:55:26.154419 cmono-t0-dut NOTICE pmon#xcvrd[27]: XCVRD INIT: Wait for port config is done
May 22 07:55:26.156638 cmono-t0-dut NOTICE pmon#xcvrd[27]: XCVRD INIT: After port config is done
May 22 07:55:26.183632 cmono-t0-dut NOTICE pmon#xcvrd[27]: Start daemon main loop with thread count 3
May 22 07:55:26.183632 cmono-t0-dut NOTICE pmon#xcvrd[27]: Started thread CmisManagerTask
May 22 07:55:26.183675 cmono-t0-dut NOTICE pmon#xcvrd[27]: Started thread DomInfoUpdateTask
May 22 07:55:26.183675 cmono-t0-dut NOTICE pmon#xcvrd[27]: Started thread SfpStateUpdateTask
...
May 22 07:55:26.198509 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet32 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198532 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet56 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198554 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet0 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'false', 'state': 'ok', 'netdev_oper_status': 'down', 'admin_status': 'down', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs'}
May 22 07:55:26.198577 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet16 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198601 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet128 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198618 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet72 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198643 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet120 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198661 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet192 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198689 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet200 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'false', 'state': 'ok', 'netdev_oper_status': 'down', 'admin_status': 'down', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs'}
May 22 07:55:26.198712 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet176 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
...
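For longer captures, the per-port host_tx_ready value can be pulled out of lines like these with a short script. This is only a sketch: it assumes the exact `$$$ <port> handle_port_update_event() : ... fvp {...}` layout seen in the excerpt above, which may be specific to this test setup, and `host_tx_ready_by_port` is a hypothetical helper name.

```python
import ast
import re

# Captures the port name and the trailing fvp dict from one log line.
_EVENT = re.compile(
    r"\$\$\$ (\S+) handle_port_update_event\(\) : .*? fvp (\{.*\})\s*$"
)


def host_tx_ready_by_port(log_lines):
    """Map each port to the host_tx_ready value reported in its update event."""
    result = {}
    for line in log_lines:
        match = _EVENT.search(line)
        if not match:
            continue  # not a port update event line
        port, fvp_text = match.groups()
        # The fvp payload is printed as a Python dict literal in these logs.
        fvp = ast.literal_eval(fvp_text)
        result[port] = fvp.get("host_tx_ready")
    return result
```

In the excerpt, the only ports reporting host_tx_ready as false (Ethernet0 and Ethernet200) also show admin_status down, which matches the statement that the value is retained for ports that were up.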

longhuan-cisco avatar May 30 '24 22:05 longhuan-cisco

@StormLiangMS @yxieca @bingwang-ms please cherry-pick this to 202311. It is needed for warm-reboot support on platforms using CMIS optics.

prgeor avatar May 30 '24 23:05 prgeor

Cherry-pick PR to 202311: https://github.com/sonic-net/sonic-utilities/pull/3352

mssonicbld avatar Jun 03 '24 16:06 mssonicbld

@bingwang-ms we need this in 202405

prgeor avatar Jun 18 '24 23:06 prgeor

@prgeor It seems there is a cherry-pick conflict. Please double-check.

bingwang-ms avatar Jun 18 '24 23:06 bingwang-ms

@bingwang-ms I have removed the 202405 tags since this is already part of 202405.

mihirpat1 avatar Jun 18 '24 23:06 mihirpat1