sonic-mgmt icon indicating copy to clipboard operation
sonic-mgmt copied to clipboard

Add extra logic in test_sfputil to make sure modules are restored

Open longhuan-cisco opened this issue 1 year ago • 14 comments

Description of PR

Summary: Fixes # (issue)

Add extra logic (common for both CMIS and non-CMIS) in test_sfputil to make sure modules are restored fully after sfp reset, this is especially helpful for restoring CMIS modules which requires config/CMIS_state_machine to be replayed after sfp reset.

Type of change

  • [ ] Bug fix
  • [ ] Testbed and Framework(new/improvement)
  • [ ] Test case(new/improvement)

Back port request

  • [ ] 201911
  • [ ] 202012
  • [ ] 202205
  • [ ] 202305
  • [ ] 202311

Approach

What is the motivation for this PR?

How did you do it?

How did you verify/test it?

Verified the testcase passed on setup which has 400G modules, and verified syslog the config/CMIS state machine was replayed properly after reset, modules were fully restored.

root@cmono-t0-dut:/home/cisco# show int trans info Ethernet8
Ethernet8: SFP EEPROM detected
...
        Application Advertisement: 400GAUI-8 C2M (Annex 120E) - Host Assign (0x1) - 400GBASE-DR4 (Cl 124) - Media Assign (0x1)
                                   100GAUI-2 C2M (Annex 135G) - Host Assign (0x55) - 100GBASE-DR (Cl 140) - Media Assign (0xf)
                                   200GAUI-4 C2M (Annex 120E) - Host Assign (0x11) - Undefined - Media Assign (0x5)
        CMIS Rev: 4.0
        Connector: MPO 1x12
        Encoding: N/A
        Extended Identifier: Power Class 4 (8.0W Max)
        Extended RateSelect Compliance: N/A
        Host Lane Count: 8
        Identifier: QSFP-DD Double Density 8X Pluggable Transceiver
...
platform_tests/sfp/test_sfputil.py::test_check_sfputil_reset[cmono-t0-dut-None] PASSED                                  [100%]

====================================================== warnings summary =======================================================
../../usr/local/lib/python3.8/dist-packages/paramiko/transport.py:236
  /usr/local/lib/python3.8/dist-packages/paramiko/transport.py:236: CryptographyDeprecationWarning: Blowfish has been deprecated
    "class": algorithms.Blowfish,

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
------------------------------- generated xml file: /data/tests/logs/tr_2024-06-03-18-13-32.xml -------------------------------
========================================== 1 passed, 1 warning in 1434.78s (0:23:54) ==========================================
INFO:root:Can not get Allure report URL. Please check logs

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

longhuan-cisco avatar Jun 03 '24 19:06 longhuan-cisco

The pre-commit check detected issues in the files touched by this pull request. The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results: trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

tests/platform_tests/sfp/test_sfputil.py:208:13: E125 continuation line with same indent as next logical line

flake8...............................................(no files to check)Skipped
check conditional mark sort..........................(no files to check)Skipped

To run the pre-commit checks locally, you can follow below steps:

  1. Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  1. Go to repository root folder
  2. Install the pre-commit hooks:
pre-commit install
  1. Use pre-commit to check staged file:
pre-commit
  1. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

mssonicbld avatar Jun 03 '24 19:06 mssonicbld

The pre-commit check detected issues in the files touched by this pull request. The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results: trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

tests/platform_tests/sfp/test_sfputil.py:18:1: F811 redefinition of unused 'wait_until' from line 16

flake8...............................................(no files to check)Skipped
check conditional mark sort..........................(no files to check)Skipped

To run the pre-commit checks locally, you can follow below steps:

  1. Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  1. Go to repository root folder
  2. Install the pre-commit hooks:
pre-commit install
  1. Use pre-commit to check staged file:
pre-commit
  1. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

mssonicbld avatar Jun 11 '24 06:06 mssonicbld

Hi @mihirpat1 , there are some questions left in the PR, can you have a look?

wsycqyz avatar Jun 25 '24 05:06 wsycqyz

Hi @mihirpat1 , there are some questions left in the PR, can you have a look?

@wsycqyz Can you please share the open questions for me?

mihirpat1 avatar Jun 25 '24 06:06 mihirpat1

The below two pre-commit check failures are due to baseline, because they started to happen right after merging master to local branch https://github.com/sonic-net/sonic-mgmt/pull/13108/commits/b0dd014f6bba79df0a9384e703499b1907b4656b: https://github.com/sonic-net/sonic-mgmt/pull/13108/checks?check_run_id=27298781282 https://github.com/sonic-net/sonic-mgmt/pull/13108/checks?check_run_id=27298780950

longhuan-cisco avatar Jul 11 '24 00:07 longhuan-cisco

@kevinwangsk please help with review / approval. Thanks!

jamesan47 avatar Jul 11 '24 21:07 jamesan47

@longhuan-cisco Can you please test the changes on a device with breakout cable and ensure all subports are linked up after one of the subport has gone through reset. Also, please ensure that the DOM config is disabled only for the first subport of a breakout group.

@mihirpat1 Updated code to only disable DOM for first subport of breakout group, and tested it.

longhuan-cisco avatar Aug 26 '24 08:08 longhuan-cisco

The pre-commit sanity check failure Azure.sonic-mgmt (Test onboarding t1 testcases by Elastictest - optional) seems to be some baseline issue, also saw it on other PRs': https://github.com/sonic-net/sonic-mgmt/pull/14474/checks?check_run_id=29863617790 https://github.com/sonic-net/sonic-mgmt/pull/14472/checks?check_run_id=29853135820

longhuan-cisco avatar Sep 09 '24 17:09 longhuan-cisco

@prgeor for viz @longhuan-cisco I am seeing that after executing the test_check_sfputil_reset testcase individually, a config reload is being triggered due to a failure seen while executing core_dump_and_config_check. Are you observing a similar behavior too?

Based on my understanding, this config reload is being triggered due to config_db_check_pass being set to False. This is because the current running_config has "dom_polling":"enabled" key-value introduced to PORT table of CONFIG_DB since the testcase disables and enables DOM. By default, transceiver DOM polling is enabled but the dom_polling field may not be present in PORT table of CONFIG_DB.

@mihirpat1 Yes, I observed the same: testcase passed, but there's a WARNING on config_check due to that dom_polling=enabled added after restoring the port, and config reload is triggered as a result.

Hence, can we ensure that the config reload is not triggered if

  1. pre-running config does not have dom_polling field present
  2. AND the dom_polling field is set to enabled in running-config post testcase execution

To avoid the config reload: Were you suggesting to change the logic in conftest.py:core_dump_and_config_check() to exclude checking dom_polling field, if there's no dom_polling present in port config pre-test AND there's dom_polling=enabled in port config post-test? yea, this would work.

Alternatively, simpler approach might be (although a little bit "raw"): we may clear this filed by sonic-db-cli CONFIG_DB HDEL 'PORT|Ethernet0' 'dom_polling' in the end for those relevant interfaces.
what do you think?

longhuan-cisco avatar Sep 09 '24 20:09 longhuan-cisco

@prgeor for viz @longhuan-cisco I am seeing that after executing the test_check_sfputil_reset testcase individually, a config reload is being triggered due to a failure seen while executing core_dump_and_config_check. Are you observing a similar behavior too? Based on my understanding, this config reload is being triggered due to config_db_check_pass being set to False. This is because the current running_config has "dom_polling":"enabled" key-value introduced to PORT table of CONFIG_DB since the testcase disables and enables DOM. By default, transceiver DOM polling is enabled but the dom_polling field may not be present in PORT table of CONFIG_DB.

@mihirpat1 Yes, I observed the same: testcase passed, but there's a WARNING on config_check due to that dom_polling=enabled added after restoring the port, and config reload is triggered as a result.

Hence, can we ensure that the config reload is not triggered if

  1. pre-running config does not have dom_polling field present
  2. AND the dom_polling field is set to enabled in running-config post testcase execution

To avoid the config reload: Were you suggesting to change the logic in conftest.py:core_dump_and_config_check() to exclude checking dom_polling field, if there's no dom_polling present in port config pre-test AND there's dom_polling=enabled in port config post-test? yea, this would work.

Alternatively, simpler approach might be (although a little bit "raw"): we may clear this filed by sonic-db-cli CONFIG_DB HDEL 'PORT|Ethernet0' 'dom_polling' in the end for those relevant interfaces. what do you think?

@longhuan-cisco I think fixing in this section will be more gene

@prgeor for viz @longhuan-cisco I am seeing that after executing the test_check_sfputil_reset testcase individually, a config reload is being triggered due to a failure seen while executing core_dump_and_config_check. Are you observing a similar behavior too? Based on my understanding, this config reload is being triggered due to config_db_check_pass being set to False. This is because the current running_config has "dom_polling":"enabled" key-value introduced to PORT table of CONFIG_DB since the testcase disables and enables DOM. By default, transceiver DOM polling is enabled but the dom_polling field may not be present in PORT table of CONFIG_DB.

@mihirpat1 Yes, I observed the same: testcase passed, but there's a WARNING on config_check due to that dom_polling=enabled added after restoring the port, and config reload is triggered as a result.

Hence, can we ensure that the config reload is not triggered if

  1. pre-running config does not have dom_polling field present
  2. AND the dom_polling field is set to enabled in running-config post testcase execution

To avoid the config reload: Were you suggesting to change the logic in conftest.py:core_dump_and_config_check() to exclude checking dom_polling field, if there's no dom_polling present in port config pre-test AND there's dom_polling=enabled in port config post-test? yea, this would work.

Alternatively, simpler approach might be (although a little bit "raw"): we may clear this filed by sonic-db-cli CONFIG_DB HDEL 'PORT|Ethernet0' 'dom_polling' in the end for those relevant interfaces. what do you think?

@longhuan-cisco We can proceed with the alternative aproach (clearing the dom_polling field after enabling DOM)

mihirpat1 avatar Sep 09 '24 23:09 mihirpat1

@mihirpat1 can you review

prgeor avatar Oct 17 '24 15:10 prgeor

@mihirpat1 can you review

@longhuan-cisco Can you please help in addressing the below per our discussion? https://github.com/sonic-net/sonic-mgmt/pull/13108#discussion_r1785387333

mihirpat1 avatar Oct 17 '24 15:10 mihirpat1

@longhuan-cisco Can you please help in addressing the below per our discussion? #13108 (comment)

@mihirpat1 I addressed as what we discussed.

And I also added logic to skip config interface shudown/startup for already admin-down ports, because of two reasons: 1) no need to shutdown already admin-down ports; 2) to avoid unexpectedly startup pre-test admin-down port, which can cause config check failed. This skip is especially needed, on certain test topologies/configurations, where some of the ports in admin-down state always:

     Interface                                    Lanes    Speed    MTU    FEC    Alias            Vlan    Oper    Admin                                             Type    Asym PFC
--------------  ---------------------------------------  -------  -----  -----  -------  --------------  ------  -------  -----------------------------------------------  ----------
     Ethernet0  2304,2305,2306,2307,2308,2309,2310,2311     400G   9100    N/A     etp0          routed    down     down  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet8  2320,2321,2322,2323,2324,2325,2326,2327     400G   9100    N/A     etp1           trunk      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet16  2312,2313,2314,2315,2316,2317,2318,2319     400G   9100    N/A     etp2           trunk      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
...
   Ethernet184                          0,1,2,3,4,5,6,7     400G   9100    N/A    etp23           trunk      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
   Ethernet192          256,257,258,259,260,261,262,263     400G   9100    N/A    etp24           trunk      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
   Ethernet200                    8,9,10,11,12,13,14,15     400G   9100    N/A    etp25          routed    down     down  QSFP-DD Double Density 8X Pluggable Transceiver         off
   Ethernet208  1024,1025,1026,1027,1028,1029,1030,1031     400G   9100    N/A    etp26          routed    down     down  QSFP-DD Double Density 8X Pluggable Transceiver         off
   Ethernet216          768,769,770,771,772,773,774,775     400G   9100    N/A    etp27          routed    down     down  QSFP-DD Double Density 8X Pluggable Transceiver         off
...

longhuan-cisco avatar Oct 28 '24 23:10 longhuan-cisco

@prgeor, Mihir has reviewed the PR. Can u pls merge this.

bmridul avatar Nov 04 '24 19:11 bmridul

@Junchao-Mellanox could you resolve the conversation if your comments are addressed?

kevinskwang avatar Nov 13 '24 01:11 kevinskwang

@longhuan-cisco PR conflicts with 202405 branch

mssonicbld avatar Nov 13 '24 03:11 mssonicbld

@longhuan-cisco pls address the conflicts on 202405.

kevinskwang avatar Nov 14 '24 00:11 kevinskwang

@kevinskwang I've addressed conflicts and raised cherry-pick PR already https://github.com/sonic-net/sonic-mgmt/pull/15543

longhuan-cisco avatar Nov 14 '24 01:11 longhuan-cisco

@longhuan-cisco Hi, we need this changes also for 202311 branch, can you please also make PR for 202311 ?

eyakubch avatar May 21 '25 14:05 eyakubch

@longhuan-cisco Hi, we need this changes also for 202311 branch, can you please also make PR for 202311 ?

@eyakubch Could you please check with them if it's OK to add Request for 202311 branch tag? Normally that tag is needed for raising cherry-pick.

longhuan-cisco avatar May 28 '25 00:05 longhuan-cisco

@longhuan-cisco whom I need to ask ?

eyakubch avatar May 28 '25 08:05 eyakubch