
Enhance log processing for sonic neighbors

Open · andywongarista opened this pull request 1 year ago • 22 comments

Description of PR

Summary: Fixes # (issue)

Type of change

  • [ ] Bug fix
  • [ ] Testbed and Framework(new/improvement)
  • [x] Test case(new/improvement)

Back port request

  • [ ] 202012
  • [ ] 202205
  • [ ] 202305
  • [x] 202311
  • [x] 202405

Approach

What is the motivation for this PR?

Current processing involves iterating through sonic logs (which can be large) from the beginning, which is unnecessary since only log lines starting from a particular timestamp are relevant.

How did you do it?

Optimize this processing by doing it in reverse and stopping after the last relevant timestamp.
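The reverse scan can be sketched as follows. This is a minimal illustration, assuming log lines prefixed with an ISO-8601 timestamp; the `tail_since` name and the parsing details are hypothetical, not the PR's actual code:

```python
from datetime import datetime

def tail_since(lines, start_time):
    """Walk the log newest-first and stop at the first line older than
    start_time, so a large file is never scanned from the beginning."""
    relevant = []
    for line in reversed(lines):
        # Assumption: each line starts with an ISO-8601 timestamp,
        # e.g. "2024-07-08T19:45:20 some message"
        stamp = datetime.fromisoformat(line.split(" ", 1)[0])
        if stamp < start_time:
            break  # every remaining line predates the window of interest
        relevant.append(line)
    relevant.reverse()  # restore chronological order for the caller
    return relevant
```

For a log where only the tail is relevant, this touches just the trailing lines plus one older line, instead of the whole file.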

How did you verify/test it?

Ran test_upgrade_path with SONiC neighbors and verified in warm-reboot.log that SSH threads no longer hang unnecessarily long due to log processing.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

andywongarista avatar Jun 15 '24 00:06 andywongarista

The pre-commit check detected issues in the files touched by this pull request. The pre-commit check is mandatory; please fix the detected issues.

Detailed pre-commit check results: trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

ansible/roles/test/files/ptftests/sonic.py:184:32: E201 whitespace after '('
ansible/roles/test/files/ptftests/sonic.py:237:5: E303 too many blank lines (2)

flake8...............................................(no files to check)Skipped
check conditional mark sort..........................(no files to check)Skipped

To run the pre-commit checks locally, you can follow the steps below:

  1. Ensure that the default python is python3. In the sonic-mgmt docker container, the default python is python2. You can run the check by activating a python3 virtual environment inside the sonic-mgmt docker container, or run it outside of the container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  3. Go to the repository root folder.
  4. Install the pre-commit hooks:
pre-commit install
  5. Use pre-commit to check staged files:
pre-commit
  6. Alternatively, check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

mssonicbld avatar Jun 15 '24 00:06 mssonicbld

Would you please fix the pre-commit check?

wsycqyz avatar Jun 17 '24 01:06 wsycqyz

/azpw run Azure.sonic-mgmt

andywongarista avatar Jun 18 '24 17:06 andywongarista

/AzurePipelines run Azure.sonic-mgmt

mssonicbld avatar Jun 18 '24 17:06 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Jun 18 '24 17:06 azure-pipelines[bot]

@saiarcot895 FYI

wsycqyz avatar Jun 20 '24 02:06 wsycqyz

Seeing a failure in test_advanced_reboot.py::test_warm_reboot in the Azure pipeline due to a KeyError, but locally I am seeing the test fail with a TimeoutError, both with AND without my change. Is this the case on your end?

logs.tar.gz

andywongarista avatar Jun 22 '24 00:06 andywongarista

The failure in the pipeline is with sonic neighbors, but it looks like the logs from your run are with cEOS neighbors.

saiarcot895 avatar Jun 22 '24 05:06 saiarcot895

> The failure in the pipeline is with sonic neighbors, but it looks like the logs from your run are with cEOS neighbors.

Thanks, this TimeoutError was indeed due to accidentally running with cEOS neighbors. I reran with sonic neighbors and am seeing the test pass. logs.tar.gz

andywongarista avatar Jun 25 '24 18:06 andywongarista

/azp run Azure.sonic-mgmt

StormLiangMS avatar Jun 27 '24 04:06 StormLiangMS

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Jun 27 '24 04:06 azure-pipelines[bot]

hi @saiarcot895 any concern to approve?

StormLiangMS avatar Jun 27 '24 04:06 StormLiangMS

@StormLiangMS my main concern is that something is breaking the kvmtest-t0-sonic test case.

saiarcot895 avatar Jun 28 '24 21:06 saiarcot895

/azpw run Azure.sonic-mgmt

andywongarista avatar Jul 05 '24 22:07 andywongarista

/AzurePipelines run Azure.sonic-mgmt

mssonicbld avatar Jul 05 '24 22:07 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Jul 05 '24 22:07 azure-pipelines[bot]

/azpw run Azure.sonic-mgmt

andywongarista avatar Jul 08 '24 17:07 andywongarista

/AzurePipelines run Azure.sonic-mgmt

mssonicbld avatar Jul 08 '24 17:07 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Jul 08 '24 17:07 azure-pipelines[bot]

test-failure.log

vaibhavhd avatar Jul 09 '24 21:07 vaibhavhd

2024-07-08 19:45:20 : SSH thread VM=10.250.0.54: Output on stderr from command 'sudo cat /var/log/syslog{.1,}': 'cat: /var/log/syslog.1: No such file or directory'
2024-07-08 19:45:20 : SSH thread VM=10.250.0.53: Output on stderr from command 'sudo cat /var/log/syslog{.1,}': 'cat: /var/log/syslog.1: No such file or directory'
2024-07-08 19:45:20 : SSH thread VM=10.250.0.54: Output on stderr from command 'sudo cat /var/log/teamd.log{.1,}': 'cat: /var/log/teamd.log.1: No such file or directory'
2024-07-08 19:45:20 : SSH thread VM=10.250.0.53: Output on stderr from command 'sudo cat /var/log/teamd.log{.1,}': 'cat: /var/log/teamd.log.1: No such file or directory'
2024-07-08 19:45:21 : SSH thread VM=10.250.0.54: Log output ""
2024-07-08 19:45:21 : Error in HostDevice: Traceback (most recent call last):
  File "ptftests/py3/advanced-reboot.py", line 1603, in peer_state_check
    self.fails[ip], self.info[ip], self.cli_info[ip], self.logs_info[ip], self.lacp_pdu_times[ip] = ssh.run()
  File "ptftests/py3/sonic.py", line 188, in run
    log_data = self.parse_logs(parsed_logs)
  File "ptftests/py3/sonic.py", line 290, in parse_logs
    po_carrier_data = [k for k in data["teamd#teamd_PortChannel1"] if "carrier changed to" in k[1]]
KeyError: 'teamd#teamd_PortChannel1'

2024-07-08 19:45:21 : SSH thread for VM 10.250.0.54 finished
2024-07-08 19:45:21 : lacp_pdu_all_times: IP:10.250.0.54: []
2024-07-08 19:45:21 : Send    10 Received    10 ping DUT
2024-07-08 19:45:21 : SSH thread VM=10.250.0.53: Log output ""
2024-07-08 19:45:21 : Error in HostDevice: Traceback (most recent call last):
  File "ptftests/py3/advanced-reboot.py", line 1603, in peer_state_check
    self.fails[ip], self.info[ip], self.cli_info[ip], self.logs_info[ip], self.lacp_pdu_times[ip] = ssh.run()
  File "ptftests/py3/sonic.py", line 188, in run
    log_data = self.parse_logs(parsed_logs)
  File "ptftests/py3/sonic.py", line 290, in parse_logs
    po_carrier_data = [k for k in data["teamd#teamd_PortChannel1"] if "carrier changed to" in k[1]]
KeyError: 'teamd#teamd_PortChannel1'

2024-07-08 19:45:21 : SSH thread for VM 10.250.0.53 finished
2024-07-08 19:45:21 : lacp_pdu_all_times: IP:10.250.0.53: []
2024-07-08 19:45:21 : SSH thread VM=10.250.0.51: Output on stderr from command 'sudo cat /var/log/teamd.log{.1,}': 'cat: /var/log/teamd.log.1: No such file or directory'
2024-07-08 19:45:21 : SSH thread VM=10.250.0.52: Output on stderr from command 'sudo cat /var/log/teamd.log{.1,}': 'cat: /var/log/teamd.log.1: No such file or directory'
2024-07-08 19:45:21 : SSH thread VM=10.250.0.51: Log output ""
2024-07-08 19:45:21 : Error in HostDevice: Traceback (most recent call last):
  File "ptftests/py3/advanced-reboot.py", line 1603, in peer_state_check
    self.fails[ip], self.info[ip], self.cli_info[ip], self.logs_info[ip], self.lacp_pdu_times[ip] = ssh.run()
  File "ptftests/py3/sonic.py", line 188, in run
    log_data = self.parse_logs(parsed_logs)
  File "ptftests/py3/sonic.py", line 290, in parse_logs
    po_carrier_data = [k for k in data["teamd#teamd_PortChannel1"] if "carrier changed to" in k[1]]
KeyError: 'teamd#teamd_PortChannel1'

2024-07-08 19:45:21 : SSH thread for VM 10.250.0.51 finished
2024-07-08 19:45:21 : lacp_pdu_all_times: IP:10.250.0.51: []
2024-07-08 19:45:21 : SSH thread VM=10.250.0.52: Log output ""
2024-07-08 19:45:21 : Error in HostDevice: Traceback (most recent call last):
  File "ptftests/py3/advanced-reboot.py", line 1603, in peer_state_check
    self.fails[ip], self.info[ip], self.cli_info[ip], self.logs_info[ip], self.lacp_pdu_times[ip] = ssh.run()
  File "ptftests/py3/sonic.py", line 188, in run
    log_data = self.parse_logs(parsed_logs)
  File "ptftests/py3/sonic.py", line 290, in parse_logs
    po_carrier_data = [k for k in data["teamd#teamd_PortChannel1"] if "carrier changed to" in k[1]]
KeyError: 'teamd#teamd_PortChannel1'

vaibhavhd avatar Jul 09 '24 21:07 vaibhavhd

Next step - please try to run this on a local KVM testbed.

vaibhavhd avatar Jul 09 '24 21:07 vaibhavhd
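For reference, the KeyError in the traceback above is raised because `parse_logs` indexes `data["teamd#teamd_PortChannel1"]` directly, and that key is absent from the parsed log data. A hedged sketch of a tolerant lookup (the `carrier_changes` helper is illustrative, not the PR's code) that returns an empty result instead of raising:

```python
def carrier_changes(data, key="teamd#teamd_PortChannel1"):
    """Return 'carrier changed to' entries for a teamd log source.

    data maps log source names to lists of (timestamp, message) tuples.
    If the key is missing (e.g. no teamd lines survived log processing),
    return an empty list rather than raising KeyError.
    """
    return [k for k in data.get(key, []) if "carrier changed to" in k[1]]
```

Whether an empty result should be treated as a pass or a failure is a separate question; the sketch only shows how the crash itself can be avoided.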

Latest failure:

==================================================
2024-07-16 17:43:11 : Report:
2024-07-16 17:43:11 : ==================================================
2024-07-16 17:43:11 : LACP/BGP were down for (extracted from cli):
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : Extracted from VM logs:
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : Summary:
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : Fails:
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : FAILED:dut:Traceback (most recent call last):
  File "ptftests/py3/advanced-reboot.py", line 407, in timeout
    res = async_res.get(timeout=seconds)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 653, in get
    raise TimeoutError
multiprocessing.context.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ptftests/py3/advanced-reboot.py", line 1344, in runTest
    self.wait_until_control_plane_down()
  File "ptftests/py3/advanced-reboot.py", line 1020, in wait_until_control_plane_down
    "DUT hasn't shutdown in {} seconds".format(self.control_plane_down_timeout))
  File "ptftests/py3/advanced-reboot.py", line 413, in timeout
    raise type(err)("{}: {}".format(message, traceback_msg))
multiprocessing.context.TimeoutError: DUT hasn't shutdown in 600 seconds: Traceback (most recent call last):
  File "ptftests/py3/advanced-reboot.py", line 407, in timeout
    res = async_res.get(timeout=seconds)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 653, in get
    raise TimeoutError
multiprocessing.context.TimeoutError

2024-07-16 17:43:11 : ==================================================

vaibhavhd avatar Jul 16 '24 20:07 vaibhavhd

/Azpw run

vaibhavhd avatar Aug 12 '24 22:08 vaibhavhd

@andywongarista the tests are passing now. Can you re-run the PR tests so that we know they pass consistently? This is so that we do not block other PRs if this PR introduces a flaky issue.

vaibhavhd avatar Aug 12 '24 22:08 vaibhavhd

/azpw run Azure.sonic-mgmt

andywongarista avatar Aug 13 '24 21:08 andywongarista

/AzurePipelines run Azure.sonic-mgmt

mssonicbld avatar Aug 13 '24 21:08 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Aug 13 '24 21:08 azure-pipelines[bot]

@StormLiangMS , please help to merge this PR.

vaibhavhd avatar Aug 14 '24 16:08 vaibhavhd

@StormLiangMS please merge this PR

andywongarista avatar Aug 29 '24 18:08 andywongarista