sonic-mgmt
Enhance log processing for sonic neighbors
Description of PR
Summary: Fixes # (issue)
Type of change
- [ ] Bug fix
- [ ] Testbed and Framework (new/improvement)
- [x] Test case (new/improvement)
Back port request
- [ ] 202012
- [ ] 202205
- [ ] 202305
- [x] 202311
- [x] 202405
Approach
What is the motivation for this PR?
Current processing involves iterating through sonic logs (which can be large) from the beginning, which is unnecessary since only log lines starting from a particular timestamp are relevant.
How did you do it?
Optimize this processing by iterating in reverse and stopping once the scan passes the earliest relevant timestamp.
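As a rough sketch of the approach described above (helper names such as `parse_logs_reverse` and `parse_timestamp` are illustrative, not the actual sonic.py API): walking the lines from the end lets the scan break as soon as it reaches a line older than the cutoff, so the potentially large head of the log is never touched.

```python
from datetime import datetime


def parse_logs_reverse(lines, start_time, parse_timestamp):
    """Collect log lines at or after start_time by scanning from the end.

    parse_timestamp(line) is assumed to return a datetime, or None for
    lines without a recognizable timestamp (those are kept, since they
    may be continuations of a relevant entry).
    """
    relevant = []
    for line in reversed(lines):
        ts = parse_timestamp(line)
        # First line strictly older than the cutoff: everything before
        # it is also older, so stop instead of scanning to the start.
        if ts is not None and ts < start_time:
            break
        relevant.append(line)
    relevant.reverse()  # restore chronological order
    return relevant
```

With the lines already in memory this only saves the per-line parsing work; for a large on-disk log the same idea could be combined with seeking backwards from the end of the file.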
How did you verify/test it?
Ran test_upgrade_path with SONiC neighbors and verified in warm-reboot.log that SSH threads no longer hang for unnecessarily long periods due to log processing.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation
The pre-commit check detected issues in the files touched by this pull request. The pre-commit check is a mandatory check; please fix the detected issues.
Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1
ansible/roles/test/files/ptftests/sonic.py:184:32: E201 whitespace after '('
ansible/roles/test/files/ptftests/sonic.py:237:5: E303 too many blank lines (2)
flake8...............................................(no files to check)Skipped
check conditional mark sort..........................(no files to check)Skipped
To run the pre-commit checks locally, you can follow the steps below:
- Ensure that the default python is python3. In the sonic-mgmt docker container, the default python is python2. You can run the check by activating the python3 virtual environment inside the sonic-mgmt docker container, or by running outside of the container.
- Ensure that the pre-commit package is installed:
sudo pip install pre-commit
- Go to the repository root folder
- Install the pre-commit hooks:
pre-commit install
- Use pre-commit to check staged files:
pre-commit
- Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>
Would you please fix the pre-commit check?
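For reference, the two flake8 codes reported above are pure style checks: E201 flags whitespace immediately after an opening bracket, and E303 flags more consecutive blank lines than expected at that position. The snippet below is a generic illustration of the flagged patterns in their fixed form, not the actual sonic.py code:

```python
# E201 flags whitespace right after an opening bracket, e.g.
#   value = max( 1, 2)   ->  "E201 whitespace after '('"
value = max(1, 2)  # fixed: no space after '('


# E303 flags runs of blank lines longer than flake8 allows at that
# nesting level; the fix is simply to delete the surplus blank lines.
def sample():
    first = 1
    return first
```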
/azpw run Azure.sonic-mgmt
/AzurePipelines run Azure.sonic-mgmt
Azure Pipelines successfully started running 1 pipeline(s).
@saiarcot895 FYI
Seeing a failure in test_advanced_reboot.py::test_warm_reboot in the Azure pipeline due to a KeyError, but locally I am seeing a test failure due to a TimeoutError, with AND without my change. Is this the case on your end?
The failure in the pipeline is with sonic neighbors, but it looks like the logs from your run are with cEOS neighbors.
Thanks, this TimeoutError was indeed due to accidentally running with cEOS neighbors.
I reran with sonic neighbors and am seeing the test pass.
logs.tar.gz
/azp run Azure.sonic-mgmt
Azure Pipelines successfully started running 1 pipeline(s).
Hi @saiarcot895, any concerns blocking approval?
@StormLiangMS the main concern is that something is breaking the kvmtest-t0-sonic test case.
/azpw run Azure.sonic-mgmt
/AzurePipelines run Azure.sonic-mgmt
Azure Pipelines successfully started running 1 pipeline(s).
/azpw run Azure.sonic-mgmt
/AzurePipelines run Azure.sonic-mgmt
Azure Pipelines successfully started running 1 pipeline(s).
2024-07-08 19:45:20 : SSH thread VM=10.250.0.54: Output on stderr from command 'sudo cat /var/log/syslog{.1,}': 'cat: /var/log/syslog.1: No such file or directory'
2024-07-08 19:45:20 : SSH thread VM=10.250.0.53: Output on stderr from command 'sudo cat /var/log/syslog{.1,}': 'cat: /var/log/syslog.1: No such file or directory'
2024-07-08 19:45:20 : SSH thread VM=10.250.0.54: Output on stderr from command 'sudo cat /var/log/teamd.log{.1,}': 'cat: /var/log/teamd.log.1: No such file or directory'
2024-07-08 19:45:20 : SSH thread VM=10.250.0.53: Output on stderr from command 'sudo cat /var/log/teamd.log{.1,}': 'cat: /var/log/teamd.log.1: No such file or directory'
2024-07-08 19:45:21 : SSH thread VM=10.250.0.54: Log output ""
2024-07-08 19:45:21 : Error in HostDevice: Traceback (most recent call last):
File "ptftests/py3/advanced-reboot.py", line 1603, in peer_state_check
self.fails[ip], self.info[ip], self.cli_info[ip], self.logs_info[ip], self.lacp_pdu_times[ip] = ssh.run()
File "ptftests/py3/sonic.py", line 188, in run
log_data = self.parse_logs(parsed_logs)
File "ptftests/py3/sonic.py", line 290, in parse_logs
po_carrier_data = [k for k in data["teamd#teamd_PortChannel1"] if "carrier changed to" in k[1]]
KeyError: 'teamd#teamd_PortChannel1'
2024-07-08 19:45:21 : SSH thread for VM 10.250.0.54 finished
2024-07-08 19:45:21 : lacp_pdu_all_times: IP:10.250.0.54: []
2024-07-08 19:45:21 : Send 10 Received 10 ping DUT
2024-07-08 19:45:21 : SSH thread VM=10.250.0.53: Log output ""
2024-07-08 19:45:21 : Error in HostDevice: Traceback (most recent call last):
File "ptftests/py3/advanced-reboot.py", line 1603, in peer_state_check
self.fails[ip], self.info[ip], self.cli_info[ip], self.logs_info[ip], self.lacp_pdu_times[ip] = ssh.run()
File "ptftests/py3/sonic.py", line 188, in run
log_data = self.parse_logs(parsed_logs)
File "ptftests/py3/sonic.py", line 290, in parse_logs
po_carrier_data = [k for k in data["teamd#teamd_PortChannel1"] if "carrier changed to" in k[1]]
KeyError: 'teamd#teamd_PortChannel1'
2024-07-08 19:45:21 : SSH thread for VM 10.250.0.53 finished
2024-07-08 19:45:21 : lacp_pdu_all_times: IP:10.250.0.53: []
2024-07-08 19:45:21 : SSH thread VM=10.250.0.51: Output on stderr from command 'sudo cat /var/log/teamd.log{.1,}': 'cat: /var/log/teamd.log.1: No such file or directory'
2024-07-08 19:45:21 : SSH thread VM=10.250.0.52: Output on stderr from command 'sudo cat /var/log/teamd.log{.1,}': 'cat: /var/log/teamd.log.1: No such file or directory'
2024-07-08 19:45:21 : SSH thread VM=10.250.0.51: Log output ""
2024-07-08 19:45:21 : Error in HostDevice: Traceback (most recent call last):
File "ptftests/py3/advanced-reboot.py", line 1603, in peer_state_check
self.fails[ip], self.info[ip], self.cli_info[ip], self.logs_info[ip], self.lacp_pdu_times[ip] = ssh.run()
File "ptftests/py3/sonic.py", line 188, in run
log_data = self.parse_logs(parsed_logs)
File "ptftests/py3/sonic.py", line 290, in parse_logs
po_carrier_data = [k for k in data["teamd#teamd_PortChannel1"] if "carrier changed to" in k[1]]
KeyError: 'teamd#teamd_PortChannel1'
2024-07-08 19:45:21 : SSH thread for VM 10.250.0.51 finished
2024-07-08 19:45:21 : lacp_pdu_all_times: IP:10.250.0.51: []
2024-07-08 19:45:21 : SSH thread VM=10.250.0.52: Log output ""
2024-07-08 19:45:21 : Error in HostDevice: Traceback (most recent call last):
File "ptftests/py3/advanced-reboot.py", line 1603, in peer_state_check
self.fails[ip], self.info[ip], self.cli_info[ip], self.logs_info[ip], self.lacp_pdu_times[ip] = ssh.run()
File "ptftests/py3/sonic.py", line 188, in run
log_data = self.parse_logs(parsed_logs)
File "ptftests/py3/sonic.py", line 290, in parse_logs
po_carrier_data = [k for k in data["teamd#teamd_PortChannel1"] if "carrier changed to" in k[1]]
KeyError: 'teamd#teamd_PortChannel1'
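One possible way to make parse_logs robust against this hard KeyError would be to guard the dictionary lookup. This is only a sketch: the key string and the (timestamp, message) tuple shape are taken from the traceback above, but the function and parameter names are hypothetical, and treating a missing channel key as "no carrier data" rather than crashing is an assumption about the intended behavior.

```python
def extract_carrier_changes(data, channel="teamd#teamd_PortChannel1"):
    """Return the 'carrier changed to' entries for the given channel.

    data maps log-source names to lists of (timestamp, message) tuples,
    mirroring the structure implied by the traceback. If the channel key
    is absent (e.g. teamd logged nothing in the scanned window), return
    an empty list instead of raising KeyError.
    """
    entries = data.get(channel, [])
    return [k for k in entries if "carrier changed to" in k[1]]
```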
Next step - please try to run this on a local KVM testbed.
Latest failure:
==================================================
2024-07-16 17:43:11 : Report:
2024-07-16 17:43:11 : ==================================================
2024-07-16 17:43:11 : LACP/BGP were down for (extracted from cli):
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : Extracted from VM logs:
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : Summary:
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : Fails:
2024-07-16 17:43:11 : --------------------------------------------------
2024-07-16 17:43:11 : FAILED:dut:Traceback (most recent call last):
  File "ptftests/py3/advanced-reboot.py", line 407, in timeout
    res = async_res.get(timeout=seconds)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 653, in get
    raise TimeoutError
multiprocessing.context.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ptftests/py3/advanced-reboot.py", line 1344, in runTest
    self.wait_until_control_plane_down()
  File "ptftests/py3/advanced-reboot.py", line 1020, in wait_until_control_plane_down
    "DUT hasn't shutdown in {} seconds".format(self.control_plane_down_timeout))
  File "ptftests/py3/advanced-reboot.py", line 413, in timeout
    raise type(err)("{}: {}".format(message, traceback_msg))
multiprocessing.context.TimeoutError: DUT hasn't shutdown in 600 seconds: Traceback (most recent call last):
  File "ptftests/py3/advanced-reboot.py", line 407, in timeout
    res = async_res.get(timeout=seconds)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 653, in get
    raise TimeoutError
multiprocessing.context.TimeoutError

2024-07-16 17:43:11 : ==================================================
/Azpw run
@andywongarista the tests are passing now. Can you re-run the PR tests so that we know they pass consistently? This is so that we do not block other PRs if this PR introduces a flaky issue.
/azpw run Azure.sonic-mgmt
/AzurePipelines run Azure.sonic-mgmt
Azure Pipelines successfully started running 1 pipeline(s).
@StormLiangMS, please help to merge this PR.
@StormLiangMS please merge this PR