sonic-mgmt
Test test_service_warm_restart gets stuck in an infinite loop
Description
This issue is caused by the combination of PR https://github.com/sonic-net/sonic-mgmt/pull/8089 and PR https://github.com/sonic-net/sonic-mgmt/pull/8993. PR 8089 added this while loop (https://github.com/sonic-net/sonic-mgmt/blob/master/ansible/roles/test/files/ptftests/py3/advanced-reboot.py#L1069-L1071):
```python
# wait until sniffer and sender threads have started
while not (self.sniff_thr.isAlive() and self.sender_thr.isAlive()):
    time.sleep(1)
```
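One way to keep this wait from hanging forever is to bound it. The sketch below is a hypothetical helper (`wait_for_threads_started` is not part of advanced-reboot.py) that polls the same condition but gives up after a timeout. It uses `Thread.is_alive()` because the `isAlive()` alias was removed in Python 3.9. Note that `is_alive()` is also False for a thread that has already finished, so, like the original loop, it cannot tell "never started" apart from "already exited".

```python
import threading
import time


def wait_for_threads_started(threads, timeout=60.0, poll_interval=1.0):
    """Poll until every thread is alive, or give up after `timeout` seconds.

    Hypothetical helper: unlike the unbounded `while not (...)` loop, it
    returns False instead of spinning forever when a thread is never started.
    """
    deadline = time.monotonic() + timeout
    while not all(t.is_alive() for t in threads):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


stop = threading.Event()
sniffer = threading.Thread(target=stop.wait)  # stand-in for sniff_thr
sender = threading.Thread(target=stop.wait)   # stand-in for sender_thr

# Threads never started, as in the service-warm-restart path: the wait times out.
print(wait_for_threads_started([sniffer, sender], timeout=0.3, poll_interval=0.1))  # False

sniffer.start()
sender.start()
print(wait_for_threads_started([sniffer, sender], timeout=0.3, poll_interval=0.1))  # True

stop.set()
sniffer.join()
sender.join()
```

A caller could then fail the test with a clear message on `False` instead of letting the whole run time out.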
PR 8993 then moved the logic that starts sniff_thr and sender_thr, with the result that these threads are never started in the test_service_warm_restart test (https://github.com/sonic-net/sonic-mgmt/blob/master/ansible/roles/test/files/ptftests/py3/advanced-reboot.py#L1469-L1515):
```python
def reboot_dut(self):
    time.sleep(self.reboot_delay)
    self.log("Rebooting remote side")
    if self.reboot_type != 'service-warm-restart' and self.test_params['other_vendor_flag'] is False:
        # Check to see if the warm-reboot script knows about the retry count feature
        stdout, stderr, return_code = self.dut_connection.execCommand(
            "sudo " + self.reboot_type + " -h", timeout=5)
        if "retry count" in stdout:
            if self.test_params['neighbor_type'] == "sonic":
                reboot_command = self.reboot_type + " -N"
            else:
                reboot_command = self.reboot_type + " -n"
        else:
            reboot_command = self.reboot_type
        # create an empty log file to capture output of reboot command
        reboot_log_file = "/host/{}.log".format(reboot_command.replace(' ', ''))
        self.dut_connection.execCommand("sudo touch {}; sudo chmod 666 {}".format(
            reboot_log_file, reboot_log_file))
        # execute reboot command w/ nohup so that when the execCommand times-out:
        # 1. there is a reader/writer for any bash commands using PIPE
        # 2. the output and error of CLI still gets written to log file
        stdout, stderr, return_code = self.dut_connection.execCommand(
            "nohup sudo {} -v &> {}".format(
                reboot_command, reboot_log_file), timeout=10)
    elif self.test_params['other_vendor_flag'] is True:
        ignore_db_integrity_check = " -d"
        stdout, stderr, return_code = self.dut_connection.execCommand(
            "sudo " + self.reboot_type + ignore_db_integrity_check, timeout=10)
    else:
        self.restart_service()
        return  # <-- returns before the threads are started in the service restart test
    if not self.kvm_test and\
            (self.reboot_type == 'fast-reboot' or 'warm-reboot' in
             self.reboot_type or 'service-warm-restart' in self.reboot_type):
        # Event for the sniff_in_background status.
        self.sniffer_started = threading.Event()
        self.wait_until_teamd_goes_down()
        self.sniff_thr.start()
        self.sender_thr.start()
```
In the service restart test, the sender and sniffer threads are never started because reboot_dut() returns before it reaches the code that starts them, so the while loop above waits forever.
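The control flow can be reproduced in miniature. The class below is a hypothetical, stripped-down model of `reboot_dut()` (`RebootFlowModel` and its methods are illustrative names, not sonic-mgmt code): it keeps only the branch structure and the thread starts, assumes `other_vendor_flag` is False, and ignores the `kvm_test` guard. It shows that the `service-warm-restart` branch leaves both threads unstarted, which is exactly the state the wait loop never escapes.

```python
import threading


class RebootFlowModel:
    """Hypothetical model of reboot_dut()'s branch structure only.

    The real method runs commands on the DUT; here each branch is a stub.
    """

    def __init__(self, reboot_type):
        self.reboot_type = reboot_type
        self._stop = threading.Event()
        # Stand-ins for sniff_thr / sender_thr: they block until stop() is called.
        self.sniff_thr = threading.Thread(target=self._stop.wait, daemon=True)
        self.sender_thr = threading.Thread(target=self._stop.wait, daemon=True)

    def reboot_dut(self):
        if self.reboot_type != 'service-warm-restart':
            pass  # the real code issues the fast/warm-reboot command here
        else:
            # the real code calls self.restart_service() here, then returns early
            return
        # Never reached for 'service-warm-restart', so the threads stay unstarted.
        self.sniff_thr.start()
        self.sender_thr.start()

    def threads_alive(self):
        # Same condition the wait loop polls (using the Python 3 spelling).
        return self.sniff_thr.is_alive() and self.sender_thr.is_alive()

    def stop(self):
        self._stop.set()


for reboot_type in ('warm-reboot', 'service-warm-restart'):
    model = RebootFlowModel(reboot_type)
    model.reboot_dut()
    print(reboot_type, model.threads_alive())
    model.stop()
# warm-reboot True
# service-warm-restart False
```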
Steps to reproduce the issue:
- Run the test platform_tests/test_service_warm_restart.py::test_service_warm_restart
Describe the results you received: The test got stuck in an infinite loop and then timed out.
Describe the results you expected: Test should pass.
@vaibhavhd Could you help check this issue? Thanks.
@vaibhavhd before we skip the test that fails because of this issue, can you please provide an ETA?
@yxieca @vaibhavhd we will lose coverage because of this bug. Please prioritize it.
@congh-nvidia , you seem to have identified the root cause already. Are you not able to fix this?
If not, @ryanzhu706 can you help take a look at this issue?
Hi @vaibhavhd, I currently don't have time to fix this. I also don't quite understand why the logic for starting the sniffer and sender threads was moved in https://github.com/sonic-net/sonic-mgmt/pull/8993, so I'm not sure how to fix it.
@ryanzhu706 will take a look and come with fix, if needed. This is low priority on our plate.
Regarding your question about 8993: this change was made so that IO measurement starts as soon as dataplane-impacting services go down in the shutdown sequence, and ends when the warm/fast-reboot is done.
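The intent described above (open the IO-measurement window when a dataplane-impacting service goes down, not when the reboot command is issued) can be sketched as a poll-then-start step. This is a hypothetical illustration, not the advanced-reboot.py implementation: `wait_until`, `start_io_measurement_when_down` and the `service_is_down` predicate are assumed names, loosely mirroring the `wait_until_teamd_goes_down()` call followed by the thread starts in `reboot_dut()`. Unlike the unbounded wait loop, it raises if the service never goes down. In the service-warm-restart path of the quoted code, this step is skipped entirely because of the early return.

```python
import threading
import time


def wait_until(predicate, timeout, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


def start_io_measurement_when_down(service_is_down, threads, timeout=30.0):
    """Hypothetical sketch: start the IO threads only once the service is
    observed down, so the measurement window opens with the shutdown sequence."""
    if not wait_until(service_is_down, timeout):
        raise TimeoutError("service did not go down within {}s".format(timeout))
    for t in threads:
        t.start()


# Simulate teamd going down 0.2 s into the shutdown sequence.
teamd_down = threading.Event()
threading.Timer(0.2, teamd_down.set).start()

stop = threading.Event()
workers = [threading.Thread(target=stop.wait, daemon=True) for _ in range(2)]
start_io_measurement_when_down(teamd_down.is_set, workers, timeout=2.0)
print(all(w.is_alive() for w in workers))  # True
stop.set()
```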