
Test test_service_warm_restart gets stuck in dead loop.

Open congh-nvidia opened this issue 2 years ago • 2 comments

Description

This issue is caused by the combination of PR https://github.com/sonic-net/sonic-mgmt/pull/8089 and PR https://github.com/sonic-net/sonic-mgmt/pull/8993. PR 8089 added this while loop: https://github.com/sonic-net/sonic-mgmt/blob/master/ansible/roles/test/files/ptftests/py3/advanced-reboot.py#L1069-L1071

        # wait until sniffer and sender threads have started
        while not (self.sniff_thr.isAlive() and self.sender_thr.isAlive()):
            time.sleep(1)
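
For illustration only, a wait like the one above can be given a deadline so that a thread that is never started produces a failure instead of a hang. This is a minimal sketch, not code from the repository: the helper name `wait_for_threads_alive` and its `timeout` and `poll_interval` parameters are hypothetical, and it uses `is_alive()`, the spelling available in current Python 3 releases.

```python
import threading
import time


def wait_for_threads_alive(threads, timeout=30.0, poll_interval=0.1):
    """Hypothetical helper: return True once every thread is alive,
    or False if the deadline passes first."""
    deadline = time.monotonic() + timeout
    while not all(t.is_alive() for t in threads):
        if time.monotonic() >= deadline:
            # Give up instead of spinning forever on unstarted threads.
            return False
        time.sleep(poll_interval)
    return True
```

A caller could then log or fail the test when the helper returns False, rather than waiting for the outer test timeout.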

PR 8993 then moved the logic that starts sniff_thr and sender_thr, with the result that they are never started in the test_service_warm_restart test: https://github.com/sonic-net/sonic-mgmt/blob/master/ansible/roles/test/files/ptftests/py3/advanced-reboot.py#L1469-L1515

    def reboot_dut(self):
        time.sleep(self.reboot_delay)

        self.log("Rebooting remote side")
        if self.reboot_type != 'service-warm-restart' and self.test_params['other_vendor_flag'] is False:
            # Check to see if the warm-reboot script knows about the retry count feature
            stdout, stderr, return_code = self.dut_connection.execCommand(
                "sudo " + self.reboot_type + " -h", timeout=5)
            if "retry count" in stdout:
                if self.test_params['neighbor_type'] == "sonic":
                    reboot_command = self.reboot_type + " -N"
                else:
                    reboot_command = self.reboot_type + " -n"
            else:
                reboot_command = self.reboot_type

            # create an empty log file to capture output of reboot command
            reboot_log_file = "/host/{}.log".format(reboot_command.replace(' ', ''))
            self.dut_connection.execCommand("sudo touch {}; sudo chmod 666 {}".format(
                reboot_log_file, reboot_log_file))

            # execute reboot command w/ nohup so that when the execCommand times-out:
            # 1. there is a reader/writer for any bash commands using PIPE
            # 2. the output and error of CLI still gets written to log file
            stdout, stderr, return_code = self.dut_connection.execCommand(
                "nohup sudo {} -v &> {}".format(
                    reboot_command, reboot_log_file), timeout=10)

        elif self.test_params['other_vendor_flag'] is True:
            ignore_db_integrity_check = " -d"
            stdout, stderr, return_code = self.dut_connection.execCommand(
                "sudo " + self.reboot_type + ignore_db_integrity_check, timeout=10)

        else:
            self.restart_service()
            return  # <-- returns here, before the threads are started, in the service restart test

        if not self.kvm_test and\
                (self.reboot_type == 'fast-reboot' or 'warm-reboot' in
                 self.reboot_type or 'service-warm-restart' in self.reboot_type):
            # Event for the sniff_in_background status.
            self.sniffer_started = threading.Event()

            self.wait_until_teamd_goes_down()

            self.sniff_thr.start()
            self.sender_thr.start()

In the service restart test, the sender and sniffer threads are never started, because reboot_dut() returns before it reaches the code that starts them. As a result, the while loop added by PR 8089 never exits.
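
The control flow described above can be reproduced in isolation. The sketch below is a toy model, not the real test class: `RebootFlowSketch`, `threads_started()` and `cleanup()` are hypothetical names, and the threads simply block on an event instead of sniffing or sending traffic.

```python
import threading


class RebootFlowSketch:
    """Toy model of the control flow; not the real advanced-reboot.py class."""

    def __init__(self, reboot_type):
        self.reboot_type = reboot_type
        self.stop = threading.Event()
        # Stand-ins for the sniffer and sender threads.
        self.sniff_thr = threading.Thread(target=self.stop.wait)
        self.sender_thr = threading.Thread(target=self.stop.wait)

    def reboot_dut(self):
        if self.reboot_type == 'service-warm-restart':
            # Mirrors the early return in the issue: threads are never started.
            return
        self.sniff_thr.start()
        self.sender_thr.start()

    def threads_started(self):
        # Same condition that the wait loop polls.
        return self.sniff_thr.is_alive() and self.sender_thr.is_alive()

    def cleanup(self):
        self.stop.set()
        for thread in (self.sniff_thr, self.sender_thr):
            if thread.is_alive():  # joining an unstarted thread would raise
                thread.join()
```

In this model, `threads_started()` stays false after `reboot_dut()` for the `'service-warm-restart'` case, so a loop that waits for it without a timeout would never finish.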

Steps to reproduce the issue:

  1. Run the test platform_tests/test_service_warm_restart.py::test_service_warm_restart

Describe the results you received: Test got stuck in a dead loop and then timed out.

Describe the results you expected: Test should pass.


congh-nvidia avatar Oct 17 '23 02:10 congh-nvidia

@vaibhavhd Could you help check this issue? Thanks.

congh-nvidia avatar Oct 17 '23 02:10 congh-nvidia

@vaibhavhd before we skip the test that fails because of this issue, can you please provide an ETA?

liat-grozovik avatar Apr 16 '24 14:04 liat-grozovik

@yxieca @vaibhavhd we will lose coverage because of this bug. Please prioritise it.

liat-grozovik avatar May 07 '24 11:05 liat-grozovik

@congh-nvidia , you seem to have identified the root cause already. Are you not able to fix this?

If not, @ryanzhu706 can you help take a look at this issue?

vaibhavhd avatar Jun 03 '24 16:06 vaibhavhd

Hi @vaibhavhd, I don't currently have time to fix this. I also don't quite understand why the logic for starting the sniffer and sender was moved in https://github.com/sonic-net/sonic-mgmt/pull/8993, so I'm not sure how to fix it.

congh-nvidia avatar Jun 04 '24 02:06 congh-nvidia

@ryanzhu706 will take a look and come up with a fix, if needed. This is low priority on our plate.

Regarding your question about 8993: this change was made so that IO measurement starts as soon as dataplane-impacting services go down in the shutdown sequence, and ends when the warm/fast-reboot is done.

vaibhavhd avatar Jun 24 '24 16:06 vaibhavhd