sonic-mgmt
Issue introduced by PR 8089, sometimes warm-reboot will be stuck in endless loop
Description: PR https://github.com/sonic-net/sonic-mgmt/pull/8089/ can cause the test to hang in the following loop:
    while not (self.sniff_thr.isAlive() and self.sender_thr.isAlive()):
        time.sleep(1)
There is no guarantee that self.sniff_thr and self.sender_thr are both alive when this check runs. If either thread has already finished, the condition never becomes true and the test loops forever. For example, when running the warm-reboot sad inboot test, self.sender_thr sometimes finishes before this checkpoint, so the test spins in the loop until the ptf timeout.
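One way to avoid the hang, as a minimal sketch rather than the actual fix in advanced-reboot.py: have each thread set a "started" flag and wait on that flag with a deadline, instead of polling liveness. The helper names `make_tracked_thread` and `wait_until_started` and the timeout value are hypothetical and do not exist in sonic-mgmt. The sketch uses only the standard `threading.Event` API.

```python
import threading
import time


def make_tracked_thread(target):
    """Wrap `target` so the thread records that it started, even if it exits quickly.

    Hypothetical helper for illustration; not part of sonic-mgmt.
    """
    started = threading.Event()

    def runner():
        started.set()  # set before the real work so a short-lived thread is still seen
        target()

    return threading.Thread(target=runner), started


def wait_until_started(started_events, timeout=10.0):
    """Return True if every event is set within `timeout` seconds, else False."""
    deadline = time.monotonic() + timeout
    for event in started_events:
        remaining = max(0.0, deadline - time.monotonic())
        if not event.wait(remaining):
            return False  # caller can fail the test instead of looping forever
    return True
```

With this shape, a sender thread that finishes before the check is still counted as started, and a thread that never starts leads to a bounded wait and a `False` result rather than a spin until the ptf timeout.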
Steps to reproduce the issue:
- Run warm-reboot sad inboot case
@vaibhavhd can you further triage this issue?
Looks like #8089 was reverted already
Hi @yxieca, #8089 was only reverted on the 202205 branch. The issue still exists on the master and 202305 branches. I tried to revert the commit, but there is a conflict. @vaibhavhd Could you help check and fix the issue, or revert #8089 on master/202305?
Thanks.
I found that this is not exactly the same issue, so I have opened another ticket to track it: https://github.com/sonic-net/sonic-mgmt/issues/10362.
Here is a possible cause of this issue: if an exception is raised in the code before the sender and sniffer threads are started, the test gets trapped in the while loop mentioned in the description above. https://github.com/sonic-net/sonic-mgmt/blob/f289ef51284cae72590e84a2d2c30efa9bdc2654/ansible/roles/test/files/ptftests/py3/advanced-reboot.py#L1469-L1512 I have a ptf log of this failure (warm-reboot.log). It shows that an exception occurred in self.wait_until_teamd_goes_down(), so the sender/sniffer threads were never started. So any exception raised before the sender/sniffer threads start should be handled.
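To illustrate the kind of handling meant here, a rough sketch under stated assumptions: the pre-start steps (for example a call like self.wait_until_teamd_goes_down()) are run inside a try block, and any failure is reported to the caller so that nothing waits on threads that were never started. The function `start_traffic_threads` and its parameters are hypothetical and are not taken from advanced-reboot.py.

```python
import threading


def start_traffic_threads(pre_start_steps, sender_target, sniffer_target):
    """Run the pre-start steps, then start the sniffer and sender threads.

    Returns (sender_thr, sniffer_thr, error). On failure of any pre-start
    step, no thread is created and `error` holds the exception.
    Hypothetical helper for illustration; not part of sonic-mgmt.
    """
    try:
        for step in pre_start_steps:
            step()  # e.g. a wait for teamd to go down
    except Exception as exc:
        # Report the failure instead of letting a waiter poll threads
        # that will never be started.
        return None, None, exc

    sniffer_thr = threading.Thread(target=sniffer_target)
    sender_thr = threading.Thread(target=sender_target)
    sniffer_thr.start()
    sender_thr.start()
    return sender_thr, sniffer_thr, None
```

The caller can then check `error` first and fail the test immediately, and only enter any wait on the threads when both were actually started.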
@congh-nvidia @yxieca should we close this issue for 202205 and keep only the other one for master and 202305?
@vaibhavhd ping
Hi @liat-grozovik @yxieca, the test has changed a lot since this issue was opened, and as far as I know we are no longer seeing this issue. I think we can close it. @JibinBao please confirm.
I agree.