Issue introduced by PR 8089: warm-reboot sometimes gets stuck in an endless loop

Open JibinBao opened this issue 1 year ago • 6 comments

Description: PR https://github.com/sonic-net/sonic-mgmt/pull/8089/ might cause some issues.

while not (self.sniff_thr.isAlive() and self.sender_thr.isAlive()):
    time.sleep(1)

We cannot guarantee that self.sniff_thr and self.sender_thr are both alive at this point. If either thread has already finished, the loop condition never becomes false and the test loops forever. For example, when running the warm-reboot sad inboot test, self.sender_thr sometimes finishes before this checkpoint, so the test stays in the loop until the ptf timeout.
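A minimal sketch of one way to bound such a wait, so that a thread that has already exited cannot hold the test in the loop indefinitely. The helper name `wait_for_threads_alive` and its `timeout` and `poll_interval` parameters are hypothetical and are not part of advanced-reboot.py; the sketch also uses `is_alive()`, since `isAlive()` was removed from `threading.Thread` in Python 3.9.

```python
import threading
import time


def wait_for_threads_alive(threads, timeout=10.0, poll_interval=0.1):
    """Return True once every thread is alive, False if the deadline passes.

    Hypothetical helper for illustration; not taken from the test code.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(t.is_alive() for t in threads):
            return True
        time.sleep(poll_interval)
    # One last check after the deadline, then give up instead of looping.
    return all(t.is_alive() for t in threads)


# A thread that has already finished models sender_thr exiting early.
finished = threading.Thread(target=lambda: None)
finished.start()
finished.join()

# A thread that blocks on an event models a sniffer that is still running.
stop = threading.Event()
running = threading.Thread(target=stop.wait)
running.start()

result_mixed = wait_for_threads_alive([running, finished], timeout=0.5)
result_ok = wait_for_threads_alive([running], timeout=0.5)

stop.set()
running.join()
```

With a deadline, the caller can fail the test with a clear message when the result is False, rather than waiting for the ptf timeout.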

Steps to reproduce the issue:

  1. Run warm-reboot sad inboot case

JibinBao, Sep 19 '23 10:09

@vaibhavhd can you further triage this issue?

yxieca, Oct 04 '23 15:10

Looks like #8089 was already reverted.

yxieca, Oct 04 '23 15:10

Hi @yxieca, #8089 was reverted only on the 202205 branch. The issue still exists on the master and 202305 branches. I tried to revert the commit, but there is a conflict. @vaibhavhd, could you help check and fix the issue, or revert #8089 on master/202305?

Thanks.

congh-nvidia, Oct 16 '23 10:10

> Hi @yxieca, #8089 was reverted only on the 202205 branch. The issue still exists on the master and 202305 branches. I tried to revert the commit, but there is a conflict. @vaibhavhd, could you help check and fix the issue, or revert #8089 on master/202305?
>
> Thanks.

I found that this is not exactly the same issue, so I have opened another ticket to track it: https://github.com/sonic-net/sonic-mgmt/issues/10362.

congh-nvidia, Oct 17 '23 02:10

Here is a possible cause of this issue: if any exception is raised in the code before the sender and sniffer threads are started, the test is trapped in the while loop quoted in the description above.

https://github.com/sonic-net/sonic-mgmt/blob/f289ef51284cae72590e84a2d2c30efa9bdc2654/ansible/roles/test/files/ptftests/py3/advanced-reboot.py#L1469-L1512

I have a ptf log of this failure (warm-reboot.log). It appears that an exception was raised in self.wait_until_teamd_goes_down(), so the sender/sniffer threads were never started. Any exception raised before the sender/sniffer threads are started should therefore be handled.
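A hedged sketch of the kind of handling suggested here, assuming the pre-start work runs in its own thread while another part of the test waits for traffic to begin. All names below (`run_setup_and_start`, `wait_for_start`, `failing_step`, `run_case`) are hypothetical stand-ins and do not come from advanced-reboot.py; `failing_step` only simulates an exception such as the one seen in self.wait_until_teamd_goes_down().

```python
import threading
import time


def run_setup_and_start(pre_start_step, started, failed, errors):
    """Run the pre-start step; signal `started` on success, `failed` on error."""
    try:
        pre_start_step()
    except Exception as exc:
        # Record the error and tell the waiter to stop waiting.
        errors.append(exc)
        failed.set()
        return
    # The real test would start its sender and sniffer threads here.
    started.set()


def wait_for_start(started, failed, timeout=5.0, poll_interval=0.05):
    """Wait until setup succeeds, fails, or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if started.is_set():
            return "started"
        if failed.is_set():
            return "failed"
        time.sleep(poll_interval)
    return "timeout"


def failing_step():
    # Simulated failure before the sender/sniffer threads are started.
    raise RuntimeError("simulated failure before threads start")


def run_case(step):
    started, failed, errors = threading.Event(), threading.Event(), []
    worker = threading.Thread(
        target=run_setup_and_start, args=(step, started, failed, errors)
    )
    worker.start()
    outcome = wait_for_start(started, failed)
    worker.join()
    return outcome, errors


outcome_bad, errors_bad = run_case(failing_step)
outcome_good, errors_good = run_case(lambda: None)
```

The point of the design is that the waiting side observes an explicit failure signal, so an exception in the setup path ends the wait immediately and the recorded error can be reported.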

congh-nvidia, Nov 09 '23 11:11

@congh-nvidia @yxieca Should we close this issue for 202205 and keep only the other one for master and 202305?

liat-grozovik, Apr 16 '24 14:04

@vaibhavhd ping

yxieca, Apr 18 '24 21:04

Hi @liat-grozovik @yxieca, the test has changed a lot since this issue was opened, and as far as I know we are no longer experiencing this issue. I think we can close it. @JibinBao, please confirm.

congh-nvidia, Apr 19 '24 02:04

> Hi @liat-grozovik @yxieca, the test has changed a lot since this issue was opened, and as far as I know we are no longer experiencing this issue. I think we can close it. @JibinBao, please confirm.

I agree.

JibinBao, Apr 19 '24 02:04