sonic-mgmt
Add more retry times when upgrade image
Signed-off-by: Zhaohui Sun [email protected]
Description of PR
Summary: Fixes # (issue)
Type of change
- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [x] Test case(new/improvement)
Back port request
- [ ] 201911
- [ ] 202012
- [ ] 202205
Approach
What is the motivation for this PR?
The image upgrade may fail because it can take more than 5 minutes. For example, upgrading a Mellanox testbed with the master image sometimes takes longer than 5 minutes. It is not clear why the default timeout is 5 minutes.
2022-11-11T11:14:05.9143210Z Friday 11 November 2022 11:14:05 +0000 (0:00:00.026) 0:00:01.760 *******
2022-11-11T11:19:06.4983686Z fatal: [str-msn2700-02]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Shared connection to 10.3.147.45 closed.", "unreachable": true}
How did you do it?
Refer to the Ansible documentation on asynchronous actions: https://docs.ansible.com/ansible/latest/user_guide/playbooks_async.html#:~:text=If%20you%20want%20to%20run,longer%20than%20its%20async%20value
Use async and poll to run the upgrade script asynchronously, with the async timeout set to 500s.
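As a rough illustration of the async/poll approach described above, a task might look like the sketch below. This is not the actual change in this PR: the host group, the script path, and the poll interval of 10 seconds are assumptions made for the example. Only the async and poll keywords and the 500s value come from the description.

```yaml
# Sketch only: host group, script path, and poll interval are hypothetical.
- hosts: sonic                  # hypothetical inventory group
  gather_facts: false
  tasks:
    - name: Run image upgrade asynchronously
      ansible.builtin.shell: /tmp/upgrade_image.sh   # hypothetical script path
      become: true
      async: 500    # maximum runtime in seconds allowed for the job
      poll: 10      # check the job status every 10 seconds (assumed value)
```

With poll set to a positive value, Ansible still waits for the task to finish, but it does so by checking the job status periodically rather than holding a single long-running command open.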
How did you verify/test it?
Run ansible-playbook upgrade_sonic.yml
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation
If increasing the retry count from 5 to 10 works around this, it suggests that the upgrade can sometimes complete within 5 minutes, given enough attempts.
The more fundamental issue appears to be that if any Ansible module needs more than 5 minutes to complete, Ansible forcibly terminates the task before it finishes. This sounds like a new, fundamental issue.
So I don't think simply increasing the retry count is the correct way to fix this.
The root cause was introduced by this PR: https://github.com/sonic-net/sonic-buildimage/pull/12109. It changed the default SSH timeout from 15m to 5m. Closing this PR, since the issue will be addressed by a different fix.