assisted-test-infra
Add informative details from assisted-installer-controller log when time…
Add controller last error when timeout occurs during installation
During installation, when the waiter hits a timeout, it raises a timeout exception without any information about the root cause, and the installer timers keep waiting for a long time until the cluster changes state to failed/success.
Added support for picking up the last error from the assisted-installer-controller logs when a timeout exception is raised by the waiter, and appended that information to the raised exception.
Created a decorator for waiter functions (a sketch follows this list); when a waiter function is wrapped:
- If the waiter raises no exception, return the function result to the caller as is
- If the waiter bubbles up TimeoutExpired:
- verify the waiter was called from a cluster waiter
- try to download the kubeconfig; return true on success
- try to get the assisted-installer-controller logs filtered by level=error
- append the last error to the exception; if there is no error, there is nothing to append. The logs call may sometimes return empty or fail due to the VIP network, so retry
- raise the new, updated TimeoutExpired exception to the caller
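A minimal sketch of such a decorator, assuming the `waiting` library's TimeoutExpired exposes `timeout_seconds` and `what`; `download_kubeconfig()` and `get_controller_error_logs()` are hypothetical stand-ins for the real cluster and log-collection helpers, not the actual PR code:

```python
import functools

from waiting.exceptions import TimeoutExpired


def download_kubeconfig(cluster) -> bool:
    """Stand-in (hypothetical): try to download the cluster kubeconfig, True on success."""
    ...


def get_controller_error_logs(cluster) -> str:
    """Stand-in (hypothetical): fetch assisted-installer-controller logs filtered to errors."""
    ...


def append_controller_errors_on_timeout(waiter_func):
    """Decorator: enrich TimeoutExpired raised by a cluster waiter with controller errors."""

    @functools.wraps(waiter_func)
    def wrapper(cluster, *args, **kwargs):
        try:
            # No exception from the waiter: return its result to the caller as is.
            return waiter_func(cluster, *args, **kwargs)
        except TimeoutExpired as timeout_error:
            # (Verification that this really is a cluster waiter is omitted here.)
            last_errors = ""
            # The logs call may return empty or fail (e.g. VIP network issues), so retry.
            for _ in range(3):
                try:
                    if not download_kubeconfig(cluster):
                        continue
                    last_errors = get_controller_error_logs(cluster)
                    if last_errors:
                        break
                except Exception:
                    continue
            what = timeout_error.what
            if last_errors:
                what = f"{what}\n{last_errors}"
            # Raise a new TimeoutExpired that carries the controller error context.
            raise TimeoutExpired(timeout_error.timeout_seconds, what) from timeout_error

    return wrapper
```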
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: bkopilov.
Once this PR has been reviewed and has the lgtm label, please assign eranco74 for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.
Hi @bkopilov. Thanks for your PR.
I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@lalon4
/ok-to-test
@eliajahshan, @talhil-rh
I ran CI regression3 with this fix and it worked; on timeout we got more info.
https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-assisted-installer-virt-tf/12560/testReport/
We see that on installation timeout we get more info:
waiting.exceptions.TimeoutExpired: Timeout of 3600 seconds expired waiting for Nodes to be in of the statuses ['installed']
time="2024-01-09T13:00:35Z" level=error msg="Failed to get list of nodes from k8s client" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitAndUpdateNodesStatus" file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:248" error="Get \"https://10.128.0.1:443/api/v1/nodes\": dial tcp 10.128.0.1:443: connect: connection refused" request_id=73103b20-8a7a-4693-8b08-806a3b88c7dd
time="2024-01-09T13:00:35Z" level=error msg="Failed to get list of nodes from k8s client" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitAndUpdateNodesStatus" file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:248" error="Get \"https://10.128.0.1:443/api/v1/nodes\": dial tcp 10.128.0.1:443: connect: connection refused" request_id=17f6a60a-7096-43aa-b5db-5b5fa5396b61
Another example:
waiting.exceptions.TimeoutExpired: Timeout of 3600 seconds expired waiting for Monitored ['builtin'] operators to be in of the statuses ['available']
time="2024-01-09T14:17:22Z" level=error msg="Failed to check if console is enabled" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.waitingForClusterOperators.func1 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:1003" error="Get \"https://localhost:6443/apis/config.openshift.io/v1/clusterversions/version\": dial tcp [::1]:6443: connect: connection refused"
time="2024-01-09T14:16:56Z" level=error msg="Failed to check if console is enabled" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.waitingForClusterOperators.func1 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:1003" error="Get \"https://localhost:6443/apis/config.openshift.io/v1/clusterversions/version\": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF"
Failed to deploy the following operators ['console']
Updated the log filter to "error=": I see that some errors in the log file are exposed with level=info ...
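For reference, a minimal sketch of that kind of filter (a hypothetical helper, not the PR's exact code): match lines carrying an "error=" field rather than only level=error, since some failures are logged with level=info but still include an error="..." payload.

```python
def filter_controller_errors(log_text: str, limit: int = 5) -> str:
    """Return the last `limit` controller log lines that contain an "error=" field."""
    error_lines = [line for line in log_text.splitlines() if "error=" in line]
    return "\n".join(error_lines[-limit:])
```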
/retest
@bkopilov: all tests passed!
Full PR test history. Your PR dashboard.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closed this PR.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close