assisted-test-infra
Add informative details from assisted-installer-controller log when time…
Add controller last error when timeout occurs during installation
During installation, when the waiter hits a timeout, it raises a timeout exception without any information about the root cause, and the installer timers keep waiting for a long time until the cluster changes state to failed/success.
Added support for picking up the last error from the assisted-installer-controller logs when a timeout exception is raised by the waiter, and appended that information to the raised exception.
Created a decorator for waiter functions (a sketch follows this list); when a waiter function is wrapped:
- If the waiter raises no exception, return the function result to the caller as is
- If the waiter bubbles up TimeoutExpired:
- verify the waiter was called from a cluster waiter
- try to download the kubeconfig; return true on success
- try to get the assisted-installer-controller logs filtered by level=error
- append the last error to the exception; if there is no error, there is nothing to append. The logs call may sometimes return empty or fail due to the VIP network, so retry
- raise the new, updated TimeoutExpired exception to the caller
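A minimal sketch of such a decorator, assuming the `waiting` library's TimeoutExpired exposes `timeout_seconds` and `what`; `download_kubeconfig()` and `get_controller_error_logs()` are hypothetical stand-ins for the real cluster and log-collection helpers, not the actual PR code:

```python
import functools

from waiting.exceptions import TimeoutExpired


def download_kubeconfig(cluster) -> bool:
    """Stand-in (hypothetical): try to download the cluster kubeconfig, True on success."""
    ...


def get_controller_error_logs(cluster) -> str:
    """Stand-in (hypothetical): fetch assisted-installer-controller logs filtered to errors."""
    ...


def append_controller_errors_on_timeout(waiter_func):
    """Decorator: enrich TimeoutExpired raised by a cluster waiter with controller errors."""

    @functools.wraps(waiter_func)
    def wrapper(cluster, *args, **kwargs):
        try:
            # No exception from the waiter: return its result to the caller as is.
            return waiter_func(cluster, *args, **kwargs)
        except TimeoutExpired as timeout_error:
            # (Verification that this really is a cluster waiter is omitted here.)
            last_errors = ""
            # The logs call may return empty or fail (e.g. VIP network issues), so retry.
            for _ in range(3):
                try:
                    if not download_kubeconfig(cluster):
                        continue
                    last_errors = get_controller_error_logs(cluster)
                    if last_errors:
                        break
                except Exception:
                    continue
            what = timeout_error.what
            if last_errors:
                what = f"{what}\n{last_errors}"
            # Raise a new TimeoutExpired that carries the controller error context.
            raise TimeoutExpired(timeout_error.timeout_seconds, what) from timeout_error

    return wrapper
```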
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: bkopilov.
Once this PR has been reviewed and has the lgtm label, please assign eranco74 for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.
Hi @bkopilov. Thanks for your PR.
I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@lalon4
/ok-to-test
@eliajahshan, @talhil-rh
I ran CI regression3 with this fix and it worked; on timeout we got more info.
https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-assisted-installer-virt-tf/12560/testReport/
We see that on installation timeout we get more info:
waiting.exceptions.TimeoutExpired: Timeout of 3600 seconds expired waiting for Nodes to be in of the statuses ['installed']
time="2024-01-09T13:00:35Z" level=error msg="Failed to get list of nodes from k8s client" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitAndUpdateNodesStatus" file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:248" error="Get \"https://10.128.0.1:443/api/v1/nodes\": dial tcp 10.128.0.1:443: connect: connection refused" request_id=73103b20-8a7a-4693-8b08-806a3b88c7dd
time="2024-01-09T13:00:35Z" level=error msg="Failed to get list of nodes from k8s client" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitAndUpdateNodesStatus" file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:248" error="Get \"https://10.128.0.1:443/api/v1/nodes\": dial tcp 10.128.0.1:443: connect: connection refused" request_id=17f6a60a-7096-43aa-b5db-5b5fa5396b61
Another example:
waiting.exceptions.TimeoutExpired: Timeout of 3600 seconds expired waiting for Monitored ['builtin'] operators to be in of the statuses ['available']
time="2024-01-09T14:17:22Z" level=error msg="Failed to check if console is enabled" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.waitingForClusterOperators.func1 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:1003" error="Get \"https://localhost:6443/apis/config.openshift.io/v1/clusterversions/version\": dial tcp [::1]:6443: connect: connection refused"
time="2024-01-09T14:16:56Z" level=error msg="Failed to check if console is enabled" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.waitingForClusterOperators.func1 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:1003" error="Get \"https://localhost:6443/apis/config.openshift.io/v1/clusterversions/version\": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF"
Failed to deploy the following operators ['console']
Updated the log filter to "error=": I see that some errors in the log file are exposed with level=info ...
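For reference, a minimal sketch of that kind of filter (a hypothetical helper, not the PR's exact code): match lines carrying an "error=" field rather than only level=error, since some failures are logged with level=info but still include an error="..." payload.

```python
def filter_controller_errors(log_text: str, limit: int = 5) -> str:
    """Return the last `limit` controller log lines that contain an "error=" field."""
    error_lines = [line for line in log_text.splitlines() if "error=" in line]
    return "\n".join(error_lines[-limit:])
```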
/retest
@bkopilov: all tests passed!
Full PR test history. Your PR dashboard.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closed this PR.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close