OCPBUGS-61976: fix(azure-metrics): add retry logic with exponential backoff for loadbalancer lookup
What this PR does / why we need it
Azure AKS conformance tests were failing because the azure-metrics-collector was attempting to access load balancers before they were fully created, resulting in 404 ResourceNotFound errors.
This change adds retry logic with exponential backoff to the getLoadBalancerID function:
- Retry up to 10 times for 404 errors only
- Initial backoff of 10s, doubling each retry up to 5min max
- Total max wait time ~20 minutes
- Proper context cancellation handling
- Detailed logging for debugging
With this fix, the metrics collector waits for the load balancer to be created instead of failing on the first 404.
Which issue(s) this PR fixes
- Fixes: OCPBUGS-61976
Additional Info
| Job | try 1 | try 2 | try 3 | try 4 |
|---|---|---|---|---|
| periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance | 🟢 | ? | ? | ? |
| periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance | 🟢 | ? | ? | ? |
| periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks | 🟢 | ? | ? | ? |
| periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aks | 🟢 | ? | ? | ? |
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
/test unit
/test verify
/test e2e-azure
@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a65e5280-955a-11f0-9c55-15d55240b00c-0
@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a9663cae-955a-11f0-85a6-3df681f70f55-0
/payload-job periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance
@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ae7da330-955a-11f0-8893-f3540b276085-0
/test e2e-azure
/payload-job periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance
@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/649155e0-978b-11f0-8b53-b976e8dd9b68-0
/payload-job periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance
@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0a2b4970-978c-11f0-9d3c-513df797b711-0
/payload periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aks
@jparrill: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.
/payload-job periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks
@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/cb48dc40-9790-11f0-8d01-b886ad32595f-0
/unassign @deads2k
/assign @bryan-cox
/assign @sjenning
/unassign @p0lyn0mial
/uncc @p0lyn0mial
/uncc @deads2k
/jira refresh
@jparrill: This pull request references Jira Issue OCPBUGS-61976, which is valid. The bug has been moved to the POST state.
3 validation(s) were run on this bug
- bug is open, matching expected state (open)
- bug target version (4.21.0) matches configured target version for branch (4.21.0)
- bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@jparrill: This pull request references Jira Issue OCPBUGS-61976, which is valid.
3 validation(s) were run on this bug
- bug is open, matching expected state (open)
- bug target version (4.21.0) matches configured target version for branch (4.21.0)
- bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.
/payload-job periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aks
@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aks
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f3caf1c0-97af-11f0-8d4a-4e7fa790b0c5-0
/payload-job periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks
@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0a0ba920-97b0-11f0-943c-c7577bfb6922-0