OCPBUGS-61976: fix(azure-metrics): add retry logic with exponential backoff for loadbalancer lookup

Open jparrill opened this issue 3 months ago • 73 comments

What this PR does / why we need it

Azure AKS conformance tests were failing because the azure-metrics-collector was attempting to access load balancers before they were fully created, resulting in 404 ResourceNotFound errors.

This change adds retry logic with exponential backoff to the getLoadBalancerID function:

  • Retry up to 10 times for 404 errors only
  • Initial backoff of 10s, doubling each retry up to 5min max
  • Total max wait time ~20 minutes
  • Proper context cancellation handling
  • Detailed logging for debugging

The fix ensures the metrics collector waits for load balancer creation instead of failing immediately on timing issues.

Which issue(s) this PR fixes

Additional Info

Job | try 1 | try 2 | try 3 | try 4
periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance | 🟢 | ? | ? | ?
periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance | 🟢 | ? | ? | ?
periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks | 🟢 | ? | ? | ?
periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aks | 🟢 | ? | ? | ?

Sep 19 '25 10:09 jparrill

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

Sep 19 '25 10:09 openshift-ci[bot]

/test unit

Sep 19 '25 10:09 jparrill

/test verify

Sep 19 '25 10:09 jparrill

/test e2e-azure

Sep 19 '25 10:09 jparrill

@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a65e5280-955a-11f0-9c55-15d55240b00c-0

Sep 19 '25 13:09 openshift-ci[bot]

@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a9663cae-955a-11f0-85a6-3df681f70f55-0

Sep 19 '25 13:09 openshift-ci[bot]

/payload-job periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance

Sep 19 '25 13:09 jparrill

@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ae7da330-955a-11f0-8893-f3540b276085-0

Sep 19 '25 13:09 openshift-ci[bot]

/test e2e-azure

Sep 22 '25 08:09 jparrill

/payload-job periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance

Sep 22 '25 08:09 jparrill

@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/649155e0-978b-11f0-8b53-b976e8dd9b68-0

Sep 22 '25 08:09 openshift-ci[bot]

/payload-job periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance

Sep 22 '25 08:09 jparrill

@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0a2b4970-978c-11f0-9d3c-513df797b711-0

Sep 22 '25 08:09 openshift-ci[bot]

/payload periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aks

Sep 22 '25 08:09 jparrill

@jparrill: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

Sep 22 '25 08:09 openshift-ci[bot]

/payload-job periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks

Sep 22 '25 08:09 jparrill

@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/cb48dc40-9790-11f0-8d01-b886ad32595f-0

Sep 22 '25 08:09 openshift-ci[bot]

/unassign @deads2k

Sep 22 '25 10:09 jparrill

/assign @bryan-cox

Sep 22 '25 10:09 jparrill

/assign @sjenning

Sep 22 '25 10:09 jparrill

/unassign @p0lyn0mial

Sep 22 '25 10:09 jparrill

/uncc @p0lyn0mial

Sep 22 '25 10:09 jparrill

/uncc @deads2k

Sep 22 '25 10:09 jparrill

/jira refresh

Sep 22 '25 10:09 jparrill

@jparrill: This pull request references Jira Issue OCPBUGS-61976, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira, skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

What this PR does / why we need it

Azure AKS conformance tests were failing because the azure-metrics-collector was attempting to access load balancers before they were fully created, resulting in 404 ResourceNotFound errors.

This change adds retry logic with exponential backoff to the getLoadBalancerID function:

  • Retry up to 10 times for 404 errors only
  • Initial backoff of 10s, doubling each retry up to 5min max
  • Total max wait time ~20 minutes
  • Proper context cancellation handling
  • Detailed logging for debugging

The fix ensures the metrics collector waits for load balancer creation instead of failing immediately on timing issues.

Which issue(s) this PR fixes

Additional Info

Job | try 1 | try 2 | try 3 | try 4
periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance | 🟢 | ? | ? | ?
periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance | 🟢 | ? | ? | ?
periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks | ? | ? | ? | ?
periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aks | ? | ? | ? | ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sep 22 '25 10:09 openshift-ci-robot

@jparrill: This pull request references Jira Issue OCPBUGS-61976, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira, skipping review request.

In response to this:

/jira refresh

Sep 22 '25 10:09 openshift-ci-robot

/payload-job periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aks

Sep 22 '25 12:09 jparrill

@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aks

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f3caf1c0-97af-11f0-8d4a-4e7fa790b0c5-0

Sep 22 '25 12:09 openshift-ci[bot]

/payload-job periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks

Sep 22 '25 12:09 jparrill

@jparrill: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-aks

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0a0ba920-97b0-11f0-943c-c7577bfb6922-0

Sep 22 '25 12:09 openshift-ci[bot]