ratify icon indicating copy to clipboard operation
ratify copied to clipboard

ci: update azure SP federated credentials

Open akashsinghal opened this issue 9 months ago • 3 comments

Description

What this PR does / why we need it:

This PR migrates the Azure SPN used for AKS e2e tests to use federated identities via OIDC

  • It removes dependence on SPN secret
  • It adds a new aks-deploy environment for jobs that require azure auth
  • It force bumps the az cli version so that it is >=2.60.0 (Note: this can be removed after the default github runner image uses a newer version of the az cli https://github.com/actions/runner-images?tab=readme-ov-file#available-images)
  • It caches access tokens for registry scope and key vault scope so that azure resource operations will not fail auth after 5 minutes.

Once PR merges to dev:

  • delete ALL secrets in SP
  • each of the maintainers can choose to create a federated credential to their own fork under environment aks-deploy

Which issue(s) this PR fixes (optional, using fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when the PR gets merged):

Fixes #1435

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Helm Chart Change (any edit/addition/update that is necessary for changes merged to the main branch)
  • [ ] This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Please also list any relevant details for your test configuration

  • [ ] Test A
  • [ ] Test B

Checklist:

  • [ ] Does the affected code have corresponding tests?
  • [ ] Are the changes documented, not just with inline documentation, but also with conceptual documentation such as an overview of a new feature, or task-based documentation like a tutorial? Consider if this change should be announced on your project blog.
  • [ ] Does this introduce breaking changes that would require an announcement or bumping the major version?
  • [ ] Do all new files have appropriate license header?

Post Merge Requirements

  • [ ] MAINTAINERS: manually trigger the "Publish Package" workflow after merging any PR that indicates Helm Chart Change

akashsinghal avatar May 06 '24 20:05 akashsinghal

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 68.16%. Comparing base (1bd347c) to head (5f9861c). Report is 14 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1442      +/-   ##
==========================================
+ Coverage   66.76%   68.16%   +1.39%     
==========================================
  Files         116      119       +3     
  Lines        6030     6134     +104     
==========================================
+ Hits         4026     4181     +155     
+ Misses       1620     1559      -61     
- Partials      384      394      +10     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar May 06 '24 20:05 codecov[bot]

@susanshi Here are some findings on current state of the aks test cleanup:

  1. The aks-test-cleanup job is currently set to need the aks-test job in order to run. Currently, the needs trigger by definition requires success of the job it needs. In our case, not all AKS tests usually succeed so we want a trigger that will only run after the AKS test if it is triggered BUT regardless if the AKS tests succeeded or failed. I looked at the docs and it makes it seem like it is possible: https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-not-requiring-successful-dependent-jobs but this doesn't seem to work. I tried and with always() the cleanup step always runs irrespective of if AKS test ran or not which is not what we want. I also tried to add a conditional if that is the same as the AKS test trigger but keep the needs. That also does not seem to work.
  2. The current AKS test has a cleanup() function that deletes the resource group upon a failure/exit. That seems to work however it may not always work? I do sometimes see that there are dangling resource groups.

I'm thinking maybe this cleanup optimization is out of scope for this PR?

akashsinghal avatar May 07 '24 18:05 akashsinghal

verified on local fork at least one test passes: image

akashsinghal avatar May 07 '24 23:05 akashsinghal

@susanshi Here are some findings on current state of the aks test cleanup:

  1. The aks-test-cleanup job is currently set to need the aks-test job in order to run. Currently, the needs trigger by definition requires success of the job it needs. In our case, not all AKS tests usually succeed so we want a trigger that will only run after the AKS test if it is triggered BUT regardless if the AKS tests succeeded or failed. I looked at the docs and it makes it seem like it is possible: https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-not-requiring-successful-dependent-jobs but this doesn't seem to work. I tried and with always() the cleanup step always runs irrespective of if AKS test ran or not which is not what we want. I also tried to add a conditional if that is the same as the AKS test trigger but keep the needs. That also does not seem to work.
  2. The current AKS test has a cleanup() function that deletes the resource group upon a failure/exit. That seems to work however it may not always work? I do sometimes see that there are dangling resource groups.

I'm thinking maybe this cleanup optimization is out of scope for this PR?

Thanks for the investigation. We added the cleanup job because the cleanup() function could be skipped. I agree it's not in scope of this PR, we can fix it later.

binbin-li avatar May 08 '24 01:05 binbin-li