Feature Request: Add option to retry failed operations

Open jooooel opened this issue 1 year ago • 4 comments

Feature request

We're using this action in a highly concurrent setup, and from time to time we get errors like these on some of the runs:

Run azure/aks-set-context@v3
  with:
    cluster-name: <cluster-name>
    resource-group: <resource-group>
    admin: false
    use-kubelogin: true
  env:
    AZURE_HTTP_USER_AGENT: 
    AZUREPS_HOST_ENVIRONMENT: 
/usr/bin/az aks get-credentials --resource-group <resource-group> --name <cluster-name> -f /home/runner/work/_temp/kubeconfig_1679069
ERROR: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Error: Error: The process '/usr/bin/az' failed with exit code 1

I'm guessing these are transient failures caused by the many concurrent connections, and we'd probably see fewer errors if failed Azure CLI commands could be retried.

My suggestion is to add the following input parameters and use them to retry failing Azure CLI commands:

   retries:
      description: 'Number of times to retry setting the context'
      default: 0
      required: false
   retry-delay:
      description: 'Time to wait (in ms) between retries'
      default: 0
      required: false

We're currently running a fork with this implemented, but I would prefer to clean it up and create a proper PR if you're interested.
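
For illustration, here is a minimal sketch of how the two proposed inputs might drive a retry loop. It is not the code from the fork: `withRetries` is a hypothetical helper name, and `retries` and `retryDelayMs` are assumed to hold the parsed values of the `retries` and `retry-delay` inputs.

```typescript
// Hypothetical helper (not part of azure/aks-set-context): runs an async
// operation once, then retries it up to `retries` more times on failure,
// waiting `retryDelayMs` milliseconds between attempts.
async function withRetries<T>(
  operation: () => Promise<T>,
  retries: number,
  retryDelayMs: number
): Promise<T> {
  // Treat negative or fractional input as a whole number of retries >= 0.
  const maxRetries = Math.max(0, Math.floor(retries));
  let lastError: unknown;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Only wait if another attempt will follow.
      if (attempt < maxRetries && retryDelayMs > 0) {
        await new Promise<void>((resolve) => setTimeout(resolve, retryDelayMs));
      }
    }
  }

  // Every attempt failed: surface the most recent error.
  throw lastError;
}
```

With both defaults at 0, the operation runs exactly once and fails immediately, so existing workflows would behave as they do today. In the action, `operation` would be whichever function invokes `az aks get-credentials`.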

jooooel, Mar 20 '23 13:03

This issue is idle because it has been open for 14 days with no activity.

github-actions[bot], Apr 03 '23 15:04

Hello @jooooel! Can you elaborate on what the "highly concurrent setup" means? Is this in reference to the runner or the actual workflow? I'm trying to understand the cause of the error. If the RemoteDisconnected error comes from the azure/aks-set-context action (in a proper environment), I agree that we need to add retries. If the "highly concurrent setup" is what's causing the errors, rather than azure/aks-set-context, I'm not sure it makes sense to add them here.

If the problem comes from an improper environment configuration (thus making the error expected), it makes sense to handle the error at the level causing that misconfiguration. This could be done with one of the solutions detailed in this article (approach 2 or 3).

OliverMKing, Apr 03 '23 17:04

Hi @OliverMKing! I'm not sure I understand the distinction you're drawing. Here's a snippet of the YAML (I have removed irrelevant parts):

jobs:
  reconciliation:
    name: "Reconciliation"
    runs-on: "ubuntu-latest"

    strategy:
      fail-fast: false # We want to know whether more than one job is failing
      matrix:
        kustomizations: ${{ fromJSON(inputs.updated_kustomizations) }} # This is a list of 30-40 items
    steps:
      - name: "Set kubectl context"
        uses: jooooel/aks-set-context@main
        with:
          cluster-name: "${{ inputs.aks_cluster }}"
          resource-group: "${{ inputs.resource_group }}"
          admin: "false"
          use-kubelogin: "true"

The error does come from the azure/aks-set-context action. As you can see from the YAML, we're using it in a job that is part of a matrix (with about 30-40 jobs). They all try to connect to the same cluster, performing different kinds of validations. I don't think all of them run at exactly the same time, but enough of them do for us to get these connection errors from time to time.

My guess is that it's related to throttling or network issues on the AKS side, but I've been unable to verify that. I've been in contact with Azure support to try to find the root cause of the connection issues, but unfortunately they were of no help.

I understand your hesitation to put this into the action, and we are currently fine running our fork (where I added retries) if that's the case.

jooooel, Apr 04 '23 06:04

This issue is idle because it has been open for 14 days with no activity.

github-actions[bot], Apr 18 '23 09:04