Feature Request: Add option to retry failed operations
Feature request
We're using this action in a highly concurrent setup, and from time to time we get errors like these on some of the runs:
```
Run azure/aks-set-context@v3
  with:
    cluster-name: <cluster-name>
    resource-group: <resource-group>
    admin: false
    use-kubelogin: true
  env:
    AZURE_HTTP_USER_AGENT:
    AZUREPS_HOST_ENVIRONMENT:
/usr/bin/az aks get-credentials --resource-group <resource-group> --name <cluster-name> -f /home/runner/work/_temp/kubeconfig_1679069
ERROR: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Error: Error: The process '/usr/bin/az' failed with exit code 1
```
I'm guessing these are transient issues caused by the many concurrent connections, and we'd probably see fewer errors if failed Azure CLI commands could be retried.
My suggestion is to add the following input parameters and use them to retry failing Azure CLI commands:
```yaml
retries:
  description: 'Number of times to retry setting the context'
  default: 0
  required: false
retry-delay:
  description: 'Time to wait (in ms) between retries'
  default: 0
  required: false
```
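For illustration, this is how a workflow step could opt into the behaviour if these inputs were added. The `retries` and `retry-delay` parameters below are the proposed (not yet existing) inputs, and the values are only examples:

```yaml
- name: "Set kubectl context"
  uses: azure/aks-set-context@v3
  with:
    cluster-name: <cluster-name>
    resource-group: <resource-group>
    admin: false
    use-kubelogin: true
    retries: 3        # proposed input: retry the az call up to 3 times on failure
    retry-delay: 5000 # proposed input: wait 5000 ms between attempts
```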
We're currently running a fork where this is implemented, but I would prefer to clean it up and create a proper PR, if you're interested?
This issue is idle because it has been open for 14 days with no activity.
Hello @jooooel! Can you elaborate on what the "highly concurrent setup" means? Is this in reference to the runner or the actual workflow? I'm trying to understand the cause of the error. If the RemoteDisconnected error comes from the azure/aks-set-context action (in a proper environment), I agree that we need to add retries. If the "highly concurrent setup" is what's causing the errors, rather than azure/aks-set-context, I'm not sure it makes sense to add them here.
If the problem is coming from an improper environment configuration (thus making the error expected), it makes sense to handle the error at the level causing that misconfiguration. This could be done with one of the solutions detailed in this article (approach 2 or 3).
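To make the second option concrete, here is a generic workflow-level sketch (purely illustrative, and not necessarily one of the article's approaches): if the concurrency itself is what overwhelms the endpoint, the matrix can be throttled with `strategy.max-parallel`. All names and values below are placeholders:

```yaml
jobs:
  validation:              # hypothetical job name
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      max-parallel: 10     # illustrative value: cap how many matrix jobs run at once
      matrix:
        target: [a, b, c]  # placeholder matrix entries
    steps:
      # azure/login and other required steps omitted for brevity
      - uses: azure/aks-set-context@v3
        with:
          cluster-name: my-cluster          # placeholder
          resource-group: my-resource-group # placeholder
```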
Hi @OliverMKing! I'm not really sure I understand the distinction you're drawing. Here's a snippet of the workflow YAML (I have removed the irrelevant parts):
```yaml
jobs:
  reconciliation:
    name: "Reconciliation"
    runs-on: "ubuntu-latest"
    strategy:
      fail-fast: false # We want to know if there is more than one failing job when we fail
      matrix:
        kustomizations: ${{ fromJSON(inputs.updated_kustomizations) }} # This is a list of 30-40 items
    steps:
      - name: "Set kubectl context"
        uses: jooooel/aks-set-context@main
        with:
          cluster-name: "${{ inputs.aks_cluster }}"
          resource-group: "${{ inputs.resource_group }}"
          admin: "false"
          use-kubelogin: "true"
```
The error does come from the azure/aks-set-context action. As you can see from the YAML, we're using it in a job that is part of a matrix (with about 30-40 jobs). They all connect to the same cluster and perform different kinds of validations. I don't think all of them run at exactly the same time, but enough of them overlap for us to get these connection errors from time to time.
My guess is that it's related to some throttling or network issues on the AKS side, but I've been unable to verify that. I've been in contact with Azure support to try to figure out the root cause of the connection issues, but unfortunately they were of no help.
I understand your hesitation to put this into the action; if that's the decision, we're fine continuing to run our fork (where I added retries).
This issue is idle because it has been open for 14 days with no activity.