[BUG] Command Invoke in private cluster creates pods that are never deleted if the command(s) doesn't go through

Open amitmavgupta opened this issue 2 years ago • 31 comments

Describe the bug Create a Private Cluster in AKS following this document https://learn.microsoft.com/en-gb/azure/aks/access-private-cluster?tabs=azure-portal

To Reproduce

If a command errors out, as shown below (here a Helm repo had not been added, which was the user's mistake):

az aks command invoke \
  --resource-group privatecluster \
  --name privatecluster \
  --command "helm install cilium cilium/cilium --version 1.13.4 --namespace kube-system --set aksbyocni.enabled=true --set nodeinit.enabled=true"
command started at 2023-10-04 12:54:50+00:00, finished at 2023-10-04 12:54:51+00:00 with exitcode=1
Error: INSTALLATION FAILED: repo cilium not found

After fixing this error, if the user lists the pods (see the command below), a few pods are in Error state because their commands did not go through. This is not transient: the age of the pods shows that they are never deleted.

az aks command invoke \
  --resource-group privatecluster \
  --name privatecluster \
  --command "kubectl get pods -A -o wide"
command started at 2023-10-04 13:14:31+00:00, finished at 2023-10-04 13:14:32+00:00 with exitcode=0
NAMESPACE     NAME                                       READY   STATUS      RESTARTS   AGE   IP           NODE                                NOMINATED NODE   READINESS GATES
aks-command   command-0679b33e1a164f34b3e17580607a1dd6   0/1     Completed   0          20m   10.244.2.3   aks-nodepool1-26012924-vmss000002   <none>           <none>
aks-command   command-1464a85e6b8844adb9bdc8a6f60d1336   0/1     Completed   0          41s   10.0.2.213   aks-nodepool1-26012924-vmss000002   <none>           <none>
aks-command   command-278f23c16d45436b8ceca12663b6d196   0/1     Completed   0          16m   10.0.2.97    aks-nodepool1-26012924-vmss000002   <none>           <none>
aks-command   command-32a4a1e90c2748d09ee178259795917d   0/1     Error       0          19m   10.244.2.5   aks-nodepool1-26012924-vmss000002   <none>           <none>
aks-command   command-57870e11e1094f66aa1fb44466638a65   0/1     Error       0          19m   10.244.2.4   aks-nodepool1-26012924-vmss000002   <none>           <none>
aks-command   command-7d63767d93024d04ac83198d6a24695d   0/1     Completed   0          26s   10.0.2.209   aks-nodepool1-26012924-vmss000002   <none>           <none>
aks-command   command-8acea6b0a6af4bb189a3bb7bcf5cff5e   1/1     Running     0          2s    10.0.2.198   aks-nodepool1-26012924-vmss000002   <none>           <none>
aks-command   command-b77e7fcb9fac44e8a8c53e8e198bdfe1   0/1     Completed   0          18m   10.244.2.6   aks-nodepool1-26012924-vmss000002   <none>           <none>
aks-command   command-c627cfe313cc4a7c974a322e53165d25   0/1     Completed   0          18m   10.244.2.7   aks-nodepool1-26012924-vmss000002   <none>           <none>
kube-system   azure-ip-masq-agent-lbjxj                  1/1     Running     0          29m   10.224.0.7   aks-nodepool1-26012924-vmss000001   <none>           <none>
kube-system   azure-ip-masq-agent-r4frx                  1/1     Running     0          29m   10.224.0.5   aks-nodepool1-26012924-vmss000000   <none>           <none>
kube-system   azure-ip-masq-agent-wph62                  1/1     Running     0          29m   10.224.0.6   aks-nodepool1-26012924-vmss000002   <none>           <none>
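To pick out just the leftover pods from a listing like the one above, a small filter can help. This is a minimal sketch, not part of the az CLI: the function name failed_command_pods is hypothetical, and it assumes the default column order of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper (not an az or kubectl feature): reads the output of
# `kubectl get pods -A` on stdin and prints the names of pods in the
# aks-command namespace whose STATUS column is Error.
# Assumes the default column order: NAMESPACE NAME READY STATUS ...
failed_command_pods() {
  awk '$1 == "aks-command" && $4 == "Error" { print $2 }'
}
```

Piping the listing above into failed_command_pods would print the two pod names whose STATUS is Error, command-32a4a1e90c2748d09ee178259795917d and command-57870e11e1094f66aa1fb44466638a65. Header and "command started at" lines are skipped because their first field is not aks-command.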

Expected behavior These pods should be in Completed state; otherwise, a user of this feature could eat into the pod count that a particular network plugin supports.

Screenshots

Environment (please complete the following information):

  • CLI Version 2.53.0
  • Kubernetes version 1.26.6

Additional context

@wedaly @tamilmani1989

amitmavgupta, Oct 04 '23 14:10

I'm not familiar with the az aks command invoke, but I'd guess it keeps the failed pods so a user could inspect the logs? I don't believe pods in Error state count towards the max pods limit on a node.
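If the failed pods are indeed kept for log inspection (a guess, not confirmed in this thread), one way to handle them is to read the logs first and delete the pod afterwards. The sketch below only prints the kubectl commands for review rather than running them; the function name print_cleanup_commands is hypothetical, and the aks-command namespace is taken from the pod listing in the issue.

```shell
# Hypothetical helper: for each pod name read from stdin, print (do not run)
# the kubectl commands to view its logs and then delete it.
# The namespace aks-command comes from the pod listing in this issue.
print_cleanup_commands() {
  while IFS= read -r pod; do
    # Skip blank lines so empty input prints nothing.
    [ -n "$pod" ] || continue
    printf 'kubectl logs --namespace aks-command %s\n' "$pod"
    printf 'kubectl delete pod --namespace aks-command %s\n' "$pod"
  done
}
```

On a private cluster each printed command would still have to be passed through az aks command invoke --command "...", which itself starts another command pod. Assuming the Error pods report phase Failed, kubectl delete pods --namespace aks-command --field-selector status.phase=Failed should remove them in a single step.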

wedaly, Oct 04 '23 15:10

Ah yes, that makes sense Will. Didn't think about that.

Also, good to know that it doesn't count towards the max pods limit.

amitmavgupta, Oct 04 '23 15:10

@wedaly do we still need to track this, just to make sure that someone can see why these pods are left hanging, or is it safe to ignore?

amitmavgupta, Oct 06 '23 10:10

Action required from @Azure/aks-pm

Issue needing attention of @Azure/aks-leads
