[BUG] Pods are not created on AKS Windows nodes with Cilium as the CNI, even though max pods is set to 250

Open amitmavgupta opened this issue 2 years ago • 4 comments

Describe the bug

  1. AKS with a Windows node pool and Cilium as the CNI, installed in BYOCNI mode (aksbyocni).
  2. A cluster is brought up on AKS with Windows:

az provider show -n Microsoft.OperationsManagement -o table
az provider show -n Microsoft.OperationalInsights -o table

az provider register --namespace Microsoft.OperationsManagement
az provider register --namespace Microsoft.OperationalInsights

az group create --name myResourceGroup --location eastus

echo "Please enter the username to use as administrator credentials for Windows Server nodes on your cluster: " && read WINDOWS_USERNAME


echo "Please enter the password to use as administrator credentials for Windows Server nodes on your cluster: " && read WINDOWS_PASSWORD


az aks create \
    --resource-group myResourceGroup \
    --name cluster1 \
    --node-count 2 \
    --enable-addons monitoring \
    --generate-ssh-keys \
    --windows-admin-username "$WINDOWS_USERNAME" \
    --windows-admin-password "$WINDOWS_PASSWORD" \
    --vm-set-type VirtualMachineScaleSets \
    --network-plugin none

az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name cluster1 \
    --os-type Windows \
    --name npwin \
    --node-count 1

Cilium is then installed:

helm install cilium cilium/cilium --version 1.13.4 \
    --namespace kube-system \
    --set aksbyocni.enabled=true \
    --set nodeinit.enabled=true

To Reproduce

  1. As soon as Cilium comes up, a few pods are stuck in the ContainerCreating state:

NAMESPACE     NAME                                  READY   STATUS              RESTARTS   AGE
kube-system   ama-logs-4phmr                        3/3     Running             0          17m
kube-system   ama-logs-gdb7j                        3/3     Running             0          18m
kube-system   ama-logs-rs-54d8df865-9bg7n           2/2     Running             0          18m
kube-system   ama-logs-windows-vdjfk                0/1     ContainerCreating   0          8m38s
kube-system   cilium-85qt7                          1/1     Running             0          2m9s
kube-system   cilium-node-init-brdl5                1/1     Running             0          2m9s
kube-system   cilium-node-init-r4prg                1/1     Running             0          2m9s
kube-system   cilium-nvsb9                          1/1     Running             0          2m9s
kube-system   cilium-operator-fdc5f8984-h78jd       1/1     Running             0          2m9s
kube-system   cilium-operator-fdc5f8984-rzghw       1/1     Running             0          2m9s
kube-system   cloud-node-manager-8bv9d              1/1     Running             0          17m
kube-system   cloud-node-manager-k9pd4              1/1     Running             0          18m
kube-system   cloud-node-manager-windows-qs5v5      0/1     ContainerCreating   0          9m6s
kube-system   coredns-autoscaler-69b7556b86-fqlqv   1/1     Running             0          18m
kube-system   coredns-fb6b9d95f-lbgbh               1/1     Running             0          18m
kube-system   coredns-fb6b9d95f-wwr86               1/1     Running             0          68s
kube-system   csi-azuredisk-node-cq7zz              3/3     Running             0          18m
kube-system   csi-azuredisk-node-pkwjq              3/3     Running             0          17m
kube-system   csi-azuredisk-node-win-b98wk          0/3     ContainerCreating   0          9m6s
kube-system   csi-azurefile-node-v9gbm              3/3     Running             0          18m
kube-system   csi-azurefile-node-win-jbv5k          0/3     ContainerCreating   0          9m6s
kube-system   csi-azurefile-node-zm4ln              3/3     Running             0          17m
kube-system   konnectivity-agent-f6b459979-jm8v7    1/1     Running             0          3m31s
kube-system   konnectivity-agent-f6b459979-ktzx5    1/1     Running             0          3m41s
kube-system   kube-proxy-8hdwv                      1/1     Running             0          17m
kube-system   kube-proxy-d4msp                      1/1     Running             0          18m
kube-system   metrics-server-5dd7f7965f-kbz92       2/2     Running             0          73s
kube-system   metrics-server-5dd7f7965f-w2vkp       2/2     Running             0          73s

kubectl get pods -A | wc -l
30
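To pick out only the stuck pods from a listing like the one above, the output can be filtered on the STATUS column. This is a minimal sketch, assuming the default `kubectl get pods -A` column layout (STATUS is the fourth column); the function name `pods_not_running` is made up for illustration.

```shell
# Read `kubectl get pods -A` output on stdin and print every pod whose
# STATUS column is anything other than "Running".
# Assumption: default column order NAMESPACE NAME READY STATUS RESTARTS AGE.
pods_not_running() {
  # NR > 1 skips the header row; $4 is the STATUS column.
  awk 'NR > 1 && $4 != "Running" { print $1 "/" $2 " " $4 }'
}

# Usage against a live cluster (requires kubectl and cluster access):
#   kubectl get pods -A | pods_not_running
```

Note that `wc -l` on the raw `kubectl get pods -A` output also counts the header line.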

###############

On further triage, we can see that the pods report that no IP addresses are available:

Warning FailedCreatePodSandBox 9m41s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4e6946968e241b72da5c1523541f0466a5f3475b6f3083fac27660e59e60051e": plugin type="azure-vnet" failed (add): IPAM Invoker Add failed with error: Failed to allocate pool: Failed to delegate: Failed to allocate address: No available addresses
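The event message names the CNI plugin that was invoked for the failing sandbox, which helps when checking whether a node is using the CNI you expect. This is a minimal sketch that extracts that field from event text on stdin; the function name `cni_plugin_from_event` is hypothetical, and the pattern assumes the `plugin type="..."` wording seen in the event above.

```shell
# Read event messages on stdin and print the CNI plugin type named in each
# "plugin type=\"...\"" fragment. Lines without that fragment are skipped.
cni_plugin_from_event() {
  sed -n 's/.*plugin type="\([^"]*\)".*/\1/p'
}

# Possible usage against a live cluster (requires kubectl and cluster access):
#   kubectl get events -A --field-selector reason=FailedCreatePodSandBox \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' \
#     | cni_plugin_from_event | sort | uniq -c
```

For the event shown above, this prints `azure-vnet`.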

###############

From /etc/default/kubelet on the node, we can see that max pods is set to 250:

KUBELET_FLAGS=--address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/azure.json --cgroups-per-qos=true --client-ca-file=/etc/kubernetes/certs/ca.crt --cloud-provider=external --cluster-dns=10.0.0.10 --cluster-domain=cluster.local --container-log-max-size=50M --enforce-node-allocatable=pods --event-qps=0 --eviction-hard=memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%,pid.available<2000 --feature-gates=CSIMigration=true,CSIMigrationAzureDisk=true,CSIMigrationAzureFile=true,DelegateFSGroupToCSIDriver=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --keep-terminated-pod-volumes=false --kube-reserved=cpu=100m,memory=1638Mi,pid=1000 --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=250 --node-status-update-frequency=10s --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:3.6 --pod-manifest-path=/etc/kubernetes/manifests --protect-kernel-defaults=true --read-only-port=0 --rotate-certificates=true --streaming-connection-idle-timeout=4h --tls-cert-file=/etc/kubernetes/certs/kubeletserver.crt --tls-
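The flag of interest is buried in a long line, so it can help to extract it directly. This is a minimal sketch, assuming the flags file contains a `--max-pods=<n>` entry as shown above; `kubelet_max_pods` is a hypothetical helper name.

```shell
# Read kubelet flags on stdin and print the numeric value of --max-pods.
# Prints nothing if the flag is absent.
kubelet_max_pods() {
  # The "--" tells grep that the following "--max-pods..." is a pattern,
  # not an option.
  grep -o -- '--max-pods=[0-9]*' | head -n 1 | cut -d= -f2
}

# Usage on a node:
#   kubelet_max_pods < /etc/default/kubelet
```

Keep in mind that `--max-pods` only caps how many pods kubelet will run on the node. The "No available addresses" error above comes from the CNI plugin's IPAM, so a high max-pods value does not by itself mean that pod IPs can be allocated.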

###################

{ "cniVersion": "0.3.1", "name": "cilium", "type": "cilium-cni", "enable-debug": false, "log-file": "/var/run/cilium/cilium-cni.log" }

Expected behavior

All pods should be up and running for further tests to be done.

Environment (please complete the following information):

  • CLI Version 2.49.0
  • Kubernetes version 1.25.6


amitmavgupta, Jun 21 '23 13:06

@amitmavgupta It's currently not available for Windows node pools. From the documentation: "Azure CNI powered by Cilium currently has the following limitations: Available only for Linux and not for Windows."

https://learn.microsoft.com/en-us/azure/aks/azure-cni-powered-by-cilium#limitations

abarqawi, Jun 25 '23 08:06

@abarqawi Thanks. Any timeline for this support to be available?

amitmavgupta, Jul 13 '23 14:07

@amitmavgupta - Could you provide additional detail regarding your requirement for Windows support?

allyford, Aug 11 '23 18:08

I would also like this for Windows node pools. The requirement is just that it supports them. Currently, if a cluster has Windows nodes, you can't move to Cilium. I would just like to know whether it's on a future roadmap or is never going to happen.

derek-andrews-work, Jan 29 '24 16:01

This issue has been automatically marked as stale because it has not had any activity for 180 days. It will be closed if no further activity occurs within 7 days of this comment. @allyford

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @amitmavgupta feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.