architecture-center
Update commands used for TS in troubleshoot-network-aks.md
Describe the summary, scope, and intent of this PR:
The current commands are outdated:
- The path is not correct (it should refer to kubenet, the scenario used in this article).
- AKS no longer uses docker but containerd, so the commands need to be modified.
Insert links(s) to any related work item(s) or supporting detail:
Kubernetes support for Docker via dockershim has been removed: https://kubernetes.io/blog/2020/12/02/dont-panic-kubernetes-and-docker/
The change to point 1 includes:
- Even the original command is wrong in the old environment, because if you execute it, you can see:
/var/lib/cni/networks/k8s-pod-network# ls -la
total 56
drwxr-xr-x 2 root root 4096 Mar 5 11:34 .
drwxr-xr-x 3 root root 4096 Feb 8 04:38 ..
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.17
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.18
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.19
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.20
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.21
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.23
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.25
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.27
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.30
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.31
-rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.32
-rw-r--r-- 1 root root 12 Mar 5 11:34 last_reserved_ip.0
-rwxr-x--- 1 root root 0 Feb 8 04:38 lock
The line drwxr-xr-x 2 root root 4096 Mar 5 11:34 . and the line drwxr-xr-x 3 root root 4096 Feb 8 04:38 .. are included, which inflates the IP count by 2.
So, based on the current environment, I exclude the lines that end with . and the lines that contain last or lock, in order to get the correct number of allocated IPs. This should be fair enough.
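The exclusion rule above can also be sketched without parsing ls output. This is a minimal sketch, not the command proposed in the PR: the function name count_allocated_ips is hypothetical, and it assumes that every regular file other than lock and last_reserved_ip.* is named after one allocated IP, as in the listing above.

```shell
# Count IP reservation files in an IPAM state directory.
# Assumption: each regular file except "lock" and "last_reserved_ip.*"
# represents exactly one allocated IP address.
count_allocated_ips() {
    dir=$1
    # -type f skips "." and ".." automatically, so no "total" line
    # or directory entries need to be filtered out.
    find "$dir" -maxdepth 1 -type f \
        ! -name 'lock' ! -name 'last_reserved_ip.*' | wc -l | tr -d ' '
}
```

On a node, it would be called as count_allocated_ips /var/lib/cni/networks/k8s-pod-network. Using find rather than ls -la avoids the off-by-two problem because directory entries are never listed.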
Some outputs for result integrity check (for point 2):
kubectl get pods --field-selector spec.nodeName=aks-userpool-25748257-vmss000000,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != 'true').status.podIP'
198.18.1.25
198.18.1.23
198.18.1.27
198.18.1.20
198.18.1.30
198.18.1.18
198.18.1.21
198.18.1.31
198.18.1.19
198.18.1.17
198.18.1.32
kubectl get pods --field-selector spec.nodeName=aks-userpool-25748257-vmss000000,status.phase=Running -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-system calico-kube-controllers-776b76df8f-lmbvk 1/1 Running 0 25h 198.18.1.25 aks-userpool-25748257-vmss000000 <none> <none>
calico-system calico-node-xwpb6 1/1 Running 0 8h 10.2.0.4 aks-userpool-25748257-vmss000000 <none> <none>
calico-system calico-typha-7b8cf7bb4b-9j727 1/1 Running 0 25h 10.2.0.4 aks-userpool-25748257-vmss000000 <none> <none>
default test-anykap-6xp8v 1/1 Running 5 (10h ago) 5d2h 10.2.0.4 aks-userpool-25748257-vmss000000 <none> <none>
gatekeeper-system gatekeeper-audit-59875b6cdc-gtwtq 1/1 Running 0 25h 198.18.1.23 aks-userpool-25748257-vmss000000 <none> <none>
gatekeeper-system gatekeeper-controller-58498fccdc-d48ck 1/1 Running 0 25h 198.18.1.27 aks-userpool-25748257-vmss000000 <none> <none>
kube-system ama-logs-tqkm5 3/3 Running 21 (10h ago) 7d9h 198.18.1.20 aks-userpool-25748257-vmss000000 <none> <none>
kube-system ama-metrics-node-jkhbk 2/2 Running 58 (10h ago) 26d 198.18.1.30 aks-userpool-25748257-vmss000000 <none> <none>
kube-system azure-policy-6664f4bd9d-djsv5 1/1 Running 0 25h 198.18.1.18 aks-userpool-25748257-vmss000000 <none> <none>
kube-system azure-policy-webhook-7f584845c-p8lcm 1/1 Running 0 25h 198.18.1.21 aks-userpool-25748257-vmss000000 <none> <none>
kube-system cloud-node-manager-5b47d 1/1 Running 10 (10h ago) 12d 10.2.0.4 aks-userpool-25748257-vmss000000 <none> <none>
kube-system coredns-789789675-hp75n 1/1 Running 0 25h 198.18.1.31 aks-userpool-25748257-vmss000000 <none> <none>
kube-system csi-azuredisk-node-4jnw5 3/3 Running 63 (10h ago) 26d 10.2.0.4 aks-userpool-25748257-vmss000000 <none> <none>
kube-system csi-azurefile-node-qqtc8 3/3 Running 63 (10h ago) 26d 10.2.0.4 aks-userpool-25748257-vmss000000 <none> <none>
kube-system konnectivity-agent-896ffc9db-s4kls 1/1 Running 0 25h 198.18.1.19 aks-userpool-25748257-vmss000000 <none> <none>
kube-system kube-proxy-ndqzc 1/1 Running 21 (10h ago) 26d 10.2.0.4 aks-userpool-25748257-vmss000000 <none> <none>
kube-system metrics-server-6df4669546-5996g 2/2 Running 0 25h 198.18.1.17 aks-userpool-25748257-vmss000000 <none> <none>
kube-system metrics-server-6df4669546-r94ld 2/2 Running 0 25h 198.18.1.32 aks-userpool-25748257-vmss000000 <none> <none>
tigera-operator tigera-operator-77bd6c5f5-nwz9d 1/1 Running 0 10h 10.2.0.4 aks-userpool-25748257-vmss000000 <none> <none>
Conclusion: The commands in this PR exclude Pods that use hostNetwork, which must not be counted, so they return the correct number of running Pod IPs when troubleshooting an IP allocation issue.
Check that the command can be executed:
kubectl get pods --field-selector spec.nodeName=aks-userpool-25748257-vmss000000,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != 'true').status.podIP' | wc -l
11
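The same hostNetwork filter can be sketched without jq. This is an alternative under stated assumptions, not the command from the PR: it uses kubectl's custom-columns output, which prints <none> when spec.hostNetwork is unset, and the function name count_non_host_network_ips is hypothetical.

```shell
# Count Pod IPs that do not belong to hostNetwork Pods.
# stdin: rows of "<podIP> <hostNetwork>", for example from
#   kubectl get pods -A --no-headers \
#     --field-selector spec.nodeName=<node>,status.phase=Running \
#     -o custom-columns=IP:.status.podIP,HOSTNET:.spec.hostNetwork
# where <node> is a placeholder for the node name.
count_non_host_network_ips() {
    # Skip rows where hostNetwork is "true" and rows without a Pod IP.
    awk '$2 != "true" && $1 != "<none>" { n++ } END { print n + 0 }'
}
```

Piping the kubectl command from the comment into count_non_host_network_ips should give the same kind of count as the jq pipeline, assuming the column layout described in the comment.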
@JoeyC-Dev : Thanks for your contribution! The author(s) have been notified to review your proposed change.
Learn Build status updates of commit 6dcabc0:
:white_check_mark: Validation status: passed
File | Status | Preview URL | Details |
---|---|---|---|
docs/operator-guides/aks/troubleshoot-network-aks.md | :white_check_mark:Succeeded |
For more details, please refer to the build report.
For any questions, please:
- Try searching the learn.microsoft.com contributor guides
- Post your question in the Learn support channel
@mosabami
- Can you review this PR?
- IMPORTANT: When this content is ready to merge, you must add #sign-off in a comment or the approval may get overlooked.
fyi @MicrosoftDocs/patterns-and-practices-team-pr-reviewers
#label:"aq-pr-triaged" @MicrosoftDocs/public-repo-pr-review-team
Update: why there are two kubenet AKS folder locations
a. Create AKS with the Test/Dev preset (Calico disabled)
ls /var/lib/cni/
networks results
ls /var/lib/cni/networks/kubenet/
10.244.1.2 10.244.1.3 last_reserved_ip.0 lock
b. Create AKS with the Prod preset
ls /var/lib/cni/networks/
k8s-pod-network
c. Create AKS with the Test/Dev preset (Calico enabled)
ls /var/lib/cni/networks/
k8s-pod-network
So it looks like enabling Calico changes the folder name.
Hence, I made some changes:
root@aks-agentpool-13518959-vmss000000:/# cd "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")"
root@aks-agentpool-13518959-vmss000000:/var/lib/cni/networks/k8s-pod-network#
Fair enough.
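The directory lookup above can be sketched as a small function that fails loudly when neither folder exists. This is an illustrative sketch: the function name find_ipam_dir is hypothetical, and it assumes the folder is named either kubenet or k8s-pod-network, the two names observed in the tests above.

```shell
# Print the IPAM state directory found under a base path.
# Assumption: the directory is named "kubenet" or "k8s-pod-network".
find_ipam_dir() {
    base=${1:-/var/lib/cni/networks}
    for name in kubenet k8s-pod-network; do
        if [ -d "$base/$name" ]; then
            printf '%s\n' "$base/$name"
            return 0
        fi
    done
    # Neither folder exists, for example on the other network plugins
    # where /var/lib/cni/ contains only "results".
    echo "no known IPAM directory under $base" >&2
    return 1
}
```

Compared with the ls | grep substitution, an explicit check avoids running cd against an empty or partial match when the node uses a different network plugin.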
Additional: results for others (same for all CNI, CNI Overlay, CNI Podsubnet):
ls /var/lib/cni/
results
Learn Build status updates of commit fd259d3:
:white_check_mark: Validation status: passed
File | Status | Preview URL | Details |
---|---|---|---|
docs/operator-guides/aks/troubleshoot-network-aks.md | :white_check_mark:Succeeded |
This comment serves as the justification (or basis) for the changes.
Section 1: Additional verification for the method in kubenet
First, manually create two Pods that fail with the ImagePullBackOff error.
As you can see, 2 IPs are allocated even though the Pods are not running.
Next, create a busybox deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-deployment
spec:
  replicas: 230
  selector:
    matchLabels:
      app: busybox-app
  template:
    metadata:
      labels:
        app: busybox-app
    spec:
      containers:
      - name: busybox
        image: busybox
        command:
        - /bin/sh
        - -c
        - sleep 3600
Verify with kubectl (w/ hostNetwork):
kubectl get pods --field-selector spec.nodeName=aks-agentpool-29718331-vmss000002,status.phase=Running -A -o json | jq -r '.items[] | .status.podIP' | wc -l
248
kubectl result (w/o hostNetwork):
kubectl get pods --field-selector spec.nodeName=aks-agentpool-29718331-vmss000002,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != 'true').status.podIP' | wc -l
242
(Up to this point, we have shown that the Running Pods take only 242 IPs.)
Next:
ls -la "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")" | grep -v -e "lock\|last\|total" -e '\.$' | wc -l
244
Note: the keyword total also needs to be excluded because:
ls -la "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")" | more
total 996
drwxr-xr-x 2 root root 12288 Mar 11 06:22 .
drwxr-xr-x 3 root root 4096 Mar 11 00:59 ..
-rw-r--r-- 1 root root 70 Mar 11 01:00 10.244.0.10
Since 242 + 2 = 244, the calculation approach is correct. Now let's delete the busybox Pods.
kubectl get pods --field-selector spec.nodeName=aks-agentpool-29718331-vmss000002,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != 'true').status.podIP' | wc -l
15
ls -la "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")" | grep -v -e "lock\|last\|total" -e '\.$' | wc -l
17
Still correct, because the 2 error Pods are still using their IPs. Now delete the error nginx Pods.
ls -la "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")" | grep -v -e "lock\|last\|total" -e '\.$' | wc -l
15
The result is correct. This shows that the method is a valid way to check allocated IPs and that its results match the cluster state.
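The reconciliation above (reservation files = Running Pod IPs + IPs held by non-running Pods) can be sketched as a set difference. This is a hedged sketch: the function name unaccounted_ips and the two input files are hypothetical, and each file is assumed to hold one IP per line.

```shell
# List IPs that have a reservation file but no Running Pod.
# $1: file with one reserved IP per line (for example, the IPAM directory listing)
# $2: file with one Running Pod IP per line (for example, the kubectl | jq output)
unaccounted_ips() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -23 "$a" "$b"   # keep lines that appear only in the reserved list
    rm -f "$a" "$b"
}
```

In the test above, this would be expected to print the 2 IPs held by the ImagePullBackOff Pods while they exist, and nothing after they are deleted.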
Section 2: Azure CNI
Under the kubenet architecture, however, there is no API to check the allocated IPs (Kubernetes does not implement one). So we add the Podsubnet AKS scenario, where the allocated IPs can be checked via the API.
For example:
kubectl get nnc -n kube-system -o wide
NAME REQUESTED IPS ALLOCATED IPS SUBNET SUBNET CIDR NC ID NC MODE NC TYPE NC VERSION
aks-agentpool-12345678-vmss000000 32 32 subnet 10.18.0.0/15 559e239d-f744-4f84-bbe0-c7c6fd12ec17 dynamic vnet 1
Then checking current running Pods:
kubectl get pods --field-selector spec.nodeName=aks-agentpool-12345678-vmss000000,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != 'true').status.podIP' | wc -l
17
In the past, I worked with a user whose requested IPs and allocated IPs were both 32 and whose node had only about 21 running Pods, but who could not create any more Pods:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox 'ac1b1354613465324654c1588ac64f1a756aa32f14732246ac4132133ba21364': plugin type='azure-vnet' failed (add): IPAM Invoker Add failed with error: Failed to get IP address from CNS with error: %w: AllocateIPConfig failed: not enough IPs available for 9c6a7f37-dd43-4f7c-a01f-1ff41653609c, waiting on Azure CNS to allocate more with NC Status: , IP config request is [IPConfigRequest: DesiredIPAddress , PodInterfaceID a1876957-eth0, InfraContainerID a1231464635654a123646565456cc146841c1313546a515432161a45a5316541, OrchestratorContext {'PodName':'a_podname','PodNamespace':'my_namespace'}]
This can serve as evidence that there is a bug in the CNI, in which case the user should be asked to submit a support ticket.
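The comparison between allocated IPs and running Pods can be sketched as follows. This is an illustration only: the function name nnc_unused_ips is hypothetical, and it assumes the column order shown in the kubectl get nnc -o wide output above (NAME, REQUESTED IPS, ALLOCATED IPS).

```shell
# Report how many allocated IPs are not used by Running Pods on a node.
# stdin: one row of `kubectl get nnc -n kube-system -o wide --no-headers`
# $1:    number of Running non-hostNetwork Pods on that node
nnc_unused_ips() {
    awk -v running="$1" '{
        print $1, "allocated=" $3, "running=" running, "unused=" ($3 - running)
    }'
}
```

A gap alone is not proof of a problem. The signal described above is a gap combined with the "not enough IPs available" error when new Pods are created.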
Learn Build status updates of commit 6eee1b3:
:white_check_mark: Validation status: passed
File | Status | Preview URL | Details |
---|---|---|---|
docs/operator-guides/aks/troubleshoot-network-aks.md | :white_check_mark:Succeeded |
Learn Build status updates of commit 8c49776:
:white_check_mark: Validation status: passed
File | Status | Preview URL | Details |
---|---|---|---|
docs/operator-guides/aks/troubleshoot-network-aks.md | :white_check_mark:Succeeded |
Learn Build status updates of commit 3a56525:
:white_check_mark: Validation status: passed
File | Status | Preview URL | Details |
---|---|---|---|
docs/operator-guides/aks/troubleshoot-network-aks.md | :white_check_mark:Succeeded |
PnP #sign-off
#remove-label:"pnp-review-in-progress" #remove-label:"do-not-merge" #label:"ready-to-merge"
Invalid command: '#sign-off'. Only the assigned author of one or more file in this PR can sign off. @mosabami