
Update commands used for TS in troubleshoot-network-aks.md

Open JoeyC-Dev opened this issue 11 months ago • 4 comments

Describe the summary, scope, and intent of this PR:
The current commands are outdated:

  1. The path is not correct for kubenet, which is the scenario used in this article.
  2. AKS no longer uses Docker; it uses containerd, so the commands need to be modified.

Insert link(s) to any related work item(s) or supporting detail:

Kubernetes support for Docker via dockershim has been removed: https://kubernetes.io/blog/2020/12/02/dont-panic-kubernetes-and-docker/

The change to point 1 includes:

  1. Even the original command in the old environment is wrong, because if you execute that command, you can see:

     /var/lib/cni/networks/k8s-pod-network# ls -la
     total 56
     drwxr-xr-x 2 root root 4096 Mar 5 11:34 .
     drwxr-xr-x 3 root root 4096 Feb 8 04:38 ..
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.17
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.18
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.19
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.20
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.21
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.23
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.25
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.27
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.30
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.31
     -rw-r--r-- 1 root root 70 Mar 5 00:46 198.18.1.32
     -rw-r--r-- 1 root root 12 Mar 5 11:34 last_reserved_ip.0
     -rwxr-x--- 1 root root 0 Feb 8 04:38 lock

The lines drwxr-xr-x 2 root root 4096 Mar 5 11:34 . and drwxr-xr-x 3 root root 4096 Feb 8 04:38 .. are counted as well, which inflates the IP count by 2.

So, based on the current environment, I exclude the lines that end with . as well as the last_reserved_ip.0 and lock entries, in order to get the correct number of allocated IPs. This should be sufficient.
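As a rough sketch of the same idea, the count can also be taken by keeping only the entries whose names look like IPv4 addresses, instead of listing what to exclude. The function name count_allocated_ips is hypothetical and not part of this PR; it assumes the directory layout shown above (one file per allocated IP).

```shell
# Hypothetical helper (not part of the PR): count the allocated-IP files
# in a CNI IPAM directory such as /var/lib/cni/networks/k8s-pod-network.
count_allocated_ips() {
  dir="$1"
  # Plain `ls` (without -la) prints no "total" line and no "." or ".." entries.
  # The pattern keeps only names that look like IPv4 addresses, so
  # last_reserved_ip.0 and lock are skipped. grep -c prints the match count.
  ls "$dir" | grep -Ec '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}

# Example (run on the node):
#   count_allocated_ips /var/lib/cni/networks/k8s-pod-network
```

An allow-list like this avoids having to extend the exclusion list whenever another bookkeeping file shows up in the directory.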

Some output to verify the results (for point 2):

kubectl get pods --field-selector spec.nodeName=aks-userpool-25748257-vmss000000,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != true).status.podIP'
198.18.1.25
198.18.1.23
198.18.1.27
198.18.1.20
198.18.1.30
198.18.1.18
198.18.1.21
198.18.1.31
198.18.1.19
198.18.1.17
198.18.1.32
kubectl get pods --field-selector spec.nodeName=aks-userpool-25748257-vmss000000,status.phase=Running -A -o wide
NAMESPACE           NAME                                       READY   STATUS    RESTARTS       AGE    IP            NODE                               NOMINATED NODE   READINESS GATES
calico-system       calico-kube-controllers-776b76df8f-lmbvk   1/1     Running   0              25h    198.18.1.25   aks-userpool-25748257-vmss000000   <none>           <none>
calico-system       calico-node-xwpb6                          1/1     Running   0              8h     10.2.0.4      aks-userpool-25748257-vmss000000   <none>           <none>
calico-system       calico-typha-7b8cf7bb4b-9j727              1/1     Running   0              25h    10.2.0.4      aks-userpool-25748257-vmss000000   <none>           <none>
default             test-anykap-6xp8v                          1/1     Running   5 (10h ago)    5d2h   10.2.0.4      aks-userpool-25748257-vmss000000   <none>           <none>
gatekeeper-system   gatekeeper-audit-59875b6cdc-gtwtq          1/1     Running   0              25h    198.18.1.23   aks-userpool-25748257-vmss000000   <none>           <none>
gatekeeper-system   gatekeeper-controller-58498fccdc-d48ck     1/1     Running   0              25h    198.18.1.27   aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         ama-logs-tqkm5                             3/3     Running   21 (10h ago)   7d9h   198.18.1.20   aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         ama-metrics-node-jkhbk                     2/2     Running   58 (10h ago)   26d    198.18.1.30   aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         azure-policy-6664f4bd9d-djsv5              1/1     Running   0              25h    198.18.1.18   aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         azure-policy-webhook-7f584845c-p8lcm       1/1     Running   0              25h    198.18.1.21   aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         cloud-node-manager-5b47d                   1/1     Running   10 (10h ago)   12d    10.2.0.4      aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         coredns-789789675-hp75n                    1/1     Running   0              25h    198.18.1.31   aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         csi-azuredisk-node-4jnw5                   3/3     Running   63 (10h ago)   26d    10.2.0.4      aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         csi-azurefile-node-qqtc8                   3/3     Running   63 (10h ago)   26d    10.2.0.4      aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         konnectivity-agent-896ffc9db-s4kls         1/1     Running   0              25h    198.18.1.19   aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         kube-proxy-ndqzc                           1/1     Running   21 (10h ago)   26d    10.2.0.4      aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         metrics-server-6df4669546-5996g            2/2     Running   0              25h    198.18.1.17   aks-userpool-25748257-vmss000000   <none>           <none>
kube-system         metrics-server-6df4669546-r94ld            2/2     Running   0              25h    198.18.1.32   aks-userpool-25748257-vmss000000   <none>           <none>
tigera-operator     tigera-operator-77bd6c5f5-nwz9d            1/1     Running   0              10h    10.2.0.4      aks-userpool-25748257-vmss000000   <none>           <none>

Conclusion: The commands provided in this PR exclude the Pods that use hostNetwork, which report the node IP (10.2.0.4 above) rather than a Pod IP and should not be counted. This gives the correct number of running Pod IPs for troubleshooting IP allocation issues.

Check that the command can be executed:

kubectl get pods --field-selector spec.nodeName=aks-userpool-25748257-vmss000000,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != true).status.podIP' | wc -l
11

JoeyC-Dev avatar Mar 05 '24 11:03 JoeyC-Dev

@JoeyC-Dev : Thanks for your contribution! The author(s) have been notified to review your proposed change.

prmerger-automator[bot] avatar Mar 05 '24 11:03 prmerger-automator[bot]

Learn Build status updates of commit 6dcabc0:

:white_check_mark: Validation status: passed

File Status Preview URL Details
docs/operator-guides/aks/troubleshoot-network-aks.md :white_check_mark:Succeeded

For more details, please refer to the build report.

For any questions, please:

@mosabami

  • Can you review this PR?
  • IMPORTANT: When this content is ready to merge, you must add #sign-off in a comment or the approval may get overlooked.

fyi @MicrosoftDocs/patterns-and-practices-team-pr-reviewers

#label:"aq-pr-triaged" @MicrosoftDocs/public-repo-pr-review-team

Jak-MS avatar Mar 05 '24 18:03 Jak-MS

Update: why kubenet AKS nodes can have two different folder locations.

a. Create AKS with preset Test/Dev (Calico disabled)

ls /var/lib/cni/
networks  results
ls /var/lib/cni/networks/kubenet/
10.244.1.2  10.244.1.3  last_reserved_ip.0  lock

b. Create AKS with preset Prod

ls /var/lib/cni/networks/                
k8s-pod-network

c. Create AKS with preset Test/Dev (but enabled Calico)

ls /var/lib/cni/networks/
k8s-pod-network

So it looks like Calico changes the folder name. Hence, I changed the command so that it handles both names:

root@aks-agentpool-13518959-vmss000000:/# cd "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")"
root@aks-agentpool-13518959-vmss000000:/var/lib/cni/networks/k8s-pod-network# 

Fair enough.
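The directory lookup above can also be sketched as a small function that checks for each known folder name explicitly. The name find_ipam_dir is hypothetical; the two folder names are the ones observed in the tests above, and the base path is a parameter only so the function can be tried outside a node.

```shell
# Hypothetical helper (not part of the PR): print the IPAM directory in use.
# Assumes only the two folder names observed above: kubenet (no Calico)
# and k8s-pod-network (Calico enabled).
find_ipam_dir() {
  base="${1:-/var/lib/cni/networks}"
  for name in kubenet k8s-pod-network; do
    if [ -d "$base/$name" ]; then
      printf '%s\n' "$base/$name"
      return 0
    fi
  done
  # Neither folder exists, for example on Azure CNI nodes where only
  # /var/lib/cni/results is present.
  return 1
}

# Example (run on the node):
#   cd "$(find_ipam_dir)"
```

Unlike the ls | grep form, this returns a non-zero status when neither folder exists, so a calling script can detect that case instead of changing into /var/lib/cni/networks/ itself.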

Additional: results for the other network plugins (the same for Azure CNI, CNI Overlay, and CNI Podsubnet):

ls /var/lib/cni/   
results

JoeyC-Dev avatar Mar 06 '24 12:03 JoeyC-Dev

Learn Build status updates of commit fd259d3:

:white_check_mark: Validation status: passed

File Status Preview URL Details
docs/operator-guides/aks/troubleshoot-network-aks.md :white_check_mark:Succeeded

For more details, please refer to the build report.

For any questions, please:

This comment serves as the justification (basis) for the changes.

Section 1: Additional verification for the method in kubenet

First, manually create two Pods that fail with an ImagePullBackOff error.

Two IPs are allocated even though the Pods are not running.

Next, create a busybox deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-deployment
spec:
  replicas: 230
  selector:
    matchLabels:
      app: busybox-app
  template:
    metadata:
      labels:
        app: busybox-app
    spec:
      containers:
        - name: busybox
          image: busybox
          command:
            - /bin/sh
            - -c
            - sleep 3600

Verify from kubectl (with hostNetwork Pods included):

kubectl get pods --field-selector spec.nodeName=aks-agentpool-29718331-vmss000002,status.phase=Running -A -o json | jq -r '.items[] | .status.podIP' | wc -l
248

kubectl result (without hostNetwork Pods):

kubectl get pods --field-selector spec.nodeName=aks-agentpool-29718331-vmss000002,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != true).status.podIP' | wc -l
242

(Up to this point, we have shown that the running Pods take only 242 IPs.)

Next:

ls -la "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")" | grep -v -e "lock\|last\|total" -e '\.$' | more | wc -l
244

Note: the keyword total also needs to be excluded, because ls -la prints it as the first line:

 ls -la "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")" | more
total 996
drwxr-xr-x 2 root root 12288 Mar 11 06:22 .
drwxr-xr-x 3 root root  4096 Mar 11 00:59 ..
-rw-r--r-- 1 root root    70 Mar 11 01:00 10.244.0.10

Since 242 + 2 = 244 (the 242 running Pods plus the 2 ImagePullBackOff Pods, which hold IPs without running), the counting approach is correct. Now let's delete the busybox Pods.

kubectl get pods --field-selector spec.nodeName=aks-agentpool-29718331-vmss000002,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != true).status.podIP' | wc -l
15
ls -la "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")" | grep -v -e "lock\|last\|total" -e '\.$' | wc -l
17

Still correct: 15 + 2 = 17, because the 2 failing Pods are still holding IPs. Now delete the failing nginx Pods.

ls -la "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")" | grep -v -e "lock\|last\|total" -e '\.$' | more | wc -l
15

The result is correct: the file count now matches the 15 running Pods. This shows that counting the files is a valid way to check the allocated IPs.
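Beyond comparing the two counts, the addresses themselves can be compared to see which IPs are held without a running Pod. The following is a minimal sketch, assuming the two lists have already been saved to files with one IP per line (one from the node's IPAM directory, one from the kubectl and jq command). The function name unaccounted_ips is hypothetical.

```shell
# Hypothetical helper (not part of the PR): print the IPs that have a file
# in the IPAM directory but do not belong to any running Pod.
#   $1: file with one IP per line, taken from the node's IPAM directory
#   $2: file with one IP per line, taken from the kubectl | jq command
unaccounted_ips() {
  # -F: fixed strings, -x: match whole lines (so 10.244.0.1 does not
  # match 10.244.0.10), -v: keep non-matching lines, -f: read patterns
  # from the second file.
  grep -vxFf "$2" "$1"
}
```

In the test above, this would list the two IPs still held by the ImagePullBackOff Pods while the counts were 17 and 15.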

Section 2: Azure CNI

However, under the kubenet architecture there is no API for checking allocated IPs (Kubernetes does not implement one). So we add the Podsubnet AKS scenario, where the allocated IPs can be checked via the API.

For example:

kubectl get nnc -n kube-system -o wide
NAME                               REQUESTED IPS  ALLOCATED IPS  SUBNET  SUBNET CIDR   NC ID                                 NC MODE  NC TYPE  NC VERSION
aks-agentpool-12345678-vmss000000  32             32             subnet  10.18.0.0/15  559e239d-f744-4f84-bbe0-c7c6fd12ec17  dynamic  vnet     1

Then check the currently running Pods:

kubectl get pods --field-selector spec.nodeName=aks-agentpool-12345678-vmss000000,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != true).status.podIP' | wc -l
17

In the past, I worked with a user whose requested IPs and allocated IPs were both 32 and who had only about 21 running Pods, yet could not create any more Pods. The error was:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox 'ac1b1354613465324654c1588ac64f1a756aa32f14732246ac4132133ba21364': plugin type='azure-vnet' failed (add): IPAM Invoker Add failed with error: Failed to get IP address from CNS with error: %w: AllocateIPConfig failed: not enough IPs available for 9c6a7f37-dd43-4f7c-a01f-1ff41653609c, waiting on Azure CNS to allocate more with NC Status: , IP config request is [IPConfigRequest: DesiredIPAddress , PodInterfaceID a1876957-eth0, InfraContainerID a1231464635654a123646565456cc146841c1313546a515432161a45a5316541, OrchestratorContext {'PodName':'a_podname','PodNamespace':'my_namespace'}]

A mismatch like this can be treated as evidence of a bug in the CNI, in which case the user should be asked to submit a support ticket.
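The comparison described in this section (allocated IPs from the nnc output versus running Pod IPs) can be sketched as two small helpers. Both function names are hypothetical, and the parsing assumes the column layout shown in the sample output above, where the node name, requested IPs, and allocated IPs are the first three fields of each data row.

```shell
# Hypothetical helper (not part of the PR): read the output of
# `kubectl get nnc -n kube-system -o wide` on stdin and print the
# ALLOCATED IPS value for the node named in $1.
# Assumes the layout shown above: name, requested, allocated come first.
nnc_allocated_ips() {
  awk -v node="$1" 'NR > 1 && $1 == node { print $3 }'
}

# Hypothetical helper: allocated IPs ($1) minus running Pod IPs ($2).
unused_allocated_ips() {
  echo $(( $1 - $2 ))
}

# Example:
#   kubectl get nnc -n kube-system -o wide | nnc_allocated_ips aks-agentpool-12345678-vmss000000
#   unused_allocated_ips 32 17
```

With the numbers above (32 allocated, 17 running), 15 allocated IPs are unused. If new Pods still fail with the "not enough IPs available" error while that difference is well above zero, it matches the situation described here.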

JoeyC-Dev avatar Mar 11 '24 07:03 JoeyC-Dev

Learn Build status updates of commit 6eee1b3:

:white_check_mark: Validation status: passed

File Status Preview URL Details
docs/operator-guides/aks/troubleshoot-network-aks.md :white_check_mark:Succeeded

For more details, please refer to the build report.

For any questions, please:

Learn Build status updates of commit 8c49776:

:white_check_mark: Validation status: passed

File Status Preview URL Details
docs/operator-guides/aks/troubleshoot-network-aks.md :white_check_mark:Succeeded

For more details, please refer to the build report.

For any questions, please:

Learn Build status updates of commit 3a56525:

:white_check_mark: Validation status: passed

File Status Preview URL Details
docs/operator-guides/aks/troubleshoot-network-aks.md :white_check_mark:Succeeded

For more details, please refer to the build report.

For any questions, please:

PnP #sign-off

#remove-label:"pnp-review-in-progress" #remove-label:"do-not-merge" #label:"ready-to-merge"

ckittel avatar Mar 11 '24 21:03 ckittel

Invalid command: '#sign-off'. Only the assigned author of one or more file in this PR can sign off. @mosabami

prmerger-automator[bot] avatar Mar 11 '24 21:03 prmerger-automator[bot]