cluster-api-provider-azure

cloud-node-manager-windows in CrashLoopBackOff

Open himanshuz2 opened this issue 3 years ago • 5 comments

/kind bug

[Before submitting an issue, have you checked the Troubleshooting Guide?]

What steps did you take and what happened: Deployed the Azure Cloud Provider:

```shell
helm install --repo https://raw.githubusercontent.com/kubernetes-sigs/cloud-provider-azure/master/helm/repo cloud-provider-azure --generate-name --set infra.clusterName=${CLUSTER_NAME}
```

The Windows node-manager pod then went into CrashLoopBackOff:

```
cloud-node-manager-windows-n4hks   0/1   CrashLoopBackOff   38 (3h8m ago)
```

What did you expect to happen:

cloud-node-manager-windows runs without CrashLoopBackOff.

Anything else you would like to add: Here is the description of the pod and its logs.

```
Name:                 cloud-node-manager-windows-n4hks
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 win-p-win000000/10.1.0.6
Start Time:           Wed, 01 Jun 2022 10:22:41 -0400
Labels:               controller-revision-hash=5dd46f6bfb
                      k8s-app=cloud-node-manager-windows
                      pod-template-generation=1
Annotations:          cluster-autoscaler.kubernetes.io/daemonset-pod: true
                      cni.projectcalico.org/containerID: 8bf61dbeba6c135c9de54edfbf422ffc5fee6a353502c941600f7727cd0f9414
                      cni.projectcalico.org/podIP: 192.168.152.140/32
                      cni.projectcalico.org/podIPs: 192.168.152.140/32
Status:               Running
IP:                   192.168.152.140
IPs:
  IP:  192.168.152.140
Controlled By:  DaemonSet/cloud-node-manager-windows
Containers:
  cloud-node-manager:
    Container ID:  containerd://233de6e3a448a37782bede5e244b325fb46a18862857260c7ddb79805641b47b
    Image:         mcr.microsoft.com/oss/kubernetes/azure-cloud-node-manager:v1.23.11
    Image ID:      mcr.microsoft.com/oss/kubernetes/azure-cloud-node-manager@sha256:075ea1f8270312350f1396ab6677251e803e61a523822d5abfa5e6acd180cfab
    Port:
    Host Port:
    Command:
      /cloud-node-manager.exe
    Args:
      --node-name=$(NODE_NAME)
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 02 Jun 2022 13:05:36 -0400
      Finished:     Thu, 02 Jun 2022 13:05:46 -0400
    Ready:          False
    Restart Count:  38
    Limits:
      cpu:     2
      memory:  512Mi
    Requests:
      cpu:     50m
      memory:  50Mi
    Environment:
      NODE_NAME:  (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-98zvh (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-98zvh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=windows
Tolerations:     :NoExecute op=Exists
                 :NoSchedule op=Exists
                 CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason   Age                      From     Message
  ----     ------   ----                     ----     -------
  Warning  BackOff  3h12m (x545 over 5h16m)  kubelet  Back-off restarting failed container
```

```
root@CAPZ-Management:/home/bmadministrator/.kube# kubectl --kubeconfig=config logs cloud-node-manager-windows-n4hks -n kube-system
Failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: Get "https://10.96.0.1:443/healthz": dial tcp 10.96.0.1:443: connectex: A socket operation was attempted to an unreachable network.
```
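The log shows the TCP connection itself failing ("connectex: ... unreachable network"), i.e. the pod cannot route to the apiserver Service VIP (10.96.0.1:443) at all, rather than an authentication or TLS problem. A minimal sketch of that first-hop check (a hypothetical helper, not part of cloud-node-manager) can be run from inside the affected pod to confirm whether the pod network can reach the Service CIDR:

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This mirrors only the transport-level step of the pod's /healthz probe:
    if even the TCP handshake fails (as in the "unreachable network" error
    above), the problem is pod-network routing (CNI), not the apiserver.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, unreachable nets
        return False


# Example: run inside the affected pod against the kubernetes Service VIP.
# can_reach("10.96.0.1", 443)
```

If this returns False from the Windows pod but True from a Linux pod on the same cluster, the Service routing on the Windows node (CNI/kube-proxy) is the place to look.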

Environment:

  • cluster-api-provider-azure version: cluster.x-k8s.io/v1beta1 1.3.1
  • Kubernetes version: (use kubectl version): 1.23.6
  • OS (e.g. from /etc/os-release): the control plane runs Ubuntu Jammy Jellyfish, but the node OS is the Windows Server 2019 CNCF image

himanshuz2 avatar Jun 02 '22 20:06 himanshuz2

https://github.com/kubernetes-sigs/cloud-provider-azure/issues/1807

himanshuz2 avatar Jun 02 '22 20:06 himanshuz2

kubernetes-sigs/cloud-provider-azure regularly tests out-of-tree with both Linux and Windows.

@lzhecheng @nilo19 are you regularly seeing any of the above symptoms in any tests?

jackfrancis avatar Jun 23 '22 15:06 jackfrancis

Is it because of Calico CNI?

himanshuz2 avatar Jun 23 '22 15:06 himanshuz2

@jackfrancis Actually, those Windows tests are still blocked by a Calico installation failure right now... We will check whether this situation occurs once your "helm install calico" PR is merged. Besides that, so far we are not aware of this symptom.

lzhecheng avatar Jun 29 '22 02:06 lzhecheng

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 27 '22 03:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 27 '22 03:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 26 '22 04:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 26 '22 04:11 k8s-ci-robot