
cloud-controller-manager - CrashLoopBackOff

RizwanaVyoma opened this issue 8 months ago • 5 comments

I am using kOps with a GCE cluster.

A recent cluster update automatically changed the cloud-controller-manager image.

Change log:

ManagedFile/cluster.k8s.local-addons-gcp-cloud-controller.addons.k8s.io-k8s-1.23
  Contents
              name: KUBERNETES_SERVICE_HOST
              value: 127.0.0.1
    +         image: gcr.io/k8s-staging-cloud-provider-gcp/cloud-controller-manager:master@sha256:b3ac9d2d9cff8d736473ab0297c57dfb1924b50758e5cc75a80bacd9d6568f8a
    -         image: gcr.io/k8s-staging-cloud-provider-gcp/cloud-controller-manager:master@sha256:f575cc54d0ac3abf0c4c6e8306d6d809424e237e51f4a9f74575502be71c607c
              imagePullPolicy: IfNotPresent
              livenessProbe:

Because of this newly updated image, gcr.io/k8s-staging-cloud-provider-gcp/cloud-controller-manager:master@sha256:b3ac9d2d9cff8d736473ab0297c57dfb1924b50758e5cc75a80bacd9d6568f8a, the cloud-controller-manager pod is crashing.

Log message in the pod:

flag provided but not defined: -allocate-node-cidrs
Usage of /go-runner:
  -also-stdout
        useful with log-file, log to standard output as well as the log file
  -log-file string
        If non-empty, save stdout to this file
  -redirect-stderr
        treat stderr same as stdout (default true)
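
For anyone hitting the same crash, the crashed container's output and events can be pulled with standard kubectl commands. A minimal sketch (the pod name is a placeholder; use the one from your cluster):

$ kubectl -n kube-system get pods | grep cloud-controller-manager
$ kubectl -n kube-system logs <pod-name> --previous
$ kubectl -n kube-system describe pod <pod-name>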

Checking the image with

$ docker run --rm gcr.io/k8s-staging-cloud-provider-gcp/cloud-controller-manager:master@sha256:b3ac9d2d9cff8d736473ab0297c57dfb1924b50758e5cc75a80bacd9d6568f8a --help

did not list any of the flags below, which the cloud-controller-manager DaemonSet passes as args:

args:
  - --allocate-node-cidrs=true
  - --cidr-allocator-type=CloudAllocator
  - --cluster-cidr=************
  - --cluster-name=*************
  - --controllers=*
  - --leader-elect=true
  - --v=2
  - --cloud-provider=gce
  - --use-service-account-credentials=true
  - --cloud-config=/etc/kubernetes/cloud.config
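
One way to confirm the mismatch is to diff the DaemonSet's args against the flags the image actually accepts. A minimal sketch, assuming the DaemonSet is named cloud-controller-manager in kube-system:

$ IMAGE=gcr.io/k8s-staging-cloud-provider-gcp/cloud-controller-manager:master@sha256:b3ac9d2d9cff8d736473ab0297c57dfb1924b50758e5cc75a80bacd9d6568f8a
$ kubectl -n kube-system get ds cloud-controller-manager -o jsonpath='{.spec.template.spec.containers[0].args}'
$ docker run --rm "$IMAGE" --help 2>&1 | grep 'allocate-node-cidrs' || echo "flag not supported by this image"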

Please let us know how to fix this issue, and how to avoid these automatic image updates. The new image is breaking the cluster.

RizwanaVyoma avatar Apr 08 '25 11:04 RizwanaVyoma

I am encountering this issue too, with the following initial output pointing to the same cloud-controller-manager problem reported by @RizwanaVyoma:

$ kops version
Client version: 1.31.0

$ kops validate cluster --wait 15m
...

Validation Failed
W0408 17:05:03.705924 1052446 validate_cluster.go:230] (will retry): cluster not yet healthy
I0408 17:05:14.154048 1052446 gce_cloud.go:307] Scanning zones: [us-east1-b us-east1-c us-east1-d]
INSTANCE GROUPS
NAME                            ROLE            MACHINETYPE     MIN     MAX     SUBNETS
control-plane-us-east1-b        ControlPlane    n1-standard-4   1       1       us-east1
nodes-us-east1-b                Node            n2-standard-8   2       2       us-east1

NODE STATUS
NAME    ROLE    READY

VALIDATION ERRORS
KIND    NAME                                                                                                                                    MESSAGE
Machine https://www.googleapis.com/compute/v1/projects/myprojectname/zones/us-east1-b/instances/control-plane-us-east1-b-mxhm        machine "https://www.googleapis.com/compute/v1/projects/myprojectname/zones/us-east1-b/instances/control-plane-us-east1-b-mxhm" has not yet joined cluster
Machine https://www.googleapis.com/compute/v1/projects/myprojectname/zones/us-east1-b/instances/nodes-us-east1-b-h9tk                machine "https://www.googleapis.com/compute/v1/projects/myprojectname/zones/us-east1-b/instances/nodes-us-east1-b-h9tk" has not yet joined cluster
Machine https://www.googleapis.com/compute/v1/projects/myprojectname/zones/us-east1-b/instances/nodes-us-east1-b-xt6w                machine "https://www.googleapis.com/compute/v1/projects/myprojectname/zones/us-east1-b/instances/nodes-us-east1-b-xt6w" has not yet joined cluster
Pod     kube-system/cloud-controller-manager-pzzk9                                                                                              system-cluster-critical pod "cloud-controller-manager-pzzk9" is not ready (cloud-controller-manager)
Pod     kube-system/coredns-autoscaler-56467f9769-ltzwk                                                                                         system-cluster-critical pod "coredns-autoscaler-56467f9769-ltzwk" is pending
Pod     kube-system/coredns-db7b68989-59cw7                                                                                                     system-cluster-critical pod "coredns-db7b68989-59cw7" is pending

Validation Failed
W0408 17:05:14.944295 1052446 validate_cluster.go:230] (will retry): cluster not yet healthy
Error: validation failed: wait time exceeded during validation

nevdullcode avatar Apr 08 '25 17:04 nevdullcode

Can you try setting this in the cluster spec and running kops update cluster --yes to see if that fixes the issue?

spec:
  cloudControllerManager:
    image: gcr.io/k8s-staging-cloud-provider-gcp/cloud-controller-manager:v32.2.4
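
For reference, a minimal sequence to apply the pin, assuming the state store and cluster name are already set (e.g. via KOPS_STATE_STORE and KOPS_CLUSTER_NAME):

$ kops edit cluster            # add the cloudControllerManager.image field shown above
$ kops update cluster --yes    # re-render and apply the addon manifest
$ kops validate cluster --wait 10m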

rifelpet avatar Apr 09 '25 00:04 rifelpet

@rifelpet Yes, that works. Cluster validation now completes successfully. Thank you!

nevdullcode avatar Apr 09 '25 15:04 nevdullcode

This was also fixed upstream in https://github.com/kubernetes/cloud-provider-gcp/pull/842.

hakman avatar Apr 27 '25 05:04 hakman

Confirming the workaround suggested by @rifelpet is no longer required. Thanks, all!

nevdullcode avatar May 12 '25 15:05 nevdullcode

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 10 '25 15:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 09 '25 15:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Oct 09 '25 16:10 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Oct 09 '25 16:10 k8s-ci-robot