
CAPG: Upstream CCM manifest doesn't work

[Open] jayesh-srivastava opened this issue 10 months ago • 11 comments

Tried deploying the CCM in a CAPG cluster using the provided CCM manifest from https://github.com/kubernetes/cloud-provider-gcp/blob/master/deploy/packages/default/manifest.yaml. The CCM pod is stuck in CrashLoopBackOff with this error:

unable to load configmap based request-header-client-ca-file: Get "https://127.0.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 127.0.0.1:443: connect: connection refused

jayesh-srivastava avatar Apr 19 '24 05:04 jayesh-srivastava

This issue is currently awaiting triage.

If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 19 '24 05:04 k8s-ci-robot

Please use:

  command: ['/usr/local/bin/cloud-controller-manager']
  args:
  - --cloud-provider=gce
  - --leader-elect=true
  - --use-service-account-credentials

and remove the env.
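For reference, a minimal container fragment with that change might look roughly like this (only a sketch; the image reference is illustrative and the rest of the upstream manifest is unchanged):

      containers:
        - name: cloud-controller-manager
          image: k8scloudprovidergcp/cloud-controller-manager:latest   # pin to the tag you actually use
          command: ['/usr/local/bin/cloud-controller-manager']
          args:
            - --cloud-provider=gce
            - --leader-elect=true
            - --use-service-account-credentials
          # no env block here: the env entries in the upstream manifest (which appear to
          # point the API client at 127.0.0.1) are what trigger the "connection refused"
          # error on a CAPG cluster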

mcbenjemaa avatar Apr 26 '24 10:04 mcbenjemaa

/kind support

mcbenjemaa avatar Apr 26 '24 10:04 mcbenjemaa

Hi @mcbenjemaa, thanks for the help. The CCM pod is up now with these args:

  - args:
    - --cloud-provider=gce
    - --leader-elect=true
    - --use-service-account-credentials
    - --allocate-node-cidrs=true
    - --cluster-cidr=192.168.0.0/16
    - --configure-cloud-routes=false

One more question: I see the cloud-controller-manager image being used is k8scloudprovidergcp/cloud-controller-manager:latest. How can I use Kubernetes-version-specific images for the CCM?

jayesh-srivastava avatar Apr 29 '24 21:04 jayesh-srivastava

You may have to build the image while the release process is being revamped; there are instructions in the README.

The :latest tag is aimed at CI / testing of the project itself I think.
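If you build and push your own image per the README, pinning it is just a matter of overriding the image reference in the manifest, something like (registry and tag here are placeholders):

          image: <your-registry>/cloud-controller-manager:<your-tag>
          imagePullPolicy: IfNotPresent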

/retitle CAPG: Upstream CCM manifest doesn't work

I don't think the manifest is necessarily meant to work with CAPG, I would expect CAPG to handle deploying everything?

Otherwise this may be in scope for #686

BenTheElder avatar May 07 '24 21:05 BenTheElder

Self-deployed CCM: I got this error:

message="Error syncing load balancer: failed to ensure load balancer: instance not found"

mcbenjemaa avatar May 14 '24 13:05 mcbenjemaa

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 12 '24 13:08 k8s-triage-robot

Something similar here: I'm trying to deploy the Cloud Controller Manager (CCM) and I'm encountering the following error:

I0823 08:10:42.838284       1 node_controller.go:391] Initializing node minplus0-md-2-vbvmr-856l7 with cloud provider
I0823 08:10:42.920926       1 gen.go:15649] GCEInstances.Get(context.Background.WithDeadline(2024-08-23 09:10:42.83965981 +0000 UTC m=+3629.567729051 [59m59.918720336s]), Key{"minplus0-md-2-vbvmr-856l7", zone: "europe-west4-b"}) = <nil>, googleapi: Error 404: The resource 'projects/clusterapi-369611/zones/europe-west4-b/instances/minplus0-md-2-vbvmr-856l7' was not found, notFound
E0823 08:10:42.921062       1 node_controller.go:213] error syncing 'minplus0-md-2-vbvmr-856l7': failed to get instance metadata for node minplus0-md-2-vbvmr-856l7: failed to get instance ID from cloud provider: instance not found, requeuing

I don't understand why the CCM is adding the zone label as:

I0823 08:10:41.974944       1 node_controller.go:493] Adding node label from cloud provider: beta.kubernetes.io/instance-type=n2-standard-2
I0823 08:10:41.974950       1 node_controller.go:494] Adding node label from cloud provider: node.kubernetes.io/instance-type=n2-standard-2
I0823 08:10:41.974954       1 node_controller.go:505] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=europe-west4-b
I0823 08:10:41.974958       1 node_controller.go:506] Adding node label from cloud provider: topology.kubernetes.io/zone=europe-west4-b
I0823 08:10:41.974963       1 node_controller.go:516] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=europe-west4
I0823 08:10:41.974968       1 node_controller.go:517] Adding node label from cloud provider: topology.kubernetes.io/region=europe-west4

The correct zone should be gce://clusterapi-369611/europe-west4-c/minplus0-md-2-vbvmr-856l7. This is how I'm deploying CCM:

        - name: cloud-controller-manager
          image: k8scloudprovidergcp/cloud-controller-manager:latest
          imagePullPolicy: IfNotPresent
          # ko puts it somewhere else... command: ['/usr/local/bin/cloud-controller-manager']
          command: ['/usr/local/bin/cloud-controller-manager']
          args:
            - --cloud-provider=gce  # Add your own cloud provider here!
            - --leader-elect=true
            - --use-service-account-credentials
            # these flags will vary for every cloud provider
            - --allocate-node-cidrs=true
            - --configure-cloud-routes=true
            - --cluster-cidr=192.168.0.0/16
            - --v=4
          livenessProbe:
            failureThreshold: 3
            httpGet:
              host: 127.0.0.1
              path: /healthz
              port: 10258
              scheme: HTTPS
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 15
          resources:
            requests:
              cpu: "200m"
          volumeMounts:
            - mountPath: /etc/kubernetes/cloud.config
              name: cloudconfig
              readOnly: true
      hostNetwork: true
      priorityClassName: system-cluster-critical
      volumes:
        - hostPath:
            path: /etc/kubernetes/cloud.config
            type: ""
          name: cloudconfig

esierra-stratio avatar Aug 23 '24 08:08 esierra-stratio

The correct zone should be gce://clusterapi-369611/europe-west4-c/minplus0-md-2-vbvmr-856l7.

what do you mean by correct zone there?

the instance url is https://www.googleapis.com/compute/v1/projects/{PROJECT}/zones/{ZONE}/instances/{VM_INSTANCE}

that is the providerId, isn't it?
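For reference, a quick way to check what providerID was actually set on the node (node name taken from the logs above):

    kubectl get node minplus0-md-2-vbvmr-856l7 -o jsonpath='{.spec.providerID}'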

aojea avatar Aug 23 '24 22:08 aojea

The issue is that the GCEInstances.Get function constructs the provider ID with the wrong zone. It assumes the instance lives in the same zone as the control-plane node where the CCM is running (in this case, europe-west4-b) instead of the node's actual zone, europe-west4-c. That's why the CCM couldn't find the instance.

Is there any way to make the CCM check every single zone? Maybe a multizone option or something similar?

esierra-stratio avatar Aug 26 '24 07:08 esierra-stratio

Solved!

          args:
            - --cloud-provider=gce  # Add your own cloud provider here!
            - --leader-elect=true
            - --use-service-account-credentials
            # these flags will vary for every cloud provider
            - --allocate-node-cidrs=true
            - --cluster-cidr=192.168.0.0/16
            - --v=4
            - --cloud-config=/etc/kubernetes/gce.conf
          livenessProbe:
            failureThreshold: 3
            httpGet:
              host: 127.0.0.1
              path: /healthz
              port: 10258
              scheme: HTTPS
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 15
          resources:
            requests:
              cpu: "200m"
          volumeMounts:
            - mountPath: /etc/kubernetes/gce.conf
              name: cloudconfig
              readOnly: true
      hostNetwork: true
      priorityClassName: system-cluster-critical
      volumes:
        - hostPath:
            path: /etc/kubernetes/gce.conf
            type: FileOrCreate
          name: cloudconfig

where gce.conf:

[Global]
multizone=true
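My understanding is that multizone=true makes the GCE provider look up instances across all zones of the region instead of only the zone the CCM itself runs in. After restarting the CCM, the node's providerID and zone labels can be checked with something like:

    kubectl describe node minplus0-md-2-vbvmr-856l7 | grep -iE 'providerid|topology.kubernetes.io/zone'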

esierra-stratio avatar Aug 26 '24 08:08 esierra-stratio

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 25 '24 09:09 k8s-triage-robot