
Can't deploy a cluster on AWS, instances not joining the cluster, node name mismatch

nicolasespiau opened this issue 2 years ago • 7 comments

/kind bug

1. What kops version are you running? The command kops version will display this information.

The issue occurs with 1.23.2, 1.24.x, and 1.25.0-alpha.1: every version I have tried.

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.22.11

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops create cluster -f PATH/TO/MY/full.yaml

5. What happened after the commands executed?

kOps creates the AWS resources and the instances come up, but nothing joins the cluster, not even the master.

I am working in a private hosted zone, with a custom DHCP option set.

Error:

Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-10-1-1-114.eu-west-1.compute.internal" is forbidden: User "system:node:ip-10-1-1-114.casetespotes.priv" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
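
The error shows two different names for the same instance: the kubelet authenticated as system:node:ip-10-1-1-114.casetespotes.priv (the hostname handed out by the private zone's DHCP options), while the CSI driver looks for a CSINode named ip-10-1-1-114.eu-west-1.compute.internal (the EC2 private DNS name), and the node authorizer only lets a node read its own CSINode. A quick way to confirm the mismatch (a sketch, assuming AWS CLI access and IMDSv2, which the instance groups below require via httpTokens: required; dopt-xxxxxxxx is a placeholder ID):

# On the affected instance: the name the kubelet will register with
hostname -f

# The name the AWS cloud provider expects (EC2 private DNS name), via IMDSv2
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/local-hostname

# From a workstation: the DHCP option set whose domain-name rewrites the hostname suffix
aws ec2 describe-vpcs --vpc-ids vpc-xxx --query 'Vpcs[0].DhcpOptionsId'
aws ec2 describe-dhcp-options --dhcp-options-ids dopt-xxxxxxxx   # placeholder ID from the previous command

If the two hostnames differ, that is exactly the name mismatch in the error above.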

6. What did you expect to happen?

I expected the nodes to join the cluster with a proper name.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: nonprod.casetespotes.priv
spec:
  api:
    loadBalancer:
      class: Network
      type: Internal
      additionalSecurityGroups:
      - sg-xxx
  authentication:
    aws:
      identityMapping:
        - arn: arn:aws:iam::xxx:user/nicolas
          username: nicolas
          groups: ["system:masters"]
        - arn: arn:aws:iam::xxx:user/stephane
          username: stephane
          groups: ["system:masters"]
        - arn: arn:aws:iam::xxx:role/KubeNonProdAdmin
          username: nonprodadmin
          groups: ["system:masters"]
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    awsEBSCSIDriver:
      enabled: true
      version: v1.2.1
    manageStorageClasses: true
  cloudProvider: aws
  clusterDNSDomain: cluster.local
  configBase: s3://XXX/nonprod.casetespotes.priv
  containerRuntime: docker
  dnsZone: XXX
  docker:
    execOpt:
    - native.cgroupdriver=cgroupfs
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
      volumeType: gp3
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
      volumeType: gp3
    memoryRequest: 100Mi
    name: events
  externalDns:
    provider: dns-controller
  externalPolicies:
    master:
    - arn:aws:iam::xxx:policy/AssumeRole
    node:
    - arn:aws:iam::xxx:policy/AssumeRole
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    allowPrivileged: true
    anonymousAuth: false
    apiAudiences:
    - kubernetes.svc.default
    authorizationMode: RBAC,Node
    cloudProvider: aws
    enableAdmissionPlugins:
    - NodeRestriction
    kubeletPreferredAddressTypes:
    - InternalIP
    - Hostname
    - ExternalIP
    logLevel: 8
    requestheaderAllowedNames:
    - aggregator
  kubeDNS:
    cacheMaxConcurrent: 150
    cacheMaxSize: 1000
    cpuRequest: 100m
    domain: cluster.local
    memoryLimit: 170Mi
    memoryRequest: 70Mi
    nodeLocalDNS:
      cpuRequest: 25m
      enabled: false
      memoryRequest: 5Mi
    provider: CoreDNS
    serverIP: 100.64.0.10
  kubeProxy:
    hostnameOverride: '@aws'
    logLevel: 5
  kubelet:
    anonymousAuth: false
    cgroupDriver: cgroupfs
    cloudProvider: aws
    hostnameOverride: '@aws'
    logLevel: 8
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.22.11
  masterInternalName: api.internal.nonprod.casetespotes.priv
  masterKubelet:
    anonymousAuth: false
    cgroupDriver: cgroupfs
    cloudProvider: aws
    clusterDNS: 100.64.0.10
    clusterDomain: cluster.local
    featureGates:
      CSIMigrationAWS: "true"
      InTreePluginAWSUnregister: "true"
    hostnameOverride: '@aws'
    logLevel: 8
    networkPluginName: cni
    nonMasqueradeCIDR: 100.64.0.0/10
  masterPublicName: api.nonprod.casetespotes.priv
  networkCIDR: 10.1.0.0/16
  networkID: vpc-xxx
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  sshKeyName: kube_pprd_rsa
  subnets:
  - cidr: 10.1.1.0/24
    id: subnet-xxx
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.1.3.0/24
    id: subnet-xxx
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.1.2.0/24
    id: subnet-xxx
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  - cidr: 10.1.4.0/24
    id: subnet-xxx
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: nonprod.casetespotes.priv
  name: master-eu-west-1a
spec:
  additionalSecurityGroups:
  - sg-xxx
  associatePublicIp: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20220308
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: t3a.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1a
  role: Master
  rootVolumeType: gp3
  subnets:
  - eu-west-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: nonprod.casetespotes.priv
  name: nodes-eu-west-1a
spec:
  additionalSecurityGroups:
  - sg-xxx
  associatePublicIp: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20220308
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3a.medium
  maxSize: 2
  minSize: 2
  nodeLabels:
    category: regular
    kops.k8s.io/instancegroup: nodes-eu-west-1a
  role: Node
  rootVolumeType: gp3
  subnets:
  - eu-west-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: nonprod.casetespotes.priv
  name: nodes-eu-west-1b
spec:
  additionalSecurityGroups:
  - sg-xxx
  associatePublicIp: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20220308
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    category: regular
    kops.k8s.io/instancegroup: nodes-eu-west-1b
  role: Node
  rootVolumeType: gp3
  subnets:
  - eu-west-1b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: nonprod.casetespotes.priv
  name: largenode
spec:
  additionalSecurityGroups:
  - sg-xxx
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20220308
  machineType: t3a.large
  maxSize: 2
  minSize: 1
  nodeLabels:
    category: largenode
    kops.k8s.io/instancegroup: largenode
  role: Node
  rootVolumeType: gp3
  subnets:
  - eu-west-1a
  - eu-west-1b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: nonprod.casetespotes.priv
  name: tools
spec:
  additionalSecurityGroups:
  - sg-xxx
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20220308
  machineType: t3a.large
  maxSize: 2
  minSize: 1
  nodeLabels:
    category: tools
  role: Node
  rootVolumeType: gp3
  subnets:
  - eu-west-1a

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

I've seen some issues here and there about DHCP option sets in private hosted zones, but I have already managed to deploy clusters with kops in my private hosted zone with the same config.
I hit this problem once before and rolling kops back to a previous version solved it, but this time that doesn't help.

nicolasespiau avatar Jul 01 '22 11:07 nicolasespiau

So you are right that this is related to DHCP options. kOps depends on the AWS cloud controller manager, which has the requirements specified here: https://cloud-provider-aws.sigs.k8s.io/prerequisites/
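
If you stay on Kubernetes 1.22 with IP-based naming, one common workaround is to make the VPC's DHCP option set hand out the regional EC2 domain instead of the private-zone domain, so the kubelet hostname matches the EC2 private DNS name again. A rough sketch, assuming eu-west-1 and placeholder IDs; note this affects every instance in the VPC that relies on the .priv search domain, and instances only pick it up on DHCP lease renewal or reboot:

aws ec2 create-dhcp-options \
  --dhcp-configurations \
    "Key=domain-name,Values=eu-west-1.compute.internal" \
    "Key=domain-name-servers,Values=AmazonProvidedDNS"
aws ec2 associate-dhcp-options --dhcp-options-id dopt-xxxxxxxx --vpc-id vpc-xxx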

For kOps, the more flexible resource-based naming alternative will be used by default if you are running kOps 1.24 and the Kubernetes version is 1.24. You can opt in on earlier Kubernetes versions by enabling the CCM:

spec:
  cloudControllerManager: {}
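
A minimal sketch of applying that opt-in to this cluster, assuming the cluster name from the manifest above:

kops edit cluster nonprod.casetespotes.priv    # add the cloudControllerManager block under spec
kops update cluster nonprod.casetespotes.priv --yes
kops rolling-update cluster nonprod.casetespotes.priv --yes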

/remove-kind bug
/kind support

olemarkus avatar Jul 01 '22 11:07 olemarkus

Maybe try not setting awsEBSCSIDriver.version; v1.2.1 seems a bit too old.

hakman avatar Jul 01 '22 11:07 hakman

I've struggled a little, but I think I finally managed to get my cluster up and running. I still have some things to deploy before I'm sure everything works, but at least my nodes are joining the cluster.

Here are the changes I made to make it work:

  • kops version: 1.24.0-beta.3
  • cluster.spec.kubernetesVersion: 1.24.0
  • cluster.spec.containerRuntime: containerd
  • removed all occurrences of cgroupDriver
  • removed the pinned awsEBSCSIDriver.version, as mentioned by @hakman
  • added spec.cloudControllerManager, as suggested by @olemarkus
  • replaced cloudProvider: aws with cloudProvider: external in the kubelet, masterKubelet, and kubeAPIServer sections
  • added some policies as indicated in the AWS cloud provider prerequisites: https://cloud-provider-aws.sigs.k8s.io/prerequisites/

Here is my cluster.yaml file:

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: nonprod.casetespotes.priv
spec:
  api:
    loadBalancer:
      class: Network
      type: Internal
      additionalSecurityGroups:
      - sg-xxx
  authentication:
    aws:
      identityMapping:
        - arn: arn:aws:iam::xxx:user/nicolas
          username: nicolas
          groups: ["system:masters"]
        - arn: arn:aws:iam::xxx:user/stephane
          username: stephane
          groups: ["system:masters"]
        - arn: arn:aws:iam::xxx:role/KubeNonProdAdmin
          username: nonprodadmin
          groups: ["system:masters"]
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    awsEBSCSIDriver:
      enabled: true
    manageStorageClasses: true
  cloudControllerManager:
    cloudProvider: aws
  cloudProvider: aws
  clusterDNSDomain: cluster.local
  configBase: s3://ctp-state-store/nonprod/nonprod.casetespotes.priv
  containerRuntime: containerd
  dnsZone: XXX
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
      volumeType: gp3
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
      volumeType: gp3
    memoryRequest: 100Mi
    name: events
  externalDns:
    provider: dns-controller
  externalPolicies:
    master:
    - arn:aws:iam::xxx:policy/AssumeRole
    - arn:aws:iam::xxx:policy/KubeControlPlanePolicy
    node:
    - arn:aws:iam::xxx:policy/AssumeRole
    - arn:aws:iam::xxx:policy/KubeNodePolicy
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    allowPrivileged: true
    anonymousAuth: false
    apiAudiences:
    - kubernetes.svc.default
    authorizationMode: RBAC,Node
    cloudProvider: external
    enableAdmissionPlugins:
    - NodeRestriction
    kubeletPreferredAddressTypes:
    - InternalIP
    - Hostname
    - ExternalIP
    logLevel: 8
    requestheaderAllowedNames:
    - aggregator
  kubeDNS:
    cacheMaxConcurrent: 150
    cacheMaxSize: 1000
    cpuRequest: 100m
    domain: cluster.local
    memoryLimit: 170Mi
    memoryRequest: 70Mi
    nodeLocalDNS:
      cpuRequest: 25m
      enabled: false
      memoryRequest: 5Mi
    provider: CoreDNS
    serverIP: 100.64.0.10
  kubeProxy:
    hostnameOverride: '@aws'
    logLevel: 5
  kubelet:
    anonymousAuth: false
    cloudProvider: external
    hostnameOverride: '@aws'
    logLevel: 8
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: "1.24.0"
  masterInternalName: api.internal.nonprod.casetespotes.priv
  masterKubelet:
    anonymousAuth: false
    cloudProvider: external
    clusterDNS: 100.64.0.10
    clusterDomain: cluster.local
    featureGates:
      CSIMigrationAWS: "true"
      InTreePluginAWSUnregister: "true"
    hostnameOverride: '@aws'
    logLevel: 8
  masterPublicName: api.nonprod.casetespotes.priv
  networkCIDR: 10.1.0.0/16
  networkID: vpc-xxx
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  sshKeyName: kube_pprd_rsa
  subnets:
  - cidr: 10.1.1.0/24
    id: subnet-xxx
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.1.3.0/24
    id: subnet-xxx
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.1.2.0/24
    id: subnet-xxx
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  - cidr: 10.1.4.0/24
    id: subnet-xxx
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
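
A rough sequence to roll out the replaced manifest and check the result; the state-store path is inferred from configBase above, and with resource-based naming the node names should show up as EC2 instance IDs (i-xxxxxxxx):

export KOPS_STATE_STORE=s3://ctp-state-store/nonprod   # inferred from configBase above
kops replace -f cluster.yaml
kops update cluster nonprod.casetespotes.priv --yes
kops rolling-update cluster nonprod.casetespotes.priv --yes
kops validate cluster --name nonprod.casetespotes.priv --wait 10m
kubectl get nodes -o wide
kubectl get csinodes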

nicolasespiau avatar Jul 01 '22 14:07 nicolasespiau

I have had to remove --network-cidr=10.0.0.0/16 to get Kops to build my cluster in AWS.

richard-scott avatar Jul 11 '22 19:07 richard-scott

I have had to remove --network-cidr=10.0.0.0/16 to get Kops to build my cluster in AWS.

Actually, cancel that idea. I can't get the latest 1.24.0 to bring up a cluster at all in AWS. I just get stuck with errors like this:

VALIDATION ERRORS
KIND    NAME                                                            MESSAGE
Node    i-0857b9d787768ed77                                             master "i-0857b9d787768ed77" is missing kube-controller-manager pod
Node    i-0857b9d787768ed77                                             master "i-0857b9d787768ed77" is missing kube-scheduler pod
Node    i-0d367a35283fe9bf9                                             master "i-0d367a35283fe9bf9" is missing kube-controller-manager pod
Node    i-0d367a35283fe9bf9                                             master "i-0d367a35283fe9bf9" is missing kube-scheduler pod
Node    i-0fd371597c077aa50                                             master "i-0fd371597c077aa50" is missing kube-controller-manager pod
Node    i-0fd371597c077aa50                                             master "i-0fd371597c077aa50" is missing kube-scheduler pod
Pod     kube-system/aws-load-balancer-controller-6577c7c6b9-hxp2j       system-cluster-critical pod "aws-load-balancer-controller-6577c7c6b9-hxp2j" is pending
Pod     kube-system/aws-load-balancer-controller-6577c7c6b9-j5dpz       system-cluster-critical pod "aws-load-balancer-controller-6577c7c6b9-j5dpz" is pending
Pod     kube-system/cert-manager-webhook-786b8495bf-vkz5b               system-cluster-critical pod "cert-manager-webhook-786b8495bf-vkz5b" is not ready (cert-manager)
Pod     kube-system/coredns-6b5499ccb9-hzp87                            system-cluster-critical pod "coredns-6b5499ccb9-hzp87" is not ready (coredns)
Pod     kube-system/ebs-csi-controller-7b86d5f8f6-5fnj4                 system-cluster-critical pod "ebs-csi-controller-7b86d5f8f6-5fnj4" is not ready (ebs-plugin)
Pod     kube-system/ebs-csi-controller-7b86d5f8f6-hbk6r                 system-cluster-critical pod "ebs-csi-controller-7b86d5f8f6-hbk6r" is not ready (ebs-plugin)

richard-scott avatar Jul 12 '22 11:07 richard-scott

Based on those symptoms, it seems like you have a webhook that blocks containers from starting.
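
A quick way to see which webhooks might be intercepting pod creation, and why the flagged pods are stuck (the cert-manager webhook label selector below is an assumption; adjust it to whatever labels your deployment uses):

kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
kubectl -n kube-system describe pod -l app.kubernetes.io/name=webhook    # assumed cert-manager webhook label
kubectl -n kube-system get events --sort-by=.lastTimestamp | tail -20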

olemarkus avatar Jul 12 '22 11:07 olemarkus

I had to reduce the replica count in the resources that kOps created; otherwise it tried to create 2 replicas when there was only one node to put them on.
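
On a cluster with a single worker node, something along these lines works around the pending replicas (deployment names taken from the validation output above; kOps addon management may scale them back on a later update):

kubectl -n kube-system scale deployment aws-load-balancer-controller --replicas=1
kubectl -n kube-system scale deployment ebs-csi-controller --replicas=1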

richard-scott avatar Jul 27 '22 13:07 richard-scott

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 25 '22 13:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 24 '22 14:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Dec 24 '22 14:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 24 '22 14:12 k8s-ci-robot