kops k8s cluster deploy on Debian 11 not working as expected (coredns, ebs-csi)

danielduduta opened this issue on Aug 22, 2022

/kind bug

1. What kops version are you running? The command kops version will display this information.

Client version: 1.24.1 (git-v1.24.1)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: v1.24.3
Kustomize Version: v4.5.4
Server Version: v1.24.3

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops create cluster operations.k8s.local --node-count 1 --networking amazonvpc --zones eu-west-1a,eu-west-1b,eu-west-1c --master-size c5.xlarge --node-size c5.xlarge --dry-run -o yaml > operations.k8s.yaml

Update spec.image to 136693071363/debian-11-amd64-20220816-1109 for all InstanceGroups, then:

kops create -f operations.k8s.yaml
kops update cluster --name operations.k8s.local --yes
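
For completeness, a hedged sketch of how one might confirm the image change and wait for the cluster after the update; the kops get ig listing and the kops validate step are assumptions added for illustration, not part of the original report:

# confirm the Debian 11 image was applied to every InstanceGroup
kops get ig --name operations.k8s.local -o yaml | grep image:

# wait for the instances and system pods to become healthy
kops validate cluster --name operations.k8s.local --wait 10m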

5. What happened after the commands executed?

admin@i-0061efbbb5a6c5ff8:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                          READY   STATUS             RESTARTS        AGE
kube-system   aws-cloud-controller-manager-9mmq2            1/1     Running            0               4m25s
kube-system   aws-node-lvbqr                                1/1     Running            0               4m25s
kube-system   aws-node-xr78k                                1/1     Running            0               3m20s
kube-system   coredns-autoscaler-865477f6c7-v4dqj           1/1     Running            0               4m24s
kube-system   coredns-d48868b66-jx7lv                       0/1     Running            1 (40s ago)     4m24s
kube-system   dns-controller-7467bcd6ff-fl6jk               1/1     Running            0               4m24s
kube-system   ebs-csi-controller-5c9b6f6b6-g5zbm            4/5     CrashLoopBackOff   4 (28s ago)     4m24s
kube-system   ebs-csi-node-5tzn7                            2/3     CrashLoopBackOff   4 (28s ago)     4m25s
kube-system   ebs-csi-node-ssq8p                            3/3     Running            4 (6s ago)      3m20s
kube-system   etcd-manager-events-i-0061efbbb5a6c5ff8       1/1     Running            0               3m35s
kube-system   etcd-manager-main-i-0061efbbb5a6c5ff8         1/1     Running            0               3m39s
kube-system   kops-controller-x7rgp                         1/1     Running            0               4m25s
kube-system   kube-apiserver-i-0061efbbb5a6c5ff8            2/2     Running            0               3m21s
kube-system   kube-controller-manager-i-0061efbbb5a6c5ff8   1/1     Running            2 (4m55s ago)   3m7s
kube-system   kube-proxy-i-0061efbbb5a6c5ff8                1/1     Running            0               3m57s
kube-system   kube-proxy-i-0a5b036b8f7263fda                1/1     Running            0               2m35s
kube-system   kube-scheduler-i-0061efbbb5a6c5ff8            1/1     Running            0               3m24s

6. What did you expect to happen?

coredns and ebs-csi running.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2022-08-22T08:04:46Z"
  name: operations.k8s.local
spec:
  api:
    loadBalancer:
      class: Classic
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://figshare-kops/operations.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.24.3
  masterPublicName: api.operations.k8s.local
  networkCIDR: 172.20.0.0/16
  networking:
    amazonvpc: {}
  nonMasqueradeCIDR: 172.20.0.0/16
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  sshKeyName: duduta-operations
  subnets:
  - cidr: 172.20.32.0/19
    name: eu-west-1a
    type: Public
    zone: eu-west-1a
  - cidr: 172.20.64.0/19
    name: eu-west-1b
    type: Public
    zone: eu-west-1b
  - cidr: 172.20.96.0/19
    name: eu-west-1c
    type: Public
    zone: eu-west-1c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2022-08-22T08:04:49Z"
  labels:
    kops.k8s.io/cluster: operations.k8s.local
  name: master-eu-west-1a
spec:
  image: 136693071363/debian-11-amd64-20220816-1109
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: c5.xlarge
  manager: CloudGroup
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1a
  role: Master
  subnets:
  - eu-west-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2022-08-22T08:04:49Z"
  labels:
    kops.k8s.io/cluster: operations.k8s.local
  name: nodes-eu-west-1a
spec:
  image: 136693071363/debian-11-amd64-20220816-1109
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: c5.xlarge
  manager: CloudGroup
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-eu-west-1a
  role: Node
  subnets:
  - eu-west-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2022-08-22T08:04:49Z"
  labels:
    kops.k8s.io/cluster: operations.k8s.local
  name: nodes-eu-west-1b
spec:
  image: 136693071363/debian-11-amd64-20220816-1109
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: c5.xlarge
  manager: CloudGroup
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-eu-west-1b
  role: Node
  subnets:
  - eu-west-1b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2022-08-22T08:04:50Z"
  labels:
    kops.k8s.io/cluster: operations.k8s.local
  name: nodes-eu-west-1c
spec:
  image: 136693071363/debian-11-amd64-20220816-1109
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: c5.xlarge
  manager: CloudGroup
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-eu-west-1c
  role: Node
  subnets:
  - eu-west-1c

8. Please run the commands with the most verbose logging by adding the -v 10 flag. Paste the logs into this report, or into a gist and provide the gist link here.

admin@i-0061efbbb5a6c5ff8:~$ kubectl logs coredns-d48868b66-jx7lv -n kube-system
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration MD5 = 35aa07598ca78c83ea20e1faff6dfc16
CoreDNS-1.8.6
linux/amd64, go1.17.1, 13a9191
[ERROR] plugin/errors: 2 8293112652682877544.1657391515730422967. HINFO: read udp 172.20.41.54:57500->172.20.0.2:53: i/o timeout
admin@i-0061efbbb5a6c5ff8:~$ kubectl logs ebs-csi-node-ssq8p -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I0822 08:11:46.901136       1 metadata.go:85] retrieving instance data from ec2 metadata
W0822 08:11:53.180678       1 metadata.go:88] ec2 metadata is not available
I0822 08:11:53.180696       1 metadata.go:96] retrieving instance data from kubernetes api
I0822 08:11:53.181435       1 metadata.go:101] kubernetes api is available
panic: error getting Node i-0a5b036b8f7263fda: Get "https://172.20.0.1:443/api/v1/nodes/i-0a5b036b8f7263fda": dial tcp 172.20.0.1:443: i/o timeout

goroutine 1 [running]:
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newNodeService(0xc00003afa0)
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go:86 +0x269
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver({0xc000517f30, 0x8, 0x55})
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:95 +0x38e
main.main()
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:46 +0x365
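
As a side note (an assumption added for triage, not something from the original report): the three timeouts above, the CoreDNS upstream at 172.20.0.2:53, EC2 instance metadata, and the in-cluster API server at 172.20.0.1:443, can be checked directly. A hypothetical sketch, run first on the affected node over SSH and then again from a pod scheduled on that node, to see whether only the pod network is broken:

# VPC resolver used as the CoreDNS upstream in the error above
dig +time=2 +tries=1 @172.20.0.2 amazonaws.com

# IMDSv2: request a session token, then read the instance id
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id

# kubernetes service IP from the panic above; any HTTP response (even 401/403)
# means the network path works, while a hang matches the dial timeout
curl -sk --max-time 5 https://172.20.0.1:443/healthz

If these checks succeed from the host but fail from a pod, that would point at the pod networking path (the amazonvpc CNI on this OS image) rather than at the coredns and ebs-csi addons themselves.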

9. Anything else we need to know?

(danielduduta, Aug 22, 2022)

Same behavior with Ubuntu 22.04 (Jammy); it works fine with Ubuntu 20.04 (Focal).

(danielduduta, Aug 22, 2022)

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

(k8s-triage-robot, Nov 20, 2022)

Ubuntu 22.04 is a known issue; see #14140. The same issue could be happening on Debian 11 as well.

(olemarkus, Nov 20, 2022)

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

(k8s-triage-robot, Dec 20, 2022)

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

(k8s-triage-robot, Jan 19, 2023)

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

(k8s-ci-robot, Jan 19, 2023)

Same behavior with Ubuntu 22.04 (Jammy), works fine with Ubuntu 20.04 (Focal)

Also happened on Ubuntu 20.04.5 LTS with EKS:

EKS version: v1.23.7
kernel version: 5.15.0-1022-aws
container runtime: containerd://1.5.9

k logs pod/ebs-csi-node-gf77x -n kube-system
I0301 16:38:59.305809       1 node.go:98] regionFromSession Node service
I0301 16:38:59.305873       1 metadata.go:85] retrieving instance data from ec2 metadata
W0301 16:39:05.584601       1 metadata.go:88] ec2 metadata is not available
I0301 16:39:05.584617       1 metadata.go:96] retrieving instance data from kubernetes api
I0301 16:39:05.585571       1 metadata.go:101] kubernetes api is available
panic: error getting Node ip-172-21-12-4.ec2.internal: Get "https://10.100.0.1:443/api/v1/nodes/ip-172-21-12-4.ec2.internal": dial tcp 10.100.0.1:443: i/o timeout

goroutine 1 [running]:
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newNodeService(0xc00022c1e0)
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go:101 +0x345
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver({0xc000525f30, 0x8, 0x31c6b50?})
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:95 +0x393
main.main()
	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:46 +0x37d

(dalvarezquiroga, Mar 1, 2023)

Does anyone understand what the root cause of the issue on Ubuntu 22.04 is, and are there any known workarounds?

(plaformsre, Oct 3, 2023)