[EKS] [request]: API flag to initialize completely bare EKS cluster

Open sc250024 opened this issue 4 years ago • 38 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

Essentially, I'm looking for an extra option in the AWS API where EKS is initialized with a completely bare cluster (i.e. no coredns, aws-node, or kube-proxy deployments / daemonsets). Only the EKS control plane is provided.

Which service(s) is this request for?

EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Kubernetes lifecycle management is a problem which many tools are solving / attempting to solve. With Kubernetes objects, there's no easy way to "inherit" an object that already exists, and apply changes over it. If an object exists, and you want to change it without completely deleting / reinstalling it, you either have to (AFAIK):

  • Run a kubectl edit or kubectl patch with the in-place objects to change what you want.
  • Have the original manifest which was applied previously, and run a kubectl apply with the new options.

In fact, the Kubernetes documentation here talks about the various methods: https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#in-place-updates-of-resources

With Helm charts this problem is pronounced. If I want to apply a Helm chart, and someone has already applied a Kubernetes YAML manifest manually with similar names, I will get errors with Helm because those objects already exist.

For my company, we want to provision / de-provision EKS clusters with as much automation as possible, but what we find is that there are certain manual steps which must be performed with EKS. To name a few:

  • Kube-Proxy ConfigMap metrics
    • In order to get Prometheus to successfully scrape the kube-proxy process, we have to update the listen address in the ConfigMap like so:
# Edit kube-proxy ConfigMap to allow metrics scraping.
# Replace `metricsBindAddress: 127.0.0.1:10249` with `metricsBindAddress: 0.0.0.0:10249`
$ kubectl edit --namespace kube-system configmap/kube-proxy-config

# Afterwards, restart all `kube-proxy` pods
$ kubectl delete pods --namespace kube-system --selector='k8s-app=kube-proxy'
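The interactive `kubectl edit` above can also be scripted for automation; a sketch, assuming the default EKS ConfigMap name and key (which may differ between EKS versions):

```shell
# Non-interactive variant: rewrite metricsBindAddress and re-apply
kubectl --namespace kube-system get configmap kube-proxy-config --output yaml \
  | sed 's/metricsBindAddress: 127.0.0.1:10249/metricsBindAddress: 0.0.0.0:10249/' \
  | kubectl apply -f -

# Restart the kube-proxy pods so the DaemonSet picks up the new config
kubectl delete pods --namespace kube-system --selector='k8s-app=kube-proxy'
```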
  • CoreDNS

    • The CoreDNS component that comes with EKS does not have a proportional autoscaler, whereas the CoreDNS Helm chart does. After provisioning the cluster, we then (1) delete the default CoreDNS deployment and all associated resources, and then (2) apply the Helm chart.
  • AWS VPC CNI

    • Same as CoreDNS. After provisioning the cluster, we then (1) delete the default AWS VPC CNI, and then (2) apply the AWS VPC CNI Helm chart.
  • Kube-proxy

    • Same as above.
  • AWS Auth ConfigMap

    • The ConfigMap object already exists, so we have to take special care to update it ourselves.

All of these (and similar) problems would be solved by simply having a flag to initialize a cluster which is completely empty, and let whatever tools we use internally to build up the cluster as we see fit. This is more of a functionality for power / advanced users, but the use case definitely exists.

Are you currently working around this issue?

We are, but we are either performing these actions manually, or as part of a pipeline. For the case of CoreDNS / AWS VPC CNI / Kube-Proxy, we essentially must store a Kubernetes YAML in our Git repositories which we can point to when running kubectl delete.

sc250024 avatar May 29 '20 15:05 sc250024

To improve the workaround, it should be possible to use kubectl annotate and then adopt the existing resource objects into a Helm release, as of Helm 3.2.0. See "Release Note" on the pull request for details.

For some reason, that didn't actually make it into the release notes, or the docs. Future work will automate this further in Helm, so they might be waiting to document it with that.
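For reference, adoption works by setting the ownership metadata that Helm 3.2+ checks before refusing to overwrite an object. A hypothetical example (the release name `coredns` and the target Deployment are assumptions):

```shell
# Mark an existing object as owned by a Helm release so a subsequent
# `helm install coredns ...` will adopt it instead of erroring out
kubectl --namespace kube-system annotate --overwrite deployment coredns \
  meta.helm.sh/release-name=coredns \
  meta.helm.sh/release-namespace=kube-system
kubectl --namespace kube-system label --overwrite deployment coredns \
  app.kubernetes.io/managed-by=Helm
```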

TBBle avatar Jun 15 '20 19:06 TBBle

@sc250024 do you have an example of your workflow for capturing CoreDNS and kube-proxy as Helm charts? We already do this for the aws-vpc-cni (we use the remote yaml referenced in the upgrade guide to delete this).

stevehipwell avatar Jan 07 '21 10:01 stevehipwell

@sc250024 do you have an example of your workflow for capturing CoreDNS and kube-proxy as Helm charts? We already do this for the aws-vpc-cni (we use the remote yaml referenced in the upgrade guide to delete this).

Actually we don't do that currently; we're using the coredns and kube-proxy installations that come with the cluster by default. For CoreDNS specifically, we'd like to use the Helm chart since it includes an autoscaler that adds Pods as the cluster size itself scales.

In general, we automate a lot of our provisioning, and right now, we have to do a lot of hacks to either apply something over an existing resource, or patch an existing resource. It's really just running kubectl commands through Terraform.

sc250024 avatar Jan 07 '21 11:01 sc250024

@sc250024 it sounds like we've got very similar requirements. Currently we have automated kube-proxy and CoreDNS version patching via Terraform, and when we bootstrap a cluster we remove the installed aws-vpc-cni and replace it with the Helm chart. My highest priority would be to delete the default CoreDNS and capture that with a Helm chart.

stevehipwell avatar Jan 07 '21 11:01 stevehipwell

I was curious, and had a play with the CoreDNS Helm chart to see how close I could get to generating the existing AWS deployment of CoreDNS. It's not far off, but it highlights a few differences:

  • AWS might be running a patched CoreDNS deployment that needs to look at the node list, based on ClusterRole differences.
  • AWS has done some hardening in their deployment (read-only root with /tmp on emptyDir volume, and all privileges dropped except NET_BIND_SERVICE) which CoreDNS doesn't have in their chart, and can't fully express in the values.yaml
  • CoreDNS Helm chart Prometheus metrics support might not be functional, looks like it's missing a containerPort.
  • CoreDNS Helm chart distinguishes Live and Ready, AWS's deployment does not. (Although maybe AWS's version of CoreDNS suffers from https://github.com/coredns/coredns/issues/4099 in the ready plugin)
  • CoreDNS Helm chart specifies "criticalness" using annotations that (at least in one case) haven't been honoured since k8s 1.16.

(There's more details and less-impactful differences in the comments on the YAML)

So you could use the values.yaml attached (updating REGION and DNS_CLUSTER_IP as is done for the AWS applied YAML), and then annotate/label the conflicting objects for adoption, delete the kube-dns Service (because you need to steal its ClusterIP), and helm install should adopt the existing options and take over as the cluster DNS service.

Of course, deleting the kube-dns Service isn't great, as you have a period of DNS outage, but I'm unaware of a good way to transition cleanly without that. That should be the only object you need to delete by hand before installing CoreDNS, though, so the outage period can be measured in tens of seconds, assuming everything else works. (Unless there are other things with immutable fields... the Deployment might be one, actually.)

Helpfully, every object in the AWS yaml is labelled with eks.amazonaws.com/component: kube-dns, so it's easy to hunt-down leftover objects after the adoption: things with that label but lacking the app.kubernetes.io/managed-by: "Helm" label are orphaned leftovers. Things with both labels were adopted by Helm and are part of the chart now.
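That label-based hunt can be done with a single selector query; a sketch using the labels described above:

```shell
# List objects that carry the EKS kube-dns component label but were NOT
# adopted by Helm -- these are the orphaned leftovers to clean up
kubectl --namespace kube-system get all,configmap,serviceaccount \
  --selector='eks.amazonaws.com/component=kube-dns,app.kubernetes.io/managed-by!=Helm'
```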

A couple of the above issues (and other things called out in the text) are possibly bug-reports or feature-requests to be raised with CoreDNS.

Note that these are not recommended settings. They are mirroring the existing AWS YAML as closely as possible, including possible feature regressions, e.g., rollback to CoreDNS 1.7.0, disabling lameduck and ttl in the service setup.

On the other hand, some are important, like limiting the Deployment to 64-bit Linux hosts, and EC2 (i.e. not Fargate). Unless you want CoreDNS on Fargate of course. Then it's a regression. ^_^

A values.yaml describing the differences
# Contrasting AWS CoreDNS 1.7.0 install from https://docs.aws.amazon.com/eks/latest/userguide/coredns.html
#  curl -o dns.yaml https://s3.us-west-2.amazonaws.com/amazon-eks/cloudformation/2020-10-29/dns.yaml
# VS the current CoreDNS 1.8.0 Helm chart
#  helm repo add coredns https://coredns.github.io/helm
#  helm repo update
#  helm template coredns coredns/coredns --namespace kube-system --values aws.coredns.values.yaml
# (This file is aws.coredns.values.yaml)

## Differences I could not capture:

# AWS's Service is named kube-dns, CoreDNS creates one named coredns

# The ClusterRole and ClusterRoleBinding in AWS's YAML are defaults named system:coredns,
# with auto-reconciliation disabled, see
# https://kubernetes.io/docs/reference/access-authn-authz/rbac/#default-roles-and-role-bindings
# and have the following extra rule, I'm not sure why.
#  - apiGroups:
#    - ""
#    resources:
#    - nodes
#    verbs:
#    - get
#
# This might be something that AWS have patched into their CoreDNS binary's kubernetes plugin,
# i.e. similar to the one proposed at https://github.com/coredns/coredns/issues/3077
# which was eventually punted as a different plugin and abandoned.
#
# CoreDNS Helm chart names its ClusterRole/Binding simply 'coredns' (i.e. fullNameOverride) and they are labelled as
#  kubernetes.io/cluster-service: true
# instead.

# The Prometheus metrics have a separate Service in the Helm chart, but are scraped
# from the main Service in the CoreDNS chart
# That said, the CoreDNS chart doesn't seem to have a containerPort exposed for them. Bug in the Helm chart?

# AWS's Pod has the following that CoreDNS Helm chart doesn't support
#        securityContext:
#          allowPrivilegeEscalation: false
#          capabilities:
#            add:
#            - NET_BIND_SERVICE
#            drop:
#            - all
#          readOnlyRootFilesystem: true

# CoreDNS Helm chart has the following annotations (old name for priorityClassName and tolerations respectively)
# when isClusterService is set.
# Goodness, these are old, and someone should fix the CoreDNS chart, as they are no longer effective in current k8s.
#        scheduler.alpha.kubernetes.io/critical-pod: ''
#        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'

# AWS Pod mounts the config-volume read-only.

# Helm chart distinguishes readiness probe from health probe. (More-modern approach)

# Helm chart specifies a maxSurge (25%) for the Deployment's rollingUpdate.

# Various minor differences:
# - Labels and annotations
# - The container port names are different
# - Generated Helm chart doesn't have namespace metadata, because Helm takes care of that.

fullnameOverride: coredns

serviceAccount:
  create: true

priorityClassName: system-cluster-critical

replicaCount: 2

image:
  repository: 602401143452.dkr.ecr.REGION.amazonaws.com/eks/coredns
  tag: v1.7.0-eksbuild.1

podAnnotations:
  eks.amazonaws.com/compute-type: ec2

service:
  clusterIP: DNS_CLUSTER_IP

extraVolumes:
- name: tmp
  emptyDir: {}

extraVolumeMounts:
- name: tmp
  mountPath: /tmp

terminationGracePeriodSeconds: 0

resources:
  limits:
    cpu: null
    memory: 170Mi
  requests:
    memory: 70Mi

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "beta.kubernetes.io/os"
          operator: In
          values:
          - linux
        - key: "beta.kubernetes.io/arch"
          operator: In
          values:
          - amd64
          - arm64
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - coredns
        topologyKey: kubernetes.io/hostname
      weight: 100

tolerations:
- key: node-role.kubernetes.io/master
  effect: NoSchedule
- key: "CriticalAddonsOnly"
  operator: "Exists"

prometheus:
  service:
    enabled: true

# Because of the way Helm works, you cannot override parts of this
# array, so the whole thing is copied out of the coredns/coredns
# defaults (without comments), and the differences with AWS noted.
servers:
- zones:
  - zone: .
  port: 53
  plugins:
  - name: errors
  - name: health
    # AWS doesn't have this
    #configBlock: |-
    #  lameduck 5s
  # AWS doesn't use this plugin at all, but it's needed elsewhere in the chart
  - name: ready
  - name: kubernetes
    parameters: cluster.local in-addr.arpa ip6.arpa
    configBlock: |-
      pods insecure
      fallthrough in-addr.arpa ip6.arpa
    # AWS doesn't have this
      # ttl 30
  - name: prometheus
    # parameters: 0.0.0.0:9153
    # AWS uses the below, I guess that means we're IPv6-ready? *cough*
    parameters: :9153
  - name: forward
    parameters: . /etc/resolv.conf
  - name: cache
    parameters: 30
  - name: loop
  - name: reload
  - name: loadbalance

TBBle avatar Jan 07 '21 15:01 TBBle

@TBBle that's a great summary of the differences. I think the next step would be to open a PR on the CoreDNS chart to close the gap and allow all of the AWS settings to be set correctly.

As this issue is about providing a bare EKS cluster, the potential downtime is probably not an issue; until a bare cluster is an option, we remove the unwanted add-ons before the cluster has any nodes to run them on.

stevehipwell avatar Jan 07 '21 15:01 stevehipwell

I should point out that I haven't tested this. It was done using helm template and comparing the YAML. There are definitely opportunities to improve the CoreDNS Helm chart, but I don't think there was anything (except maybe the Prometheus metrics issue) that would make those improvements a blocker for doing the switch today, if I happened to be setting up an EKS cluster.

That said, I probably would not try and replicate some of the AWS differences, like fullnameOverride or the servers block changes, as they were just illustrative.

One thing to keep in mind is that perhaps it's important that the Service be named kube-dns? I didn't enforce that in my illustration. The k8s docs suggest that name might be relied upon by pieces of the system...

So it might be worth proposing that the CoreDNS Helm chart specifically be able to override the Service name separately from the existing fullname used to name the objects.

Or just install the chart as helm install kube-dns coredns/coredns and use fullnameOverride: kube-dns in the values.yaml. I suspect that won't adopt anything, after you delete the existing Service. But it's a little ugly. -_-
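That last suggestion would look something like this; a sketch, assuming the values file from the earlier comment:

```shell
# Delete the original Service first (brief DNS outage), since its
# ClusterIP must be reused and the name would otherwise conflict
kubectl --namespace kube-system delete service kube-dns

# Install the chart under the kube-dns name so the Service keeps
# the well-known name the rest of the system may rely on
helm install kube-dns coredns/coredns \
  --namespace kube-system \
  --set fullnameOverride=kube-dns \
  --values aws.coredns.values.yaml
```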

TBBle avatar Jan 07 '21 15:01 TBBle

@TBBle that's a great summary of the differences. I think the next step would be to open a PR on the CoreDNS chart to close the gap and allow all of the AWS settings to be set correctly.

As this issue is about providing a bare EKS cluster, the potential downtime is probably not an issue; until a bare cluster is an option, we remove the unwanted add-ons before the cluster has any nodes to run them on.

@stevehipwell said what I was going to say, which is that the main point was to raise the question for AWS about whether or not they can support this feature. But to @TBBle , appreciate the help with the Helm chart values 😊

To me, it's either one of two things:

  • AWS manages kube-proxy, coredns, aws-node, and any other "core" cluster components completely. This means that autoscaling (where appropriate) happens automatically, and components are upgraded automatically when there's a cluster upgrade.

  • AWS allows people to use an "empty cluster" flag, and allows us to manage everything ourselves with no interruption from them.

Right now, it's in an awkward in-between state in my opinion. They're trying to provide the base cluster components (which makes sense), but stumble a bit with the upgrade path when the control plane is upgraded.

sc250024 avatar Jan 07 '21 16:01 sc250024

The AWS-managed add-ons approach shipped last month, albeit not many add-ons yet, just aws-node. https://github.com/aws/containers-roadmap/issues/252#issuecomment-736690357

That same ticket did confirm that "bare cluster" is also on the roadmap. I suspect it'll come implicitly once the remaining existing YAML add-ons are all migrated to EKS Add-ons, i.e., #1159.

TBBle avatar Jan 07 '21 16:01 TBBle

Hi all,

This feature is in our development plans and I've added it to our public roadmap. We envision that in time, all EKS clusters will use managed add-ons and we will not boot components into clusters that are not managed by EKS and that you cannot control via the EKS APIs. Our 3 core add-ons (VPC CNI, coredns, kube-proxy) will still be enabled by default, but you can optionally elect to have them not be installed when you create the cluster.

tabern avatar Jan 14 '21 18:01 tabern

Hi all,

This feature is in our development plans and I've added it to our public roadmap. We envision that in time, all EKS clusters will use managed add-ons and we will not boot components into clusters that are not managed by EKS and that you cannot control via the EKS APIs. Our 3 core add-ons (VPC CNI, coredns, kube-proxy) will still be enabled by default, but you can optionally elect to have them not be installed when you create the cluster.

Much appreciated @tabern. Thank you!

sc250024 avatar Jan 14 '21 18:01 sc250024

@tabern where are we with this after today's announcement?

stevehipwell avatar May 20 '21 21:05 stevehipwell

I have a hacky workaround: change the eks:addon-manager role in the kube-system namespace to remove its update and patch permissions for the ConfigMap.
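A sketch of what that looks like (inspect the Role first; the exact rule layout may differ between EKS versions):

```shell
# Inspect the Role to find the rule granting update/patch on configmaps
kubectl --namespace kube-system get role eks:addon-manager -o yaml

# Then remove the "update" and "patch" verbs from that rule so the
# add-on manager can no longer revert your ConfigMap changes
kubectl --namespace kube-system edit role eks:addon-manager
```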

shixuyue avatar Jun 01 '21 19:06 shixuyue

@shixuyue what exactly are you doing to manage kube-proxy and coredns?

stevehipwell avatar Jun 07 '21 07:06 stevehipwell

@stevehipwell I don't have special needs for kube-proxy, but I need to add a consul forwarder to coredns, so it can resolve consul endpoints from another "cluster" (it's not k8s).

shixuyue avatar Jun 07 '21 14:06 shixuyue

I see that the docs now contain a method for removing add-ons, but I don't think it is possible to do this without removing the default config. This could be useful if there were valid Helm charts for coredns and kube-proxy in aws/eks-charts (or instructions for using the official coredns Helm chart).

stevehipwell avatar Jun 07 '21 15:06 stevehipwell

Oh, yeah, my hacky workaround works for me. Each time we want to update the plugins, we need to re-enable the permissions that we just disabled, and once the update is done, we have to disable them again, so the add-on manager doesn't have the permission to revert the corefile ConfigMap to its default. Not ideal, but it's easy and simple, and good as a temporary workaround.

shixuyue avatar Jun 07 '21 15:06 shixuyue

@stevehipwell we're working on making this change for later in 2021, the latest release for CoreDNS and Kube-proxy support was a step towards completing this issue.

Can you tell me more about how having a published helm chart would help you remove the default add-ons? Would you want to remove the add-on and keep the default config?

tabern avatar Jun 07 '21 15:06 tabern

@shixuyue we manually modify coredns and kube-proxy (fully uninstall aws-vpc-cni and use the Helm chart) and so would disable the add-ons from making any changes.

@tabern we'd want to create a bare cluster without kube-proxy, coredns & aws-vpc-cni and then install these ourselves from Helm charts. This gives us full control and allows us to have common patterns across cloud providers; knowing what each change is going to do and having this documented as code is essential to us. We already do this for aws-vpc-cni but haven't got working charts for the other two.

stevehipwell avatar Jun 07 '21 15:06 stevehipwell

Is there any news on this?

stevehipwell avatar Aug 26 '21 20:08 stevehipwell

I'm posting my workaround for anyone who might find it useful: run a bootstrap script on a fresh cluster to remove the default resources.

#!/usr/bin/env bash

set -euo pipefail

# Default to the current kube context when --context is not given;
# otherwise `set -u` aborts on the unset variable below.
context=""

while test $# -gt 0; do
  case "$1" in
  -h | --help)
    echo " "
    echo "options:"
    echo "-h, --help            show brief help"
    echo "--context             specify kube context"
    exit 0
    ;;
  --context)
    shift
    if test $# -gt 0; then
      context=$1
    else
      echo "no kube context specified"
      exit 1
    fi
    shift
    ;;
  *)
    break
    ;;
  esac
done

for kind in daemonset clusterRole clusterRoleBinding serviceAccount; do
  echo "deleting $kind/aws-node"
  kubectl --context "$context" --namespace kube-system delete $kind aws-node
done

for kind in customResourceDefinition; do
  echo "deleting $kind/eniconfigs.crd.k8s.amazonaws.com"
  kubectl --context "$context" --namespace kube-system delete $kind eniconfigs.crd.k8s.amazonaws.com
done

for kind in daemonset serviceAccount; do
  echo "deleting $kind/kube-proxy"
  kubectl --context "$context" --namespace kube-system delete $kind kube-proxy
done

for kind in configMap; do
  echo "deleting $kind/kube-proxy-config"
  kubectl --context "$context" --namespace kube-system delete $kind kube-proxy-config
done

for kind in deployment serviceAccount configMap; do
  echo "deleting $kind/coredns"
  kubectl --context "$context" --namespace kube-system delete $kind coredns
done

for kind in service; do
  echo "deleting $kind/kube-dns"
  kubectl --context "$context" --namespace kube-system delete $kind kube-dns
done

for kind in storageclass; do
  echo "deleting $kind/gp2"
  kubectl --context "$context" delete $kind gp2
done

for kind in psp; do
  echo "deleting $kind/eks.privileged"
  kubectl --context "$context" delete $kind eks.privileged
done

for kind in clusterrole; do
  echo "deleting $kind/eks:podsecuritypolicy:privileged"
  kubectl --context "$context" delete $kind eks:podsecuritypolicy:privileged
done

for kind in clusterrolebinding; do
  echo "deleting $kind/eks:podsecuritypolicy:authenticated"
  kubectl --context "$context" delete $kind eks:podsecuritypolicy:authenticated
done

Note that the script deletes the default storageclass and psp as well, remove these parts if you don't manage these resources yourself.

dudicoco avatar Aug 29 '21 15:08 dudicoco

@tabern any more news on this feature?

stevehipwell avatar Oct 06 '21 13:10 stevehipwell

@tabern is there any feedback regarding this?

pierluigilenoci avatar Nov 16 '21 12:11 pierluigilenoci

This is a really frustrating situation, to be honest. The official way now, I guess, is to use add-ons, but the lack of customization makes them useless, especially with kube-proxy metrics being bound to localhost, making it unusable with Prometheus. It also kind of contradicts providing the official aws-vpc-cni chart while not providing even manifests for coredns/kube-proxy. And new clusters are provisioned with vpc-cni 1.7.5, while the latest chart has 1.9.3 and the AWS console and amazon-vpc-cni-k8s have 1.10.1 🤷🏻‍♂️

Managed solutions are supposed to "just work", not "and also do the following steps manually because we hate automation at our internal team".

Ranting aside, I think if a cluster is created bare, without managed add-ons, then they must be installed before adding nodes: I just had an issue where I messed up the subnets and aws-node was unable to start due to a lack of free IPs, resulting in nodes stuck in NotReady, the managed nodegroup status 'Creation Failed', and a failed terraform apply. So ideally add-ons should either allow the necessary level of customization so we can fully manage them via terraform/eksctl/etc., or be easy to adopt/replace so we can fully manage them via GitOps.

tbondarchuk avatar Nov 28 '21 12:11 tbondarchuk

Just sharing to maybe help while AWS doesn't have an official way to customize add-ons. My use case is adding nodeSelectors and tolerations to coredns to ensure they get scheduled on mission-critical nodes.

I got the yaml files of the coredns deployment and configmap directly from the cluster and created a Helm template. After that, I edited the resources and added the labels and annotations to "trick" Helm into thinking it manages the resources that already exist on the cluster, then used kustomize through post-rendering to customize the template and add the nodeSelectors and tolerations.

This allows me to decouple my changes from AWS: whenever I have to update the yamls following this approach, I only have to update my Helm template, and in the future, if AWS releases their own version of the Helm template, I can safely swap out my own implementation while keeping the specific needs of my environment.
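The post-rendering step above can be wired up with Helm's `--post-renderer` flag; a sketch, where the wrapper script name, chart path, and the assumption of a `kustomization.yaml` (listing `helm-output.yaml` as a resource and carrying the nodeSelector/tolerations patch) are all hypothetical:

```shell
# Helm pipes the rendered manifests to the post-renderer's stdin and
# applies whatever the script prints to stdout
cat > kustomize-wrapper.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
cat > helm-output.yaml   # capture Helm's rendered manifests
kustomize build .        # emit the patched manifests back to Helm
EOF
chmod +x kustomize-wrapper.sh

helm upgrade --install coredns ./my-coredns-chart \
  --namespace kube-system \
  --post-renderer ./kustomize-wrapper.sh
```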

Fabianoshz avatar Jan 11 '22 15:01 Fabianoshz

Why, when we create a new cluster, do we get the vpc-cni and coredns pods even though they aren't managed by any versioning/package system? I don't even see them when looking at installed add-ons. Having to remove these in such a manual way, with no option to init a cluster without these resources, is a HUGE drawback. Bash scripts to work around this are not a feature.

mathewmoon avatar Jan 24 '22 16:01 mathewmoon

@tabern any updates?

Hokwang avatar May 04 '22 09:05 Hokwang

@tabern if the aws-vpc-cni add-on could match the output from the Helm-deployed aws-vpc-cni, a lot of people would no longer be seeking this. It requires annotations and labels to be updated (so that Helm will accept ownership). While I do believe allowing customers to choose their own adventure for add-ons is a good long-term goal, this would be a nice quick win for a lot of people here.

Today using terraform we must either split the automation into two steps with a manual intervention in between or get into some pretty ugly custom workflow. My group is specifically just trying to configure custom networking as a component of all new cluster builds.

cdobbyn avatar May 26 '22 22:05 cdobbyn

@cdobbyn While I agree that updating labels and annotations creates a quick workaround, there are already ways to hack around this problem. IMO the topic of this thread should stay focused on the issue that the API should support a bare cluster. Making changes that merely make workarounds more convenient just obscures the real objective, which is making EKS non-opinionated about what services are installed and how.

mathewmoon avatar May 28 '22 17:05 mathewmoon

@mathewmoon I agree with the goal. EKS clusters should as an advanced option allow us to deploy them bare. I suspect they deploy them with some basics for newcomers.

My comment was simply to offer a comment on a quick-win in case detaching these components is more complicated than we know. Re-reading it I recognise it appears as though I wish to alter the course of this issue (I do not).

cdobbyn avatar Jun 04 '22 02:06 cdobbyn