Data Transport Cert Secret Size Overrun With Big Scale Out
Bug Report
What did you do?
- Attempted to scale the data replicas to 250.
What did you expect to see?
- A successful scale-up.
What did you see instead? Under which circumstances?
- It appears that the ECK operator overflows the maximum Kubernetes Secret size (1 MiB) for the transport certs once the data node set grows to roughly 250 nodes or more.
- The operator gets stuck in a scale-up loop while it tries to reconcile the cert Secret, and even after scaling back down it does not seem to recover:
"Secret "elasticsearch-XXX-es-data-es-transport-certs" is invalid: data: Too long: must have at most 1048576 bytes" error
Failed remediations
Environment
- ECK version: 2.8.0
- Kubernetes information:
  - Cloud: GKE v1.26.3-gke.1000
- kubectl version: v1.27.2
- Resource definition:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-myapp
spec:
  version: 8.6.1
  http:
    tls:
      selfSignedCertificate:
        disabled: true
  nodeSets:
  - config:
      action:
        auto_create_index: false
      node.roles:
      - master
    count: 3
    name: election
    podTemplate:
      metadata:
        annotations:
          linkerd.io/inject: enabled
        labels:
          ec.ai/component: elasticsearch
          ec.ai/component_group: myapp-service
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cloud.google.com/gke-spot
                  operator: DoesNotExist
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: elasticsearch.k8s.elastic.co/cluster-name
                    operator: In
                    values:
                    - elasticsearch-myapp
                topologyKey: topology.kubernetes.io/zone
              weight: 100
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: elasticsearch.k8s.elastic.co/cluster-name
                  operator: In
                  values:
                  - elasticsearch-myapp
              topologyKey: kubernetes.io/hostname
        automountServiceAccountToken: true
        containers:
        - name: elasticsearch
          resources:
            limits:
              cpu: "2"
              memory: 5Gi
            requests:
              cpu: "1"
              memory: 5Gi
        initContainers:
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          image: busybox:1.28
          name: sysctl
          securityContext:
            privileged: true
        - command:
          - sh
          - -c
          - bin/elasticsearch-plugin install --batch analysis-icu
          name: analysis-icu
        - command:
          - sh
          - -c
          - bin/elasticsearch-plugin install --batch repository-gcs
          name: repository-gcs
        priorityClassName: app-critical-preempting
        serviceAccount: myapp-elasticsearch
        serviceAccountName: myapp-elasticsearch
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 8Gi
        storageClassName: standard-rwo
  - config:
      action:
        auto_create_index: false
      node.roles:
      - data
    count: 200
    name: data
    podTemplate:
      metadata:
        annotations:
          linkerd.io/inject: enabled
        labels:
          ec.ai/component: elasticsearch
          ec.ai/component_group: myapp-service
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: node_pool
                  operator: In
                  values:
                  - n2d-custom-8-65536
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: elasticsearch.k8s.elastic.co/cluster-name
                    operator: In
                    values:
                    - elasticsearch-myapp
                topologyKey: topology.kubernetes.io/zone
              weight: 100
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: elasticsearch.k8s.elastic.co/cluster-name
                  operator: In
                  values:
                  - elasticsearch-myapp
              topologyKey: kubernetes.io/hostname
        automountServiceAccountToken: true
        containers:
        - name: elasticsearch
          resources:
            limits:
              cpu: "7"
              memory: 56Gi
            requests:
              cpu: "7"
              memory: 56Gi
        initContainers:
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          image: busybox:1.28
          name: sysctl
          securityContext:
            privileged: true
        - command:
          - sh
          - -c
          - bin/elasticsearch-plugin install --batch analysis-icu
          name: analysis-icu
        - command:
          - sh
          - -c
          - bin/elasticsearch-plugin install --batch repository-gcs
          name: repository-gcs
        priorityClassName: app-high-preempting
        serviceAccount: myapp-elasticsearch
        serviceAccountName: myapp-elasticsearch
        tolerations:
        - effect: NoSchedule
          key: n2d-custom-8-65536
          operator: Equal
          value: "true"
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
        storageClassName: standard-rwo
- Logs:
Continuous loop of reconciliation failures and timeouts, accompanied by the following:
Secret "elasticsearch-myapp-es-data-es-transport-certs.v1" is invalid: data: Too long: must have at most 1048576 character
One thing you can do to work around this limitation is to create multiple node sets with the data role and scale each of them up until you start running into the size limitation of Kubernetes Secrets, which seems to kick in at around 150-200 nodes per node set. You can then keep adding node sets until you reach the desired scale (see the sketch below). See this issue for more context on the current model of one transport certificate Secret per node set.
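For illustration, a minimal sketch of that workaround applied to the resource definition above (node set names and counts are examples only; each node set gets its own `<cluster>-es-<nodeset>-es-transport-certs` Secret, so keeping every node set below the threshold keeps every Secret under 1 MiB):

```yaml
# Sketch only: split the single 200-node "data" node set into several smaller ones.
# Each node set gets its own transport-certs Secret, so each one stays under 1 MiB.
spec:
  nodeSets:
  - name: election
    count: 3
    # ... master node set unchanged ...
  - name: data-0
    count: 100          # keep each node set well below the ~150-200 node threshold
    config:
      node.roles:
      - data
    # ... same podTemplate / volumeClaimTemplates as the original "data" node set ...
  - name: data-1
    count: 100
    config:
      node.roles:
      - data
    # ... same podTemplate / volumeClaimTemplates ...
```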
@pebrc are there any plans to address this? It's been several years since the workaround was implemented. We run a very large deployment of many ES clusters (for which this operator has been fantastically helpful), so when adding some of our larger clusters I bumped into this error. Quite a surprise, as you can imagine.
I'm wondering if we could stop reconciling that Secret if we use a CSI driver to manage the certificates, for example? (Or give the user an option to skip the reconciliation of that Secret?)
@barkbay I think that's a good idea.
@nullren we don't have concrete plans to address this right now. Did the workaround, using multiple node sets instead of one big one, have drawbacks for you that made you want to stick with a single node set?
The workaround did "work", but it is a whole lot of unnecessary complexity for something we don't even use (we disable security and don't use the certs at all, since we use our own network framework on k8s). There's just a lot of extra tooling we have to update to ensure that node sets "data-0", "data-1", ..., "data-N" are all found and reconciled correctly. We're still finding bugs due to this.
We have implemented an option to turn off the ECK-managed self-signed certificates in https://github.com/elastic/cloud-on-k8s/pull/7925, which is going to ship with the next release of ECK. This should cover the case you mentioned, @nullren. This means we now have two workarounds for large clusters:
Either:
- split a node set into multiple node sets, or
- disable the transport certs and provision them externally (e.g. with cert-manager); see the sketch below
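For the second option, a hedged sketch of what disabling the ECK-managed transport certificates could look like. I have not verified the exact field name that ships with that PR; I am only mirroring the existing spec.http.tls.selfSignedCertificate.disabled option from the resource above, so check the release docs for the final API:

```yaml
# Assumption: field name inferred from the existing HTTP-layer option
# (spec.http.tls.selfSignedCertificate.disabled) and from PR 7925; verify
# against the ECK release docs before relying on it.
spec:
  transport:
    tls:
      selfSignedCertificates:
        disabled: true
  # Per-node transport certificates then have to be provisioned externally
  # (e.g. with cert-manager) and mounted into the Elasticsearch Pods.
```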
My vote would be to close this issue unless there are additional concerns we did not address with these changes.
@pebrc that works for me. Thank you!