Operator and cluster licensing as a unit
Background requirements / constraints / input:
- Operators and clusters have a 1:1 relationship when it comes to licensing, e.g. an enterprise operator manages enterprise clusters.
- Need to make sure each cluster is only managed by a single operator at a time (no conflicts)
- Be able to go from a basic to an enterprise license and back (soft requirement on the operator itself, required for clusters; being able to change the license on a cluster without re-indexing, downtime, etc. is important).
- Users shouldn't have to think about licensing by default before deploying a cluster (soft requirement, but would be nice from a UX perspective): you get what the operator is licensed with.
Proposal:
- An optional setting which gives the operator an identifier. This defaults to `default`.
To note: It would be a semantic error to deploy two overlapping operators in the same cluster with the same operator identifier, defaulted or not -- this results in conflicts.
To note 2: Having leader election between these operators is not exactly what we'd like, because it's a semantic error, and having one of them run while the other is on hold is not desired (e.g. re-election potentially changes behavior). Should both of them be stopped somehow, requiring an admin action to be taken? What would the procedure to resolve a conflict be in order to resume operation?
- An optional annotation that can be used to specify which operator should be managing the resource. This defaults to `default`.
Example:
Two operators running with two different licenses, watching the same (or overlapping) namespaces.
## ECK Operator, Enterprise licensed (uses default identifier: default)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elastic-operator
#...
---
## ECK Operator, Basic licensed (custom identifier: my-basic)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elastic-operator-basic
#...
spec:
  template:
    spec:
      containers:
      - image: <OPERATOR_IMAGE>
        name: manager
        env:
        - name: OPERATOR_IDENTIFIER
          value: my-basic
        #...
      #...
    #...
  #...
#...
---
## Elasticsearch cluster
apiVersion: elasticsearch.k8s.elastic.co/v1beta1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
  annotations:
    ## if not set, is "default"
    #operator.k8s.elastic.co/managed-by: default
    ## can specify which operator:
    operator.k8s.elastic.co/managed-by: my-basic
spec:
  version: 7.4.0
Not sure I follow the UX here. If I change the operator identifier to something other than default, will I also have to explicitly change the `managed-by` annotation for every cluster? If so, I think that's untenable for the user. I'd assume that this annotation is just set by the operator when creating the cluster?
If I change the operator identifier to something other than default, will I also have to explicitly change the managed-by annotation for every cluster?
Yes, but why would you want to do that? The operator identifier in my mind is almost immutable. We can discuss whether we want to use the operator UUID by default (and let the operator populate the annotation); if you need to switch a cluster to a different ECK instance, you would have to override that annotation. But that just moves the UX problem to the point where you have to figure out the UUID of the other ECK instance in order to change the annotation.
Is this just to enable having multiple clusters with different licenses active in the same namespace? Otherwise, isn't using the `--namespaces` flag on the operator sufficient to let people have different clusters with different licenses, as long as they are in different namespaces? That seems reasonable IMO.
This is about moving some clusters from one type of license to another in the easiest possible way, without requiring new pods to be deployed, volume claims to be swapped between pods, etc. I chatted with Njal about it some more; this seems like a reasonable escape hatch, but definitely not something I'd recommend as the main path, which should be using namespaces.
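For context, a minimal sketch of that namespace-based main path (operator names, namespaces, and the exact `--namespaces` invocation are illustrative assumptions; check the flag syntax against the ECK version in use):

## Enterprise-licensed operator, only reconciling resources in the "prod" namespace
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elastic-operator
  namespace: elastic-system
spec:
  template:
    spec:
      containers:
      - name: manager
        image: <OPERATOR_IMAGE>
        args:
        - manager
        - --namespaces=prod
#...
---
## Basic-licensed operator, only reconciling resources in the "dev" namespace
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elastic-operator-basic
  namespace: elastic-system
spec:
  template:
    spec:
      containers:
      - name: manager
        image: <OPERATOR_IMAGE>
        args:
        - manager
        - --namespaces=dev
#...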
Some thoughts and things we may have to investigate if/when we implement this:
- This feature goes beyond the scope of licensing. Until now we had in mind to deploy:
  - one ECK per k8s cluster
  - one/many ECK per namespace in the k8s cluster
  - one/many ECK for N namespaces in the k8s cluster

  This is deploying one/many ECK for a subset of N namespaces, where multiple ECKs can manage multiple subsets in the same namespace.
- Operators generated with Kubebuilder are now supposed to be `Deployments`, meaning that when we upgrade the operator from one version to another, there are actually 2 operators running at some point. To prevent any conflict where a resource would be reconciled by two operators simultaneously, they rely on leader election managed by the operators themselves. In the case where we have multiple operators managing resources in the same namespace, we may have to look deeper into how leader election is implemented and how we can tweak it: probably either disable it, or tweak it in such a way that it does not prevent 2 operators from managing resources in the same namespace.
- The user experience of having to change labels or annotations on the ECK manifest and on the Elasticsearch/Kibana/APM manifests seems a bit complex to me, but I guess we can build whatever scripting mechanism is required.
- It's very important for the user to not set annotations in the wrong order here. Any chance of conflict where 2 instances of ECK would simultaneously manage the same Elasticsearch cluster could be dangerous for that cluster. We would lose our "expectations" guarantee if set in-memory in the first operator but not the second one, which could lead to e.g. removing nodes we should not be removing.
- Operators still rely on a cached reader of the apiserver resources. Annotating an Elasticsearch resource with a particular ECK identifier is not immediately visible to all operators watching the namespace.
- Not exactly sure how the UUID identifier would work if replacement operators start being deployed in different namespaces where the configmap with the UUID may not necessarily exist.
- Does it impact the way users install the enterprise license? Can we have 2 different enterprise licenses installed on 2 different ECKs running in the same namespace?
I'm not saying this is not technically feasible. I'm pretty sure it is as long as we have some kind of control in a migration script. But there are many corner cases to think about here, in order to make sure we don't accidentally break clusters.
The user experience of having to change labels or annotations on the ECK manifest and on the Elasticsearch/Kibana/APM manifests seems a bit complex to me
I think that is not suggested here. @nkvoll's proposal would mean only one annotation to set if you want to assign a cluster to a different ECK instance: the `managed-by` annotation on the Elasticsearch/Kibana/APM resource.
The ECK operator identifier is a startup parameter specified in the ECK deployment/statefulset.
- If a user sets `managed-by` on an ES cluster to some identifier that does not exist, the cluster becomes effectively unmanaged. OK in my view.
- If a user sets multiple ECK instances to the same non-default identifier, one of two things can happen:
  - if all ECK instances with the same non-default identifier also share the same kubebuilder `LeaderElectionID` and `LeaderElectionNamespace` and have `LeaderElection` enabled, then one of them will be elected leader and manage the ES cluster
  - if multiple ECK instances have access to the ES cluster but are otherwise unaware of each other, they will compete to manage the ES cluster, which is an undesirable outcome. Even though they should eventually converge on the same result, the transitions to reach that result are potentially unsafe because of our use of in-memory expectations to compensate for go-client's caching behaviour
It's very important for the user to not set annotations in the wrong order here
See above, there is only one annotation to set, but as you said, caching might create an unsafe time window where two operators think they are responsible for the same cluster.
This leads me to think that we should maybe not over-optimize for what we consider an edge case (i.e. the need to keep some clusters Basic-licensed) and document the procedure so that we say your cluster needs to be in a steady state when you attempt the re-assignment to a different ECK instance.
I think that is not suggested here. @nkvoll 's proposal would mean only one annotation to set
Yes, sorry for the confusion. I meant there is one thing to specify on the Elasticsearch/APM/Kibana side of things, and another thing (that should match) on the ECK side of things (whether it's an annotation, flag, label, or anything).
Trying to summarize a discussion we had on Zoom with Njal about the concerns I mentioned above.
- About operators that concurrently reconcile the same resources:
This is a corner case problem, but it can still cause unexpected behaviours (e.g. setting `minimum_master_nodes` to a wrong value, restarting a node that was already restarted, removing a node we should not remove, etc.).
It is hard to solve within the operator code. One way to solve it is to make sure it never happens. For example, if you need to modify your ECK setup in a way that may cause overlapping reconciliations, you should:
- Delete all ECK operators first
- Annotate your Elasticsearch resources to be handled by a particular operator
- Redeploy all ECK operators properly configured (e.g. a default deployment with the enterprise license, and a specialized deployment for clusters matching a particular annotation)

If the user does it wrong (e.g. decides to annotate a new Elasticsearch cluster at step 4), we cannot guarantee it won't have side-effects. Since this is tricky to do manually, we could (long-term) provide a helper script that takes care of it.
- About the license secret and existing UUID configmap:
We may have 2 operators deployed in the same namespace. Those 2 should use a different license secret and UUID configmap; they cannot use generic `eck-license` and `eck-uuid` resources.
If 2 ECK operators live in the same namespace, they will be 2 different deployments, e.g. `eck-enterprise` and `eck-basic`.
From the operator container itself, we can use the downward API to retrieve the name of the Pod (`eck-enterprise-hfhjhbk`), and derive the name of the deployment, in order to use resources such as `eck-enterprise-license` and `eck-enterprise-uuid` (see the sketch after this list).
This ECK identifier may or may not be used as the annotation for the Elasticsearch/APM/Kibana resources:
operator.k8s.elastic.co/managed-by: elastic-system/eck-enterprise
operator.k8s.elastic.co/managed-by: abstract-thing-that-makes-sense-to-the-user
- About the leader election issue:
It is possible to customize the controller-runtime `LeaderElectionID` to rely on a configmap with a different name. We can probably make that ID the ECK identifier described above.
- About the user experience:
There are multiple things to think about and configure (the name of the ECK deployment, the flag to pass to the specialized ECK to filter on particular clusters, the annotation to set on Elasticsearch/Kibana/APM resources), which reminds me of how hard it is to tweak all the ECK yaml files if you want a complex ECK deployment (multiple ECK operators managing multiple namespaces, with their corresponding RBAC permissions). Long-term, if we end up having a tool to help generate complex yaml manifests, we could also handle the license/annotation thing described here as a part of it. Since we would write the tool, we could ensure there are no overlapping operators and concurrent reconciliations. It sounds like a long-term tool we may need to support complex ECK deployment stories, not something most users would need right now.
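As a side note, here is a minimal sketch of the downward API idea referenced above (the POD_NAME variable name and the name-derivation comment are illustrative assumptions, not existing ECK behaviour):

## Excerpt of an ECK operator pod template using the Downward API (sketch)
spec:
  containers:
  - name: manager
    image: <OPERATOR_IMAGE>
    env:
    - name: POD_NAME                # e.g. "eck-enterprise-hfhjhbk"
      valueFrom:
        fieldRef:
          fieldPath: metadata.name  # Downward API: injects the Pod's own name
    ## The operator could strip the generated suffix to recover "eck-enterprise",
    ## then use "eck-enterprise-license" and "eck-enterprise-uuid" as resource names.
#...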
What if instead of using annotations, we just allow people to specify label selectors in the operator config? That way, if people wanted, they could configure their enterprise ECK operator with `labelSelector: enterprise`, and then it would only reconcile ES instances that had an `enterprise` label (for example). That seems like it would be straightforward to implement and more "native".
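A minimal sketch of what that could look like (the `labelSelector` setting and the `license: enterprise` label key/value are hypothetical, not an existing ECK option):

## Hypothetical operator setting (not an existing ECK flag), shown only to illustrate the idea:
# labelSelector: "license=enterprise"
---
## Elasticsearch cluster the enterprise-licensed operator would reconcile
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
  labels:
    license: enterprise   # hypothetical label matched by the operator's selector
spec:
  version: 7.5.2
#...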
From a user perspective it seems like it would be simpler to have a single operator that is responsible for all namespaces with the ability to specify the license type on the `elasticsearch` and `kibana` resources.
Basic License Example
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  license: basic
  version: 7.5.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
Enterprise License Example
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  license: enterprise
  version: 7.5.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
As an enterprise customer in this scenario, the operator would still be supported, but your clusters are only supported if they are deployed as `enterprise`. This solves a lot of @sebgl's concerns in https://github.com/elastic/cloud-on-k8s/issues/2032#issuecomment-545983026 as well.
@jeffspahr Was there any further discussion regarding your proposal? To me this would make perfect sense and fit our current need.
@oliverbaehler No, it didn't go any further than my last comment. I agree that it would be a good approach! :)
@jeffspahr Should we open a new Issue regarding this matter?
Hi, I was wondering if this is still being actively discussed / has the potential to be developed? Something like @jeffspahr proposed would be really nice for managing environments where you need multiple clusters with different licenses for testing different things.
Just wondering, as I am currently an Enterprise customer: does this mean that in the current implementation I cannot have both licensed and unlicensed clusters running in the same Kubernetes cluster?
@zikphil Currently, unless something has changed recently, if you want licensed + unlicensed Elasticsearch clusters on the same Kubernetes cluster, you will need to have multiple ECK operators on the Kubernetes cluster, with each operator restricted to the namespaces where it should manage resources.
Thanks!
It seems like a pretty common occurrence to have both non-production and production ECK clusters in a given Kubernetes cluster. For our use case, we place each customer in a separate k8s namespace. Each customer gets a dev and a production env. So we are dynamically generating these k8s namespaces, which means we would need to frequently update the list of namespaces that each ECK operator manages, requiring frequent operator restarts and complicating the lifecycle of our customer projects.
I believe a more granular solution to license management is really needed!
Any update regarding this matter? I would agree that the proposal from jeffspahr is a good solution.
Yeah me too, what is the latest here?