cortex-helm-chart
Support zone awareness
To support zone awareness we need to create as many different ingester configurations as there are zones. Currently this is not possible through the helm install.
Hi @rverma-nsl,
Thank you for reporting this issue. Can you elaborate on what you are missing from the helm-chart?
If you look at the ingester_config as provided here https://cortexmetrics.io/docs/configuration/configuration-file/#ingester_config, you will observe that for zone awareness we currently need to provide the AZ per ingester. This means we actually need two different config files and two different deployments of ingesters. I am not sure if this can be handled any other way.
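For reference, the relevant bit is roughly this per-ingester setting (a sketch only; the zone name is an example and the exact nesting should be checked against the linked docs):

ingester:
  lifecycler:
    availability_zone: us-east-1a      # has to differ between the per-zone ingester deployments
    ring:
      zone_awareness_enabled: true
      replication_factor: 3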
Yeah, to be honest that sounds reasonable to me. With Helm you could also use multiple values files, so you could have one file for the general config and then two or more files with very minimal config that is basically just the Cortex AZ settings.
helm install cortex-az1 cortex-helm/cortex -f myconfig.yaml -f az1.yaml
helm install cortex-az2 cortex-helm/cortex -f myconfig.yaml -f az2.yaml
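Sketching that out, an az1.yaml would carry only the zone-specific overrides. This assumes the chart exposes the Cortex config under a config: key and a per-component nodeSelector; both are assumptions here, not confirmed chart values:

# az1.yaml - hypothetical per-zone overrides layered on top of myconfig.yaml
config:
  ingester:
    lifecycler:
      availability_zone: us-east-1a              # zone name is an example
ingester:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a      # pin this release's ingesters to their AZ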
Otherwise I'm not sure how the cortex-helm-chart is supposed to support this, and I also don't think it makes sense.
The bigger issue here is making sure the pods end up in the zone they are configured for, and fortunately the chart supports a nodeSelector label.
Following the example above, the problem starts when you create different deployments (or statefulsets, for that matter) for each zone: they have to have different names, which the chart doesn't support. Say you add support for overriding deployment names; then you run into a problem with the service name and headless service too, because you don't want those to have different names. Now you have to add support for services to take names independent of the deployment names (currently they all take their name from the cortex.fullname helper).
This problem is not limited to ingesters; store gateways and distributors fall into the same category: different deployments are needed in each availability zone. In the end I don't like this approach; it works against the K8s pod scheduler.
Without making any change whatsoever, in the current version of the chart the pods are evenly distributed across AZs, which is great. The only thing that's left is to insert the AZ name into the Cortex config. I'm thinking an init container (which the chart supports adding) could do that.
I'm open to ideas if you wanna give that a go.
@danfromtitan
they have to have different names, which the chart doesn't support
I don't really understand that point. When you use a different Helm release name, the names also change, and with Kubernetes namespaces I don't see any issue. I never deployed multiple AZs, so maybe I'm missing something.
When you use a different Helm release name, the names also change, and with Kubernetes namespaces I don't see any issue
You'd want a single Helm release; otherwise you'd end up with multiple querier and distributor services for the same Cortex instance.
Ah, I see. I feel like it could be in somebody's best interest to have multiple "sets". But for memberlist you would have to configure the memberlist service to pick up multiple AZs (the selector is currently quite strict). I don't think you can do that as of now without manually changing stuff. Either way, it seems this feature is not really supported by the helm-chart. I'm open to any suggestion.
Yeah, memberlist is another one, I use etcd so I don't have that problem.
The idea is to create a dynamic config file for Cortex and add the zone awareness configuration to it from an init container that has read access to pods and nodes through the kube API. That would require adding a cluster role with read access to pods and nodes in the core API group and binding it to the cortex service account (similar to how it was done for the ruler sidecar).
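For illustration, the RBAC piece could be a minimal sketch like this (names and namespace are placeholders for whatever the chart actually generates):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cortex-node-reader        # placeholder name
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cortex-node-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cortex-node-reader
subjects:
  - kind: ServiceAccount
    name: cortex                  # placeholder: the chart's service account
    namespace: cortex             # placeholder namespace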
The init container then would:
- get the node name where the pod's running
- get the AZ name from the node's topology zone label
- update availability zone in config file
For the init container to write a file, the config file would have to live on an emptyDir mount. Currently it comes from a read-only secret or configmap, so we need the ability in the chart to override the config filename, something like:
{{- define "cortex.configFileName" -}}
{{- default "/etc/cortex/cortex.yaml" .Values.configFileName -}}
{{- end -}}
The init container would copy the read-only config that comes from the configmap and replace the AZ name. Thoughts?
I like the approach of parsing topology.kubernetes.io/zone of the current node running the pod. ~However, the config file approach is kinda hacky and the config file would still be the old one in the secret/configmap. I'm not even sure those overridden changes from the init container persist through to the non-init container.~
Wait... what are you trying to do exactly here with the configmap? Are you trying to mount the non-AZ config and then alter and save it as an AZ-aware config for later use by the cortex components?
Do you have some thoughts on this, @kd7lxl?
I think a mutating admission controller might be a better way to inject the zone name into the cortex container arguments or environment. Surely there is an existing admission controller that is capable of injecting a node label into a pod environment, like https://github.com/Collaborne/kubernetes-aws-metadata-service but ideally one that is still maintained. Otherwise, we can write one. Then we can configure Cortex to read that env var.
Related reading: https://github.com/elastic/cloud-on-k8s/issues/3933#issuecomment-950581345
mutating admission controller
~Okay, interesting, so we basically watch for the "node" resource~ (didn't know about that binding thing, even better) and then call our webhook, which will read topology.kubernetes.io/zone of that node and then update our cortex ingesters by setting an env variable? Sounds cool.
Update: we would also have to support a way to configure memberlist with the 2nd and 3rd zones.
Do you wanna give that a go? @danfromtitan
Would be great if we could just use something that already exists
Do you wanna give that a go
I'm looking into it. Do you want the webhook config/deployment added to the chart with a zone_aware_enabled flag around it ?
Would be great if we could just use something that already exists
There ain't any that does exactly what we want here, but I can change an existing one. Either way, the admission controller image is asking for a separate project: how do you want to go about that ?
There is an option to keep everything outside this chart and deploy the admission controller in a separate namespace.
I'm looking into it. Do you want the webhook config/deployment added to the chart with a zone_aware_enabled flag around it ?
:+1: sounds good.
There ain't any that does exactly what we want here, but I can change an existing one. Either way, the admission controller image is asking for a separate project: how do you want to go about that ?
Yeah, I feared this might be the case. I guess I could ask whether we could get a separate repo in the cortexproject space. Then we could use GitHub Actions to build/deploy the Docker image and maybe even use the GitHub registry.
There is an option to keep everything outside this chart and deploy the admission controller in a separate namespace.
I'm all ears. I mean, we could do our part here and maybe have a guide that references the controller, which you would have to set up yourself?
I mean, we could do our part here and maybe have a guide that references the controller, which you would have to set up yourself
That'd be the cleanest approach, the chart project is already overloaded (ruler sidecar is an example that comes to mind).
@kd7lxl - there is one fundamental issue with the idea of using an admission controller: at the time the API request makes it through the controller, there is no decision yet by the scheduler about the worker node where the pod would be deployed, meaning the AZ cannot be determined at that time.
It seems to me an init container is the only way to determine the AZ, because by the time init container is started, the pod has been scheduled to a given node. Without going into nasty hardwiring, I can't think of any other way to tell a cortex container the AZ where it's running.
@kd7lxl - there is one fundamental issue with the idea of using an admission controller: at the time the API request makes it through the controller, there is no decision yet by the scheduler about the worker node where the pod would be deployed, meaning the AZ cannot be determined at that time.
The pod/binding event looked promising. Have you tried that?
It seems to me an init container is the only way to determine the AZ, because by the time init container is started, the pod has been scheduled to a given node. Without going into nasty hardwiring, I can't think of any other way to tell a cortex container the AZ where it's running.
That surely exists already, right? Or would we also have to build that ourselves?
@kd7lxl - there is one fundamental issue with the idea of using an admission controller: at the time the API request makes it through the controller, there is no decision yet by the scheduler about the worker node where the pod would be deployed, meaning the AZ cannot be determined at that time.
For an admission controller, I think the sequence would have to be like this:
- Receive Pod CREATE AdmissionReview - patch the pod with a custom envFrom ConfigMap that does not yet exist
- Pod gets scheduled, but will be blocked from starting until the ConfigMap exists
- Receive ~Pod UPDATE~ Binding AdmissionReview (triggered by the scheduling event that sets the pod's .nodeName) - use the k8s API to get the topology label value of the node
- Store the topology value in the ConfigMap
- Pod starts with envFrom ConfigMap
I haven't tested this to confirm the sequence and events all work. Maybe I missed something.
It seems to me an init container is the only way to determine the AZ, because by the time init container is started, the pod has been scheduled to a given node. Without going into nasty hardwiring, I can't think of any other way to tell a cortex container the AZ where it's running.
An initContainer would work too, but you still have to come up with a way to get the result of the initContainer into the environment of the primary container. I really don't like having to use wrapper scripts to replace a container's entrypoint, nor do I really want to give the initContainer permission to create ConfigMaps.
The reason I like the admission controller approach is that the only thing that needs permission to mutate pods and create config maps is the admission controller, not the app itself.
This is a commonly requested feature: https://github.com/kubernetes/kubernetes/issues/62078 (and others). If we can provide a good solution, I'm sure many people will appreciate it.
We can also use https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/ to ensure that pods are spread across AZs and then use the configmap with the envFrom value, which would be correctly assigned.
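For reference, a spread constraint on the ingester pod template could look roughly like this (the label selector here is illustrative, not the chart's actual labels):

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone     # spread pods evenly across zones
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: ingester    # hypothetical component label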
I'm not sure the API server would run the resource admission through the admission controller twice between steps 3 and 4 as you've described.
In fact, what I see happening at step 3 is that the pod is not blocked from starting; rather, it's created and sits there looping with a "configmap not found" error. Once the error condition is cleared by creating the configmap, there is no UPDATE admission request sent to the controller; the pod simply follows through and completes the start-up sequence. I tried this twice, with a standalone pod and with pods from a deployment, same outcome.
An UPDATE operation could be initiated by an outside request (i.e. a helm chart update or kubectl apply), but not by having the admission controller sit in a loop.
In the subsequent UPDATE operation I can indeed see the nodeName making it into the admission request, but that is the node where the pod runs before the update gets applied. After the update the scheduler might move the pod to a different node (say, because node capacity has been changed by unrelated pods), so we still won't know where the pod lands next.
Provided this would somehow work, there is still the open question of what happens with legitimate deployment updates: the trick with the missing configmap won't work a second time since the resource now already exists.
By contrast, an initContainer is a bit simpler and is guaranteed to work every time since it runs in the same pod (a rough sketch follows the list below):
- create a cluster role with read access to pods and nodes in core API group
- bind the cluster role with cortex service account
- add an init container that makes curl requests to kube API to find the node where the pod's running and get the node topology
- in the same init container, copy the read-only cortex configuration coming from a secret/configmap to a read-write emptyDir volume
- init container updates the zone name in place and terminates. In-memory volume with dynamic configuration remains.
- cortex container starts and loads the configuration from the volatile volume
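As a rough sketch of those steps (the image, paths and label parsing are illustrative only, and the volume names are placeholders; note that the downward API can provide the node name directly, so in this variant only node read access would strictly be needed):

# Hypothetical init container, added via the chart's existing init-container hook
initContainers:
  - name: set-availability-zone
    image: curlimages/curl:8.5.0          # assumption: any small image with curl and sh
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName      # downward API: the node this pod was scheduled to
    command:
      - /bin/sh
      - -ec
      - |
        TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
        # read the node object and pull out the topology.kubernetes.io/zone label
        AZ=$(curl -sS --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
              -H "Authorization: Bearer ${TOKEN}" \
              "https://kubernetes.default.svc/api/v1/nodes/${NODE_NAME}" \
            | sed -n 's|.*"topology.kubernetes.io/zone": *"\([^"]*\)".*|\1|p' | head -n 1)
        # copy the read-only chart config and patch the zone into the writable copy;
        # assumes the base config already contains an availability_zone placeholder
        sed "s|availability_zone:.*|availability_zone: ${AZ}|" /etc/cortex-ro/cortex.yaml \
          > /etc/cortex/cortex.yaml
    volumeMounts:
      - name: config-ro                   # the chart's existing secret/configmap mount
        mountPath: /etc/cortex-ro
      - name: config-rw                   # emptyDir shared with the cortex container
        mountPath: /etc/cortex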
The only update that's needed in the chart is to change the hardcoded configuration file in the cortex container arguments with a parameter value, i.e.
args:
- "-target=compactor"
- "-config.file={{ include "cortex.configFileName" . }}"
where
{{- define "cortex.configFileName" -}}
{{- default "/etc/cortex/cortex.yaml" .Values.configFileName -}}
{{- end -}}
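A values override would then just point the container at the dynamic location, e.g. (path is hypothetical):

configFileName: /etc/cortex-dynamic/cortex.yaml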
use pod-topology-spread-constraints to ensure that pods are spread across AZs
There is no problem spreading pods evenly across AZs with the current anti-affinity rules in the chart. The issue is telling the cortex distributors/queriers where the ingesters run, so that time series can be replicated across different AZs and not be lost in case of a zone failure.
Having read through the part of the Cortex code that implements writing across AZs, I'm not even sure zone awareness is worth the effort of implementing: the distributors merely attempt a fair distribution of time series within the configured replication factor and make no attempt to keep traffic within the same AZ to save on cross-zone data transfer costs.
The pod/binding event looked promising. Have you tried that?
Couldn't find any "pod/binding" resource in the API reference. The implementation of binding a pod to a machine doesn't look like it leaves room for catching/updating the request in an outside admission controller.
https://github.com/elastic/cloud-on-k8s/issues/3933#issuecomment-950581345
It's called pods/binding and the event looked like this when I tested it:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
...
  rules:
    - apiGroups:
        - ""
      apiVersions:
        - v1
      operations:
        - CREATE
      resources:
        - pods
        - pods/binding
      scope: '*'
{
  "kind": "AdmissionReview",
  "apiVersion": "admission.k8s.io/v1beta1",
  "request": {
    "uid": "645135a1-48c9-4550-8ca0-9e5d1343c3d5",
    "kind": {
      "group": "",
      "version": "v1",
      "kind": "Binding"
    },
    "resource": {
      "group": "",
      "version": "v1",
      "resource": "pods"
    },
    "subResource": "binding",
    "requestKind": {
      "group": "",
      "version": "v1",
      "kind": "Binding"
    },
    "requestResource": {
      "group": "",
      "version": "v1",
      "resource": "pods"
    },
    "requestSubResource": "binding",
    "name": "nginx",
    "namespace": "default",
    "operation": "CREATE",
    "userInfo": {
      "username": "system:kube-scheduler",
      "groups": [
        "system:authenticated"
      ]
    },
    "object": {
      "kind": "Binding",
      "apiVersion": "v1",
      "metadata": {
        "name": "nginx",
        "namespace": "default",
        "uid": "75fc040e-c49e-4418-8525-674c4deeabbc",
        "creationTimestamp": null,
        "managedFields": [
          {
            "manager": "kube-scheduler",
            "operation": "Update",
            "apiVersion": "v1",
            "time": "2022-02-08T03:16:50Z",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:target": {
                "f:kind": {},
                "f:name": {}
              }
            }
          }
        ]
      },
      "target": {
        "kind": "Node",
        "name": "kind-control-plane"
      }
    },
    "oldObject": null,
    "dryRun": false,
    "options": {
      "kind": "CreateOptions",
      "apiVersion": "meta.k8s.io/v1"
    }
  }
}
No mutation to this resource is required -- the artifact of this event would be creating a ConfigMap.
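i.e. a per-pod ConfigMap along these lines, consumed via envFrom in the patched pod spec (names and the env var are hypothetical):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-zone                         # hypothetical: one per pod, created by the controller
  namespace: default
data:
  CORTEX_AVAILABILITY_ZONE: us-east-1a     # value read from the node's topology.kubernetes.io/zone label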
@kd7lxl, there are a couple of issues with putting the env vars in a config map:
- different Zone values ask for multiple config maps
- config maps need to be cleaned-up after pods are deleted
Instead of configmaps I was going to explore the idea of adding container env vars from the binding admission request.
Do you know whether the mutation is possible from the binding event? I'm seeing the error below while the pod sits in pending, and I'm not sure whether mutation is not allowed or there is another reason.
Warning FailedScheduling 65s (x2 over 67s) default-scheduler binding rejected: running Bind plugin "DefaultBinder": Internal error occurred: add operation does not apply
NVM, I answered my own question: the mutating admission applies to the object kind in the request, in this case "kind": "Binding", so it would not patch the pod here.
I agree that configmaps are not needed. Labeling the pod works just fine -- no need to create another resource.
I wrote a little proof of concept here: https://gist.github.com/kd7lxl/1997a4d162bd49d4dbdb8f69506a1e01. It does the whole process with initContainers, but I've annotated it with which concerns I'd like to move to an admission controller to be able to reduce the permissions granted to the pod.
I would not return a mutation/patch from the pods/binding AdmissionReview (pretty sure that wouldn't work), rather just use the event to trigger a labeling operation on the pod with a client. It's just slightly simpler than watching all pod events.
I had a peek at some admission webhook examples, and the whole setup can be a bit tricky, especially since you need a proper certificate for your webhook server. We can automate that with something like https://github.com/newrelic/k8s-webhook-cert-manager (maybe not exactly this), but it sure as hell won't make things easier. So I'm asking myself: at what point does this become an entire software suite? :D
A noob question: can't we use Istio, which has zone-aware routing, and reduce the zone-awareness functionality in Cortex? Similar functionality exists in Cilium too.