opentelemetry-helm-charts
Helm chart for Kubernetes metrics quickstart
Many Prometheus and Kubernetes users are familiar with the kube-prometheus-stack chart, which aims to quickly set up and manage a Prometheus and Grafana installation that collects most of the Kubernetes metrics available. It achieves this using the Prometheus Operator and the ServiceMonitor and PodMonitor custom resources that configure a user's Prometheus scrape config. We have the ability to do the same using the OpenTelemetry Operator and the Target Allocator. In order to provide an easy and familiar migration path to existing (or new) Prometheus and Kubernetes users, I created the kube-otel-stack chart, which installs a pre-configured collector and target allocator that dynamically discover ServiceMonitor and PodMonitor custom resources to scrape various Kubernetes metrics. You can see below some of the metrics this collector is scraping.
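For illustration, here is a minimal sketch of the kind of OpenTelemetryCollector resource such a chart installs. The names, endpoint, and pipeline details are illustrative rather than the chart's exact manifest (the real one also needs RBAC so the target allocator can read the Prometheus CRs):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: kube-otel-stack-metrics   # illustrative name
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true   # discover ServiceMonitor/PodMonitor CRs dynamically
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs: []   # filled in by the target allocator at runtime
    exporters:
      otlp:
        endpoint: my-backend:4317   # placeholder backend
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlp]
```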

This has since become a requested feature across the OpenTelemetry Slack from what I can tell, as I've DMed this chart to at least 3 different people at this point. I was wondering whether it would be welcome for me to clean up this slightly opinionated Helm chart, make it more generic, and donate it to the repository.
Other options considered
- Add a new preset to the existing collector chart. I decided not to do this for two reasons:
  - My chart utilizes (and requires) the target allocator's CRD discovery functionality, which in turn requires the Operator to run. The CRD functionality of this chart is also one of its biggest benefits, as it allows users coming from an existing Prometheus installation to migrate easily.
  - Even without using the CRD functionality, the scrape configs required are very long and may not work for all users without some tweaking.
- Finish writing #336 and then add the functionality in as a preset. I decided not to do this for two reasons:
  - As I've mentioned in #334, completing that PR is relatively difficult given the logic the operator is doing to generate the Target Allocator's configuration, and I don't currently have time to implement it.
  - Given I already have the kube-otel-stack chart created and people seem interested in it, this is a much lower lift to solve users' issues.
TODO
- [x] #1075
- [x] #1076
- [x] #1077
- [x] #1078
- [x] #1079
- [ ] #1080
- [ ] #1081
I've seen O(tens) of requests for this on the OpenTelemetry slack channels. Having it in the community would be great, as we could promote its adoption more widely.
I am certainly interested in this if users are. A couple of questions:
- @jaronoff97 if this chart was accepted, would you be available as a CodeOwner for the chart?
- Is there anything specific to Lightstep that would need to be stripped out, or can the entire chart be taken verbatim?
- Is the chart testable via chart-testing?
- What has the upkeep of the chart been like? Is it relatively stable (except for operator bumps)?
Thanks for your questions :)
- Yes, happy to be a codeowner for it.
- Yes, I would generalize anything that is LS-specific in the PR I would make to the repo.
- I'm not sure how chart-testing works (never used it before). I think as long as we could install the operator as part of the testing flow, it should be fine? (See the sketch after this list.)
- Relatively stable; occasionally there's a small change here and there. I'd imagine we'd get some more requests as more people use this, but it shouldn't be changing too drastically.
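For the chart-testing question, a minimal ct config might look like the following. The keys are standard chart-testing options, but the values here are illustrative rather than this repo's actual settings, and the operator would still need to be installed in CI before `ct install` runs:

```yaml
# ct.yaml -- illustrative chart-testing config (values are assumptions,
# not this repo's actual settings)
target-branch: main
chart-dirs:
  - charts
helm-extra-args: --timeout 600s
check-version-increment: true
```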
I really like this idea, but I have a question: is there a plan to move away from kube-state-metrics, node-exporter, etc. in favour of OTel Collector native receivers (k8sclusterreceiver and hostmetrics)?
I think in general we should strive to collect all the prometheus metrics from k8s components, but not use any of the Prometheus ecosystem components and use Collector's native features :)
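For comparison, collecting roughly the same data with native components would look something like this in the Collector config. This is a minimal sketch: the scraper selection and intervals are illustrative, and hostmetrics would typically run in a DaemonSet-mode collector with host mounts:

```yaml
receivers:
  k8s_cluster:          # cluster-level metrics, akin to kube-state-metrics
    collection_interval: 30s
  hostmetrics:          # node-level metrics, akin to node-exporter
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      filesystem:
      network:
service:
  pipelines:
    metrics:
      receivers: [k8s_cluster, hostmetrics]
      exporters: [otlp]   # placeholder exporter
```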
@jaronoff97 I'm also curious whether your chart handles the installation of the operator and the OpenTelemetryCollector object like discussed here: https://github.com/open-telemetry/opentelemetry-helm-charts/issues/69
I have been using this chart for 3 weeks; it works out of the box, but it will need to be improved (of course). It brings almost the same functionality as the Prometheus Operator with the kube-prometheus-stack chart. It is much more lightweight, as you only deploy "agents" to scrape your logs/metrics/traces. I am using it to send metrics to AWS AMP (managed Prometheus).
Here are the main issues I encountered so far:
- If I use the "statefulset" deployment and one of the Availability Zones goes down (1/3), I lose the scraping on 1/3 of the targets :(
- I did not know which version of the Prometheus CRDs to install; it would help to document which version is supported by the targetAllocator.
Thanks for the good work.
updates/context setting: @TylerHelmuth I still want to donate this if that's still okay. I've validated with a few other people that this would be a great thing for the community to have. The only blocker for this work is to figure out if we can install the operator in the same chart which would make for a better experience. My team is going to be investigating this.
@jaronoff97 sounds good. @open-telemetry/helm-approvers please add your thoughts.
I approve. Thanks @jaronoff97
I don't think I agree that we need another chart for this. I'd rather go with adding the TA option to the collector chart.
Also, why do we promote using Prometheus for scraping kubernetes/kubelet metrics instead of using specialized collector receivers that collect metrics compliant with OTel semantic conventions without additional transformations?
I think this would provide a bridge for existing kube-prometheus-stack users that otherwise would not care to switch (afaik Prometheus is still used in ~99.x% of Kubernetes deployments for cluster monitoring). Reusing the existing Prometheus-Operator objects would smooth out that migration.
I also see value in a "transition" chart. Long term (like long long term), I think a need for a chart like this diminishes, but for users today who have extensive Prometheus setups but want to try out OTel or start transitioning to OTel I think this chart fits their needs.
Ok, I'm not blocking it. If most @open-telemetry/helm-approvers think it's a good addition, let's add it
The name should somehow reflect the Prometheus bridge/transition; kube-otel-stack doesn't seem right to me.
Could also be cool to include somewhere how to grab the same telemetry using the collector and its components.
I'm not sure how this transitioning chart would work. Should we assume that a user installed kube-prometheus-stack and we try to somehow migrate them from that to this chart?
I was thinking of having kube-otel-stack initially work like kube-prometheus-stack, collecting metrics using Prometheus, but slowly we could refactor it to use native OpenTelemetry Collector receivers and functionality.
> I'm not sure how this transitioning chart would work. Should we assume that a user installed kube-prometheus-stack and we try to somehow migrate them from that to this chart?
We should probably assume that the majority of admins scrape their k8s api endpoints with Prometheus via prometheus-operator objects like Service/PodMonitor that we can reuse with this stack.
As such a user, initially I would have both Prometheus and otel collector scraping this data and comparing the results/setup complexity before making any decision.
I would also see this as a 'transition' chart, but the migration path to me is something like...
kube-prometheus-stack -> kube-otel-stack -> opentelemetry-operator
In the (admittedly, kinda far?) future, I can see the operator using native OpenTelemetry components and monitoring CRDs to perform the same basic functions as this stack, but in the short-to-medium term, having this in the org will give us a pat answer for "how should I monitor k8s with OpenTelemetry?"
Hi, quick bump on this issue: one pretty common piece of feedback we got at KubeCon EU was the number of people who didn't know the operator existed. I believe getting this chart brought in would help a lot with that, as we could then signpost it from the docs as a "how to get started with Kubernetes".
@dmitryax is there anything else we're waiting on before accepting PRs adding this chart?
@TylerHelmuth I think this issue is still a blocker. I'm going to run some tests right now to track this down and solve it.
Okay, after a little mish-moshing of things... I was able to get a chart that installs cert-manager (a requirement of the operator), the operator, and a collector together in a single chart. The problem is that it doesn't all install at once, for a few reasons.
Option where we install cert-manager with the chart
TL;DR there are some race conditions and annoyances here
First installation
In order for the first installation of the chart to work, you need to set the operator's admission webhook to false. This is because Helm installs resources in a particular order (here), and if you attempt to install cert-manager and the operator simultaneously with the webhook enabled, you get the following error:
```
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "Certificate" in version "cert-manager.io/v1", unable to recognize "": no matches for kind "Issuer" in version "cert-manager.io/v1"]
```
This is fine, because we can just initially disable the webhook on otel-operator installation so the otel-operator can come up healthy after the CRDs for cert-manager are installed.
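For reference, the first-install override would look roughly like this (assuming the opentelemetry-operator subchart exposes an admissionWebhooks.create flag; verify the exact key against your chart version):

```yaml
# values.yaml for the FIRST installation only: skip creating the operator's
# admission webhooks so the release can install before cert-manager's CRDs
# and webhook are ready.
opentelemetry-operator:
  admissionWebhooks:
    create: false   # assumed flag name; re-enable on a follow-up upgrade
```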
Second installation
Now we have to re-enable the webhook; applying that again will get you another fun group of errors.
```
$ helm install kube-otel-stack . -f values.yaml
Error: INSTALLATION FAILED: Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.default.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": dial tcp 10.96.94.177:443: connect: connection refused

$ helm upgrade kube-otel-stack . -f values.yaml
Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://kube-otel-stack-cert-manager-webhook.default.svc:443/mutate?timeout=10s": dial tcp 10.96.176.233:443: connect: connection refused
```
These are due to the webhook pods not yet being ready to serve the webhook calls.
Third installation
After waiting maybe ten seconds (instead of being impatient like me), you are able to successfully install the chart in its entirety:
```
$ helm upgrade kube-otel-stack . -f values.yaml --install
Release "kube-otel-stack" has been upgraded. Happy Helming!
NAME: kube-otel-stack
LAST DEPLOYED: Mon Apr 24 11:56:05 2023
NAMESPACE: default
STATUS: deployed
REVISION: 3
```
Option where we assume cert-manager is pre-installed
Given most clusters will already have cert-manager installed, here's what the installation process would look like...
A bit smoother, but still the same webhook race condition at the end
First installation
```
$ helm upgrade kube-otel-stack . -f values.yaml -n kube-otel-stack --create-namespace --install
Release "kube-otel-stack" does not exist. Installing it now.
Error: Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.kube-otel-stack.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": dial tcp 10.96.102.5:443: connect: connection refused
```
Trying again after a few seconds...
```
$ helm upgrade kube-otel-stack . -f values.yaml -n kube-otel-stack --create-namespace --install
Release "kube-otel-stack" has been upgraded. Happy Helming!
NAME: kube-otel-stack
LAST DEPLOYED: Mon Apr 24 12:00:13 2023
NAMESPACE: kube-otel-stack
STATUS: deployed
REVISION: 2
```
Proposed remediations
- It's possible that the operator is responding "ready" too quickly, which would cause this issue (kubernetes issue). If we were to modify the operator's readiness probe on installation, we may be able to fix this.
  - Thought: this is probably the "correct" thing to do, but it's unclear to me whether this will permanently fix the problem.
- Setting the failurePolicy on the MutatingWebhookConfiguration object to Ignore could also solve this on first install.
  - Thought: this is potentially dangerous, as the mutating webhook for setting defaults could silently fail going forward. I tested this theory by setting the following:
```yaml
opentelemetry-operator:
  admissionWebhooks:
    failurePolicy: 'Ignore'
```
The operator and collector installed together successfully! An end user using this chart could just as easily enable the mutating webhook post-install as well, but that's not an ideal experience IMO.
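For completeness, restoring the strict policy afterwards is just a values change on a follow-up upgrade (assuming Fail is the operator chart's default policy):

```yaml
# Once the operator pod is Ready, upgrade with the strict policy restored so
# webhook failures surface as errors instead of being silently ignored.
opentelemetry-operator:
  admissionWebhooks:
    failurePolicy: 'Fail'
```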
I would love to hear thoughts on this, and see if there's anything I missed in my findings here. cc @open-telemetry/helm-maintainers
For the cert-manager dependency, my preference would be to copy whatever pattern kube-prometheus-stack is using. If we can't install cert-manager as part of the chart install, that will at least follow our existing pattern for the operator, although there is an issue open about that friction: https://github.com/open-telemetry/opentelemetry-helm-charts/issues/550
> Setting the failurePolicy on the MutatingWebhookConfiguration object to Ignore
When I investigated this a while ago this is the solution I stumbled upon and I believe it is the solution that kube-prometheus-stack uses.
Looking at what kube-prometheus-stack does right now:
It looks like it's configurable (obviously). Its default behavior is empty and enabled, which means the policy is going to be set to Ignore, so I think that seems reasonable for us to do.
They also recommend pre-installing cert-manager on a cluster to use these webhooks.
Seeing as the chart is trying to provide the same value, I think it makes sense to follow the same technical patterns as well.
Agreed. I can work on it this week and next week to match those expectations. I'll include some docs about these decisions as well.
> I believe it is the solution that kube-prometheus-stack uses.

Yes, indeed:
https://github.com/prometheus-community/helm-charts/blob/2f23626a3e7866b2334c53b37aac8b7c156b691f/charts/kube-prometheus-stack/templates/prometheus-operator/admission-webhooks/mutatingWebhookConfiguration.yaml#L20
https://github.com/prometheus-community/helm-charts/blob/2f23626a3e7866b2334c53b37aac8b7c156b691f/charts/kube-prometheus-stack/values.yaml#L2079
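Paraphrasing the linked values, the kube-prometheus-stack pattern is roughly the following (exact keys and defaults are per the linked commit, so treat this as a summary rather than a verbatim copy):

```yaml
prometheusOperator:
  admissionWebhooks:
    # an empty failurePolicy combined with patch.enabled=true means the
    # template falls back to Ignore on install
    failurePolicy: ""
    patch:
      enabled: true
```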
Is this something someone is still working on? Given how complex the whole ecosystem was for me to grasp starting out, what would make the most sense from my perspective is to have some way to add presets to the OpenTelemetry Operator.
IMO if someone wants to plug OTel into their cluster, most likely they'll want the ability to get:
- Traces
- Pod metrics
It would be ideal if the default setup of the operator easily allowed you to get a setup like the one Honeycomb suggests in their getting started guide.
@ferrucc-io yes, I'm still working on this; I've had a whole slew of other priorities that keep taking precedence.