Improve Operator Pod Mutation Observability
Component(s)
auto-instrumentation
Is your feature request related to a problem? Please describe.
I'm currently working on rolling out the OpenTelemetry Operator across all of the Kubernetes (OpenShift) clusters in our environment. The ability to auto-instrument our application workloads will become crucial to supporting our systems. If something happens to the operator that results in pods NOT getting auto-instrumented, we'd potentially be "flying blind".
I'd like finer-grained insight into the counts of auto-instrumentation attempts and failures so that we can build proper alerting (SLOs).
Describe the solution you'd like
Instrument the pod mutator to create/increment metrics indicating that a pod contains the instrumentation annotation and is therefore subject to auto-instrumentation. Some initial ideas on the types of scenarios/metrics to expose (a rough sketch follows the list):
- pod contains an instrumentation/sidecar annotation (which may or may not be valid config) -> increment a counter indicating "the pod mutator will attempt to process this pod"
- pod contains an invalid "inject" type -> mutation didn't happen; increment a counter for this scenario
- pod contains an invalid instrumentation or sidecar reference in the annotation value -> mutation didn't happen; increment a counter for this scenario
- pod contains a valid instrumentation or sidecar annotation/reference, but an unexpected error occurred -> mutation failed; increment a counter for this scenario
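To make this concrete, here is a minimal sketch of what the counters could look like, assuming Prometheus collectors registered with controller-runtime's global metrics registry (which the operator already serves on its /metrics endpoint). All metric, label, and package names below are placeholders for illustration, not an agreed-upon convention:

```go
package podmutation

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// Incremented whenever a pod carries an instrumentation/sidecar
	// annotation, i.e. the pod mutator will attempt to process it.
	mutationAttempts = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "opentelemetry_operator_pod_mutation_attempts_total",
		Help: "Pods with an instrumentation or sidecar annotation that the pod mutator attempted to process.",
	})

	// Incremented when a mutation is skipped or fails; the "reason" label
	// distinguishes the scenarios above (invalid inject type, invalid
	// instrumentation/sidecar reference, unexpected error).
	mutationFailures = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "opentelemetry_operator_pod_mutation_failures_total",
		Help: "Pod mutations that were skipped or failed, keyed by reason.",
	}, []string{"reason"})
)

func init() {
	// controller-runtime exposes everything in this registry on the
	// operator's existing /metrics endpoint.
	metrics.Registry.MustRegister(mutationAttempts, mutationFailures)
}
```

Each failure branch in the mutator's admission path would then do something like `mutationFailures.WithLabelValues("invalid_inject_type").Inc()`, which keeps the scenarios distinguishable by a reason attribute rather than by separate metric names.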
I know some of these scenarios may be visible in container or Kubernetes logs, but managing a fleet of operators across multiple clusters is much easier with aggregate metrics we can feed into our alerting infrastructure.
Describe alternatives you've considered
I'm currently leveraging the metrics exposed by the Kubernetes API server's admission controller to see the counts of webhook invocations sent to mpod.kb.io, and they do provide some insight, but not all pod creations are eligible for OTel instrumentation (i.e. they may or may not carry the instrumentation.opentelemetry.io annotations).
Additional context
No response
I think we need to take a look into this, because it's not the first time we've talked about it. I'll add a note so it can be discussed during the next SIG.
We discussed this during the SIG meeting on 13.02.2025 and agreed that this would be a desirable feature. There are some performance concerns around reporting the number of Pods that should be instrumented but aren't; however, simply counting errors as they happen should be fine.
What we need to do next is propose names for the new metrics and attributes. If anyone has suggestions, feel free to post them in this issue.
I want to try this. I've observed that auto-injection injects environment variables through an init container. Does this involve init container metrics (e.g., for a failed variable injection)?
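If init-container-related failures are worth tracking, one option (purely a sketch reusing the placeholder counter from the earlier example, with a hypothetical reason value) would be to fold them into the same failure counter rather than adding a separate metric:

```go
package podmutation

// Hypothetical call site inside the mutator, reusing the mutationFailures
// counter from the sketch above; the reason value is a placeholder.
func recordInitContainerInjectionFailure() {
	mutationFailures.WithLabelValues("init_container_injection_failed").Inc()
}
```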
We had an issue where all our Deployments were correctly annotated with the instrumentation.opentelemetry.io/inject-java annotation, and all the pods carried the annotation as well, yet the Instrumentation wasn't applied even though it existed in the namespace.
The Auto-instrumentation section of the Troubleshooting guide did point us to the root cause (the Deployment was older than the Instrumentation resource), but it still wasn't obvious at first why instrumentation wasn't being applied, or whether there were any logs or alerts that could have given us that insight.