Add instrumentation state information to pods

Open noMoreCLI opened this issue 3 years ago • 8 comments

Hi

When working with auto-instrumentation, I found no quick way to verify for which pods auto-instrumentation was applied successfully, ignored, or failed. I ended up describing pods and looking for init containers.

If the operator added state information to pods when a namespace is annotated for auto-instrumentation, operations and troubleshooting would be much simpler.

For example listing all pods with successful instrumentation: kubectl get pods --all-namespaces -l otel.auto-instrumentation.state=successful
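A minimal Go sketch of the idea, assuming the illustrative label key from the command above (it is not an existing operator label). The `pod` type, `markState`, and `withState` are hypothetical stand-ins, not operator or client-go APIs; they only show how a state label would be written and then selected on:

```go
package main

import (
	"fmt"
	"sort"
)

// stateLabel is the illustrative key from this issue, not a final name.
const stateLabel = "otel.auto-instrumentation.state"

// pod is a hypothetical stand-in for the few Pod fields needed here.
type pod struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// markState records the instrumentation state in the pod's labels.
func markState(p *pod, state string) {
	if p.Labels == nil {
		p.Labels = map[string]string{}
	}
	p.Labels[stateLabel] = state
}

// withState returns the sorted namespace/name of every pod whose
// state label equals the given value, like an equality label selector.
func withState(pods []pod, state string) []string {
	var out []string
	for _, p := range pods {
		if p.Labels[stateLabel] == state {
			out = append(out, p.Namespace+"/"+p.Name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	pods := []pod{
		{Namespace: "shop", Name: "cart"},
		{Namespace: "shop", Name: "checkout"},
		{Namespace: "infra", Name: "proxy"},
	}
	markState(&pods[0], "successful")
	markState(&pods[1], "failed")
	// pods[2] stays unlabeled, so no selector on the state label matches it.
	fmt.Println(withState(pods, "successful"))
}
```

One consequence of a label-based design is that pods which were never considered for instrumentation carry no label at all, so "ignored" would need its own explicit value to be selectable.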

@svrnm & @pavankrish123

noMoreCLI avatar Oct 05 '22 13:10 noMoreCLI

I totally agree.

This is related to the older issue https://github.com/open-telemetry/opentelemetry-operator/issues/544

pavolloffay avatar Oct 10 '22 13:10 pavolloffay

@noMoreCLI is the name "otel.auto-instrumentation.state" for the label final?

akhileshsingh85 avatar Nov 02 '22 23:11 akhileshsingh85

@akhileshsingh85 no, I don't think so. I used it as an illustrative example. Since I anticipate that additional state information could be added in the future, I think a structured approach could make sense, but I am open to comments.

noMoreCLI avatar Nov 04 '22 08:11 noMoreCLI

Not an expert on K8s, but from an end user perspective it would be good to have a few kinds of information:

  • On which workloads have we tried to apply auto-instrumentation (and maybe which one)? That way I can identify whether a workload was skipped or the wrong auto-instrumentation was applied for whatever reason.
  • Was the auto-instrumentation successful (YES, NO, PENDING, ...)?
    • if NO, some error details if possible
    • if YES, we are good
    • if PENDING, there seems to be something blocking, etc.
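
The states listed above could be modeled roughly as follows. This is only a sketch: the `State` and `Result` types, their field names, and the state values are hypothetical and do not correspond to anything in the operator today.

```go
package main

import "fmt"

// State is a hypothetical per-workload instrumentation outcome.
type State string

const (
	StatePending    State = "Pending"    // something is still blocking
	StateSuccessful State = "Successful" // YES
	StateFailed     State = "Failed"     // NO, see Detail
)

// Result is a hypothetical record of one instrumentation attempt.
type Result struct {
	Workload string
	Language string // which auto-instrumentation was attempted, e.g. "java"
	State    State
	Detail   string // error detail, only meaningful for StateFailed
}

// Summary renders a one-line description of the attempt.
func (r Result) Summary() string {
	switch r.State {
	case StateSuccessful:
		return fmt.Sprintf("%s: %s instrumentation applied", r.Workload, r.Language)
	case StateFailed:
		return fmt.Sprintf("%s: %s instrumentation failed: %s", r.Workload, r.Language, r.Detail)
	case StatePending:
		return fmt.Sprintf("%s: %s instrumentation pending", r.Workload, r.Language)
	default:
		return fmt.Sprintf("%s: unknown state %q", r.Workload, r.State)
	}
}

func main() {
	results := []Result{
		{Workload: "shop/cart", Language: "java", State: StateSuccessful},
		{Workload: "shop/checkout", Language: "python", State: StateFailed, Detail: "example error"},
		{Workload: "infra/proxy", Language: "nodejs", State: StatePending},
	}
	for _, r := range results {
		fmt.Println(r.Summary())
	}
}
```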

svrnm avatar Nov 07 '22 08:11 svrnm

Would it be better for some of these to use the InstrumentationStatus field, so that the information is in one spot vs applied over (possibly many) workloads?

damemi avatar Nov 07 '22 12:11 damemi

Correct, the information should be in the status field.

pavolloffay avatar Nov 07 '22 12:11 pavolloffay

> Correct, the information should be in the status field.

As a starting point then, what about just adding status conditions to the CRD? https://sdk.operatorframework.io/docs/building-operators/golang/advanced-topics/#manage-cr-status-conditions

I see that CollectorStatus implemented its own fields, but I think at least the conditions @svrnm suggested could be expressed with the upstream type. Willing to help with this.

damemi avatar Nov 07 '22 13:11 damemi

I opened a POC PR at https://github.com/open-telemetry/opentelemetry-operator/pull/1228, happy to discuss other possibilities or suggestions.

damemi avatar Nov 07 '22 21:11 damemi