opentelemetry-specification icon indicating copy to clipboard operation
opentelemetry-specification copied to clipboard

Use Context to stop tracing

Open pauldraper opened this issue 4 years ago • 32 comments

There are circumstances for instrumented actions where:

  1. The action is frequent and of low interest: a healthcheck, polling a message queue, etc.
  2. An OpenTracing exporter uses libraries that themselves may be instrumented (risking infinite tracing).
  3. If the current layer (e.g. RPC) happens to offer sufficiently detailed tracing, lower HTTP/DNS/TCP/UDP layers do not need to be traced when invoked with this library. This need is heighten by the use of server/client spans, which AFAIK needs to remove spans in between them. (#526).

(1) has come up in #173

(2) has been a problem in opentelemetry-js (https://github.com/open-telemetry/opentelemetry-js/issues/332) HTTP-based exporters, since the Node.js stdlib is instrumented globally. The solution in that has was adding a special HTTP header x-opentelemetry-outgoing-request headers, that the http instrumentation ignores. In essence it uses HTTP headers as a poor-man's context API. (This may not scale to other APIs.)


It would be nice to cut off all automatic tracing "below this current scope" by setting a context key. The default tracer implementation would create no-op spans if the context disables spans.

EDIT: Case (1) may call for sampling, rather than full disabling.

pauldraper avatar Mar 28 '20 14:03 pauldraper

OpenTelemetry Ruby does this and I really like the idea

dyladan avatar Mar 28 '20 15:03 dyladan

See the Python solution, which I think is rather elegant: https://github.com/open-telemetry/opentelemetry-python/pull/181 (updated in https://github.com/open-telemetry/opentelemetry-python/pull/395 for OTEP 66).

You introduce a conventional context variable "suppres_instrumentation" that all instrumentations check and the span processors set to true while exporting.

Oberon00 avatar Mar 30 '20 07:03 Oberon00

Why do the instrumentations check? Would you not just have the tracer check the context and return a non-reporting span?

dyladan avatar Mar 30 '20 12:03 dyladan

That would also work, but personally I prefer the more explicit no-magic approach here. Also, the instrumentation overhead can be further reduced if the instrumentation checks manually and shortcuts, instead of doing all the usual instrumentation with a no-op Span. Checking Span.IsRecording would probably work in most cases though.

Oberon00 avatar Mar 30 '20 14:03 Oberon00

I think that is the primary use for the isRecording flag. I just prefer to only have to implement such things in a single place so that there is less risk of forgetting. I'm playing around with such an implementation in JS right now and I set the context in the span processor, then return non-recording spans from the tracer if that context is set. It works well and required very few changes.

dyladan avatar Mar 30 '20 14:03 dyladan

@toumorokoshi @codeboten FYI

Oberon00 avatar Mar 30 '20 15:03 Oberon00

I agree with @dyladan .

(A) The tracer is already managing context (activeSpan, withActiveSpan).

(B) Fewer places is better.

pauldraper avatar Mar 30 '20 22:03 pauldraper

@pauldraper The tracer will stop "managing" context but it will still need to access the context when creating a new span, see #527

Oberon00 avatar Mar 31 '20 07:03 Oberon00

Hi, another opentelemetry-python contributor chiming in, thanks @Oberon00!

Why do the instrumentations check? Would you not just have the tracer check the context and return a non-reporting span?

I think have the tracer check the flag is a good idea. I don't think it's too much magic, and comes at the benefit of not having to require instrumentation authors to implement this pattern. We can control this behavior on the sdk level, which enables stronger guarantees on behavior consistency.

I think that is the primary use for the isRecording flag.

It doesn't look like that's the intention of IsRecording, at least from what I see in the spec:

Returns true if this Span is recording information like events with the AddEvent operation, attributes using SetAttributes, status with SetStatus, etc. There should be no parameter.

This flag SHOULD be used to avoid expensive computations of a Span attributes or events in case when a Span is definitely not recorded. Note that any child span's recording is determined independently from the value of this flag (typically based on the sampled flag of a TraceFlag on SpanContext).

I haven't seen examples of these expensive computations yet, but it seems unrelated to whether the span itself should be exported. I would argue for a separate context variable (suppressInstrumentation to borrow our current python convention).

The additional pro/con of a suppressInstrumentation flag separate from the Span itself is the fact that both metrics and spans can share that value. But maybe, it would be better if we had those as two separate flags as most likely you will only want to suppress one or the other, most likely in the exporter.

So maybe:

suppressTraceInstrumentation and suppressMetricInstrumentation?

toumorokoshi avatar Apr 01 '20 16:04 toumorokoshi

I haven't seen examples of these expensive computations yet

You have to look at everything the instrumentation does for each span. E.g. the sum of all the things that the Python WSGI instrumentation does per request is relatively expensive.

Oberon00 avatar Apr 01 '20 17:04 Oberon00

@toumorokoshi I was saying the primary use of isRecording is to avoid expensive operations to gather span attributes. @Oberon00 had recommended checking the context for instrumentation disabled flag within instrumentations because it would also avoid said expensive computations.

In my opinion the "correct" implementation on the instrumentation side would be:

  1. instrumentation calls startSpan
  2. tracer sees instrumentation disabled, returns a non-recording span
  3. instrumentation sees span is non-recording and avoids expensive operations

This avoids the need for every instrumentation to check this flag. Because there will be many instrumentations, we cannot guarantee none of them will forget.

The flag is set on the context by span processors before they call exporters, or optionally in the exporter. I have no strong opinion on this difference.

If an instrumentation wants to disable the instrumentation of the lower-level protocols, it may also set the flag. e.g. some database driver which uses http to talk to the database may want to disable the http instrumentation in favor of instrumenting the protocol they've built on top of it.

If starting the span itself requires one of these expensive operations, then nothing is preventing the instrumentations from also checking the context flag.

dyladan avatar Apr 01 '20 17:04 dyladan

I haven't seen examples of these expensive computations yet

You have to look at everything the instrumentation does for each span. E.g. the sum of all the things that the Python WSGI instrumentation does per request is relatively expensive.

I think I understand. You're referring to adding a check, per integration, to see if the IsRecording flag is set. if it's false, then we avoid adding attributes altogether, which would skip a non-trivial amount of code to collect and add attributes:

https://github.com/open-telemetry/opentelemetry-python/blob/master/ext/opentelemetry-ext-wsgi/src/opentelemetry/ext/wsgi/init.py#L121

I think if the spec called out explicitly that the resulting span would actually be missing attributes that would have been added in an IsRecording scenario, that would have been a little clearer to me. Thank you for clarifying.

In my opinion the "correct" implementation on the instrumentation side would be:

Make sense. As a clarification, is it just a non-recording span? It's a non-recording, no-op span that ensures that the span will never reach the exporter in the first place, correct?

Although it makes sense to me intuitively, I don't see a call-out for a no-op or default span object that would not emit spans to spanprocessors in the specification. So maybe that's something that should be added as well.

toumorokoshi avatar Apr 01 '20 23:04 toumorokoshi

I think I understand. You're referring to adding a check, per integration, to see if the IsRecording flag is set. if it's false, then we avoid adding attributes altogether, which would skip a non-trivial amount of code to collect and add attributes

yes

I think if the spec called out explicitly that the resulting span would actually be missing attributes that would have been added in an IsRecording scenario, that would have been a little clearer to me

A non-recording span doesn't record any attributes at all

As a clarification, is it just a non-recording span? It's a non-recording, no-op span that ensures that the span will never reach the exporter in the first place, correct?

A no-op span is one step further than a non-recording span. A non-recording span still propagates context, just doesn't record anything to the backend. It is useful in cases where you do not want to break traces. A no-op span can't be propagated, because it has no trace context or trace state. The non-recording span isn't sent to the backend, but propagates trace context correctly so that traces aren't broken. In this instance, a no-op span would be equally effective and possibly an even better choice.

dyladan avatar Apr 02 '20 01:04 dyladan

@toumorokoshi Your link is actually perfect to illustrate why we additionally could use the "surpress_instrumentation" context variable: In https://github.com/open-telemetry/opentelemetry-python/blob/v0.6.0/ext/opentelemetry-ext-wsgi/src/opentelemetry/ext/wsgi/init.py#L121, attributes are collected to be passed to start_span (so the sampler has more info and knows e.g. the URL), so the is_recording property could only be queried after.

Oberon00 avatar Apr 02 '20 07:04 Oberon00

@toumorokoshi Your link is actually perfect to illustrate why we additionally could use the "surpress_instrumentation" context variable: In https://github.com/open-telemetry/opentelemetry-python/blob/v0.6.0/ext/opentelemetry-ext-wsgi/src/opentelemetry/ext/wsgi/init.py#L121, attributes are collected to be passed to start_span (so the sampler has more info and knows e.g. the URL), so the is_recording property could only be queried after.

Ah! That's a very good point. I guess between the two, I would probably op to just use the context variable, which can then be wrapped in a utiltity method on the trace module itself like trace.instrumentationEnabled(). And I do agree that, in that case, that may make isRecording redundant.

toumorokoshi avatar Apr 02 '20 23:04 toumorokoshi

@pauldraper will a PR be filed on this? I like this approach and would love to see it in the spec so I can implement it.

toumorokoshi avatar Apr 16 '20 23:04 toumorokoshi

@trask I wonder, how does auto-instr-java deal with that? It has a gRPC instrumentation, would that not cause a feedback loop with the OTLP exporter currently?

Oberon00 avatar May 25 '20 14:05 Oberon00

In auto-instr-java, we load the exporter (and it's dependencies) inside of a separate class loader. This is primarily done to avoid version conflicts, but has the nice the side-effect that we can skip instrumentation of those classes (by looking at what classloader they're in) and avoid the feedback loop.

trask avatar May 25 '20 19:05 trask

One interesting caveat to the above: https://github.com/open-telemetry/opentelemetry-auto-instr-java/issues/375#issuecomment-630452595

trask avatar May 25 '20 19:05 trask

For a reference point, I've run into 3) myself. I find spans from e.g., the netty instrumentation noise when they're being wrapped by a client library, for example the aws sdk. Others may not though meaning some configurability would be nice. But it creates the problem that

  1. With auto instrumentation, it wouldn't really be possible to disable netty instrumentation only when called from an RPC SDK, have to disable it globally

  2. Even when disabled, it causes issues where semantics depend on whether there is an underlying instrumentation or not like in https://github.com/open-telemetry/opentelemetry-auto-instr-java/pull/323

Being able to pause new spans for a stack frame would help with this.

anuraaga avatar May 26 '20 00:05 anuraaga

This was discussed in today's Spec SIG. Would someone involved in the conversation today please post a summary of any decisions that were made?

jmacd avatar Apr 27 '21 19:04 jmacd

I created a PR based on the discussion from the SIG #1653

The discussion was primarily around clients creating and exporting APIs that were not specified and how to do it in a way that is compatible with future spec. It was generally agreed that it would be better to export it in some sort of API extensions package that can be rev'd to 2.0 without having to rev the whole API, and would not interfere if the API specification is added later.

In this particular instance, it was agreed that this is a generally good mechanism and should be added to the spec, so that's what #1653 does.

dyladan avatar Apr 27 '21 19:04 dyladan

@dyladan was there a decision around what the right way to move forward for suppressing instrumentations? I see the proposed solution of making a key available through the API was closed.

codeboten avatar May 27 '21 16:05 codeboten

This new instrumentationDisabled flag should be also propagated from process to process. Servers/Consumers could use it to automatically disable instrumentation for their spans.

My use case: multiple services uses common Redis library with added instrumentation. Service A periodically sends KeepAlive messages to other services, and I want to disable tracing for them on both Publisher and Subscriber sides. On Publisher side this is quite easy, I can tell Redis library skip instrumentation. However on Subscriber side this is not so easy, Redis library would have to examine messages to check if this is KeepAlive or not. One potential solution to it is to not send TraceParent/State from Publisher for KeepAlives, and create new span in Subscriber only if they are received. However this has a downside that spans are not created for messages sent from services which are not instrumented yet. Because of this it is better to explicitly send instrumentationDisabled flag.

sirzooro avatar May 27 '22 14:05 sirzooro

@dyladan was there a decision around what the right way to move forward for suppressing instrumentations? I see the proposed solution of making a key available through the API was closed.

Nope. The PR was closed because there wasn't agreement and as far as I know nobody ever followed up. I stopped working on it because I have to admit I was feeling a little defeated.

This new instrumentationDisabled flag should be also propagated from process to process. Servers/Consumers could use it to automatically disable instrumentation for their spans.

This introduces serious trust concerns. Without somehow signing the request to ensure that it is trusted, this would be ripe for abuse. I would argue that paying to cost of inspecting the message to see that it is a keepAlive on the server side is a much simpler solution less prone to errors and abuse.

dyladan avatar May 27 '22 17:05 dyladan

@sirzooro If you need propagation of the flag, you are probably looking for the existing sampled flag, not this feature.

Oberon00 avatar May 28 '22 19:05 Oberon00

@sirzooro If you need propagation of the flag, you are probably looking for the existing sampled flag, not this feature.

Thanks, this is exactly what I was looking for.

sirzooro avatar May 31 '22 11:05 sirzooro

The need for a mechanism like this was brought up again by @tsloughter in the spec call again today. At this point most if not all SIGs have some mechanism like the context solution that my PR proposed. At least JS, Ruby, Python, and .NET have all implemented something similar. Not sure which of those implemented in the API vs SDK, but there are definitely instrumentations in the wild using it in multiple languages.

In JS it was implemented as a context key which the SDK recognizes and provides a helper function to set and unset. IMO this is a poor situation because it tightly couples the instrumentation with the SDK. I would prefer to implement it in the API but did not want to do that without specification.

The proposal was a simple key which suppressed tracing by indicating a nonrecording span should be returned. This prevents exporter loops and allows higher level module instrumentations like axios to suppress lower level ones like http. Other signals were not needed to be suppressed because they don't have these problems (incrementing a metric on export does not cause another export like in tracing, and axios/http would have separate metric streams).

I would like to propose that we move forward with that original proposal. It solves the problems which need to be solved, is already implemented in many places, and does not preclude a more advanced per-signal, or any other, mechanism from being specified in the future. I think this is one place where we have let the desire for perfection get in the way of pragmatism.

dyladan avatar Dec 13 '22 21:12 dyladan

@dyladan by "original proposal" do you mean an existing PR? I was going to make one today and planned it to be a function like suppress_telemetry() which takes optional arguments to only suppress certain signals.

tsloughter avatar Dec 14 '22 18:12 tsloughter

@tsloughter I meant this https://github.com/open-telemetry/opentelemetry-specification/pull/1653

dyladan avatar Dec 14 '22 20:12 dyladan