specification understand and address concerns with ActiveSpan refcounting

Per comments like this one and notes like these from @adriancole's recent distributed tracing workshop, there are justifiable questions about the refcounting happening in OT-Java's ActiveSpan and its (non-merged) OT-Python equivalent.

If I understand correctly (and set me straight me if I do not), nobody wonders whether there's a need to "activate" (and deactivate) an ActiveSpan in some sort of thread-local, most likely using a scoped language construct like try-with-resources in Java or with in Python. That said, there are significant concerns about the coupling of the activation/deactivation and decisions about when Spans are actually finished.

OT-Java's ActiveSpan automatically finishes the underlying Span as soon as the last reference is decremented/deactivated. This makes common-case application code shorter and less error-prone since the programmer need not remember to both finish and deactivate ActiveSpans, especially when deferred work (via closures/callbacks) is involved.

The current OpenTracing approach was also designed to allow for other approaches to this problem by factoring out the ActiveSpanSource interface: e.g., one could imagine an ActiveSpanSource adapter that allowed for ref-count-free activate/deactivate semantics like the ones @bogdandrutu described here.

Changing the current API would be painful (and would be done carefully if at all), but getting this right is extremely important (hence the volume of discussion about PRs like OT-py #52, and OT-java #111, #115 (which was merged), #116, and #117).

My hope in creating this issue is that @adriancole @bogdandrutu @palazzem @jakerobb @lotharsee @pavolloffay @objectiser @carlosalberto @clutchski @yurishkuro or others will chime in with documentation of current concerns / missed requirements / etc and we can go from there. If there are other PRs/issues/docs where ActiveSpan troubles have come up, please drop them in here as well.

Jul 11 '17 21:07 bhs

(fixed title to more accurately reflect intent here)

Jul 11 '17 21:07 bhs

Main concern summed up is that the current approach is the last thing we'd want to use for anything deemed common-case, as it is flawed and hard to spot how it is broken. Doing so amplifies the support implication of figuring out which thing was called incorrectly then shifted the whole thing out of alignment.

The alternative is being explicit, which in the workshop we found to not notably increase the size of the code, nor its complexity. For example, it replaced awkward calls with explicit ones. Since the explicit ones are safer, it seems a no brainer to start with safe instead of putting in something likely to make implementation unreliable.

Jul 12 '17 01:07 codefromthecrypt

My concrete suggestion: rip out the the magic finisher part and keep the ability to implement scoping. Make the scoping thing like others where idiomatic in language (census, grpc, finagle, brave, etc. etc.) Standards usually follow practice, so there's no good reason to not do that.

The magic finisher could be a "use at your own risk" decorating utility, which hopefully wouldn't be used by any serious instrumentation as those writing RPC should do things properly.

Jul 12 '17 01:07 codefromthecrypt

@adriancole re:

... the current approach is the last thing we'd want to use for anything deemed common-case, as it is flawed and hard to spot how it is broken

The current ActiveSpan concepts are a translation of the concepts in Dapper's C++ and Java integration libraries. @bogdandrutu accurately pointed out that there are some tradeoffs/problems therein (esp with daemon threads, repeated callbacks, etc), but there were also meaningful benefits when confronted with async control flows and routine callback/closure functions. E.g., the basic refcounting approach here "just worked" for 10K-span traces of web search, etc, even with a programming idiom that was highly async.

I also wanted to point out that OT-Java intentionally supports both "manual" and "active" Span management modalities, and not just for compatibility reasons: the former does no refcounting at all (nor does it register with the ActiveSpanSource / TLS) and is designed for situations where explicit start/finish is preferable.

I think what you're voting for is something in between, where there's an explicit start/finish on the Span as well as a scoped activate/deactivate mechanism that never finishes the Span, correct? That could be as simple as making the refcounting opt-in via an autoFinish param, a la SpanBuilder#startActive(bool autoFinish) or ActiveSpanSource#makeActive(Span span, bool autoFinish).

Jul 12 '17 05:07 bhs

@bhs I had ping you and OT team at an sky-walking issue and gitter, but, sadly, I got no response yet. In sw 3.2 iteration, I want to push to support OT-java 0.30, here is the bridge code:

https://github.com/wu-sheng/sky-walking/tree/feature/high-performance-agent/apm-application-toolkit/apm-toolkit-opentracing/src/main/java/org/skywalking/apm/toolkit/opentracing

With the NeedSnifferActivation annotation, means what sw will do in the agent.

Here are two parts of How sky-walking did the supporting for OT-java 0.30.0**.

For Continuation, sw can't support the ref counter, because we don't care about it.

In sw, each trace has multi trace segments, each one of them belongs to one thread only, So when cross-thread happened, we capture a snapshot( something like Continuation ), and continued it in sub thread, even in batch mode.

In the process, we just build the references between two segments (multi thread). That is why sw didn't need counter mechanism.

For ActiveSpan and Span, sw just consider them as a shell for sw core/real span. The active depends on the context stack, the top one is active, otherwise, it isn't.

Jul 12 '17 05:07 wu-sheng

I was looking around on github to try and grok anyone's real world experience with ActiveSpan as it is defined today. What I found was that few tracers support it, which makes me wonder how it was tested in any practical sense. This makes it an odd choice to make as a primary api.

https://github.com/search?l=Java&q=ActiveSpanSource&type=Code&utf8=%E2%9C%93

For example, of the maintained tracers, I can only find noop impls ex in lightstep and instana (cc @codingfabian). Is anyone using this or can't live without it? If this is a niche thing, maybe we can save time discussing by chopping it.

Jul 12 '17 05:07 codefromthecrypt

@bhs I think we should consider desired practice at google if we are citing practice at google. census is being developed without this style, by folks at google.

Specifically, I do not want this fake reference counting (counting close is not reference counting) as a part of the core library. IOTW, remove the ActiveSpan api, or move it such that implementing a tracer, spanbuilder etc doesn't require exposing this errorprone type.

Jul 12 '17 06:07 codefromthecrypt

concretely, in census, there's a convenience type to close a span and a scope simultaneously. This is a lot different than counting calls to close and assuming a certain count implies finishing the span.

Here's the example from census (which I reviewed). Notably this is a lot less fancy, and therefore easier to support for routine synchronous tracing.

try (Scope ss = tracer.spanBuilder("MyChildSpan").startScopedSpan()) {
  tracer.getCurrentSpan().addAnnotation("my annotation");
  doSomeWork();  // Here the new span is in the current Context, so it can be used
                 // implicitly anywhere down the stack. Anytime in this closure the span
                 // can be accessed via tracer.getCurrentSpan().
}

https://github.com/census-instrumentation/opencensus-java/blob/master/api/src/main/java/io/opencensus/trace/SpanBuilder.java#L218

https://github.com/census-instrumentation/opencensus-java/blob/master/api/src/main/java/io/opencensus/trace/ScopedSpanHandle.java

Jul 12 '17 06:07 codefromthecrypt

Specifically, I do not want this fake reference counting (counting close is not reference counting) as a part of the core library

+1. And further advice, Continuation#activate returns an ActiveSpan, but without any chance to set operationName for this span in initialization stage, I didn't like it. When cross-thread, the spanA in parent thread, don't mean spanA should be active in sub thread, so maybe the tracer implementation will build a new span(spanB) and set a reference (spanB child_of/follow_of spanA). At this point, no operationName is intractability.

Since OT is just an API layer, we didn't supervise Span is active or not, so IMO, the only one Span concept is better.

Jul 12 '17 06:07 wu-sheng

without any chance to set operationName for this span in initialization stage, I didn't like it

Not sure what you mean by this. BaseSpan (superclass of Span and ActiveSpan) has a setOperationName method on it. You can call it from a continuation or from the original active span.

Jul 12 '17 14:07 jakerobb

@jakerobb The continuation#activate returns an activeSpan, I mean where is from? In sky-walking, we didn't propagate the span in the parent thread, into the child/sub thread, so we have to create a new one. But in here, the user can't set the operationName.

I know the user can call the setOperationName later, after the span created. But in Tracer interface, OT asked user to create a spanBuilder by the given operationName, like this:

    /**
     * Return a new SpanBuilder for a Span with the given `operationName`.
     *
     * <p>You can override the operationName later via {@link Span#setOperationName(String)}.
     *
     * <p>A contrived example:
     * <pre><code>
     *   Tracer tracer = ...
     *
     *   // Note: if there is a `tracer.activeSpan()`, it will be used as the target of an implicit CHILD_OF
     *   // Reference for "workSpan" when `startActive()` is invoked.
     *   try (ActiveSpan workSpan = tracer.buildSpan("DoWork").startActive()) {
     *       workSpan.setTag("...", "...");
     *       // etc, etc
     *   }
     *
     *   // It's also possible to create Spans manually, bypassing the ActiveSpanSource activation.
     *   Span http = tracer.buildSpan("HandleHTTPRequest")
     *                     .asChildOf(rpcSpanContext)  // an explicit parent
     *                     .withTag("user_agent", req.UserAgent)
     *                     .withTag("lucky_number", 42)
     *                     .startManual();
     * </code></pre>
     */
    SpanBuilder buildSpan(String operationName);

We are providing two style to create span, I mean this.

Jul 12 '17 14:07 wu-sheng

continuation.activate doesn't create a new span, though. It gives you an instance of an already-existing span for which you've already provided an operation name.

Jul 12 '17 15:07 jakerobb

That is another issue I mentioned, as a spec, we shouldn't ask the implementation to propagate span in parent thread, which is not the way sky-walking did for cross thread.

That is why I say so. Hope it is clear for you.😃

Jul 12 '17 15:07 wu-sheng

The reason our instana tracer looks like noop is because its actual inner workings are injected at runtime. We do not use refcounting, and we do not want to use refcounting. We have a proprietary cross thread active span tracking method that is more efficient, because it does not allocate memory for new spans. In our OT compatibility mode we currently do not use the refcounting, because in all current cases our own mechanism works. should a customer want to use refcounting semantics, we would need to change that, but this would come then with a performance hit.

Jul 12 '17 15:07 CodingFabian

skywalking did very similar works with instana. @CodingFabian. Agree with you in refcounting.

Jul 12 '17 15:07 wu-sheng

@adriancole

I can only find noop impls ex in lightstep and instana

LightStep's Tracer uses the ActiveSpanSource provided by the caller or the straightforward OT util ActiveSpanSource by default.

I will respond to the more important points separately.

Jul 12 '17 16:07 bhs

Does it make sense to have a bit of a side by side comparison of the proposed approaches? This might make it a bit easier for me to parse.

Jul 12 '17 16:07 clutchski

Re important points:

Regarding Google/Dapper/Census: my point about Dapper's instrumentation library is simply that the refcounting approach worked well in a mature, complex codebase that involved substantial intrinsic complexity, asynchrony, callbacks, and so forth. Census != "all of Google", and of course neither does Dapper. This is a data point, period, and IMO refutes the idea that the refcounting approach is crazy/broken/indefensible. It is certainly debatable, but that's different than "broken".
Regarding your specific suggestions, I (ironically given the bickering just above) like them, at least in spirit. I particularly like the idea that the core concepts around ActiveSpan and the reference counting pieces can be decoupled more than they are now.
Regarding production usage: I am not allowed to specify companies for contractual reasons, but this stuff does run (and operate as intended) in production environments at many brand-name tech companies.

I'll let others chime in with their thoughts on this. I am on paternity leave right now but will try to find time to experiment with ^^^ decoupling and see if there's a path forward that's (a) not terribly disruptive and (b) provides more optionality for callers and Tracer implementors.

Jul 12 '17 16:07 bhs

@clutchski

Does it make sense to have a bit of a side by side comparison of the proposed approaches?

(Definitely)

Jul 12 '17 16:07 bhs

FWIW, on https://github.com/openzipkin/brave-opentracing/pull/43, I have brave-opentracing working with opentracing-api-0.30.0 using the refcounting approach. We have 30+ services instrumented with it and working great.

Jul 12 '17 17:07 jakerobb

@bhs

I am on paternity leave right now

Congrats! Get some sleep. :)

Jul 12 '17 18:07 jakerobb

provides more optionality for callers and Tracer implementors.

+1. I don't like the ref counter, I am not saying it's wrong or didn't work. @bhs . Just want the spec and OT-java lib let the implementation make their own choices.

Jul 13 '17 00:07 wu-sheng

@jakerobb for disclosure we should say that the implementation had a double-increment bug just fixed. Such a bug is possible with the fake ref-counting (again folks counting close is not ref counting!) and impossible without. All code has bugs, certainly mine does, but this was hard to figure out, right?

Jul 13 '17 01:07 codefromthecrypt

@adriancole you are sounding like @RealDonaldTrump with this "fake ref-counting" FUD! There is, in fact, a threadsafe count of references to an ActiveSpan or its Continuation. When the count gets to zero, something happens. How the count is incremented or decremented is neither here nor there.

Regarding the next steps here, I would like to try to come up with some alternative proposals that will give all OT users (i.e., regardless of Tracer implementation) something like the current approach from a semantic standpoint, as I maintain that there are significant benefits which I can enumerate given more time; and I would also like to give people a way to activate/deactivate a Span without getting into any of the apparently controversial implicit-finish behavior.

Jul 13 '17 04:07 bhs

@bhs I don't follow the user you mentioned, so not sure how much I sound like him. Can you point me to the code where what you said is true? and that in reality it isn't just counting the amount of times someone invokes deregister?

I see code like this, which is literally counting the times close is called? Or isn't it? FUD implies that I'm saying something that isn't true. Tell me what I don't understand about this code, and that you do understand about it?

public void deactivate() {
  if (0 == refCount.decrementAndGet()) {
     wrapped.finish();
     scope.close();
     source.deregisterSpan(wrapped.unwrap());
  }
}

--snip--

public void close() {
  deactivate();
}

Jul 13 '17 10:07 codefromthecrypt

@adriancole the reference count is incremented whenever a Continuation is capture()ed and decremented whenever an ActiveSpan is deactivate()d (or close()d, which is a simply a synonym for deactivate() that's provided for try-with-resources). As documented here, the last ref-- triggers finish(). I.e., there is a reference count... it increments during capture() and decrements during deactivate() (or the synonym, close()). There is nothing fancy about it, nor is there anything "fake" about it.

Next time I can really sit down with my laptop I will try to move forward with the meat of this issue as I do think there is good reason to make changes per your concerns.

Jul 13 '17 15:07 bhs

I think part of the confusion is the fact that it isn't really clear how instrumentation is supposed to interact with the continuation model. My understanding of the intent is that the continuation keeps the existing tracer around while work is being done on different threads, allowing other threads to link new spans to it, but those new spans aren't created automatically and a span can be kept open by other threads without anything being added to it.

Anyway, this is a specific area where I think it would be beneficial to have better documentation and examples in addition to a cross implementation compliance test.

Jul 14 '17 06:07 tylerbenson

I think part of the confusion is the fact that it isn't really clear how instrumentation is supposed to interact with the continuation model.

+1. A tracer implementation can support the interop module for a trace crosses threads, but each implementation has their own way.

Jul 14 '17 07:07 wu-sheng

From our perspective (Kamon/Kamino teams) the main concern with the automatic closing via refcount is that it mixes two different concerns: in-process context propagation and span life-cycle management.

Having any sort of ref counting in Spans can break behavior without notice. I can totally see a typical Servlet application relying on ActiveSpan auto-finishing and for it to work nice, but as soon as any app/library code starts capturing continuations to be used after the response has been sent to the client or just capturing them and failing to use them, the main request Span might take longer to be automatically finished, if at all. This is my main concern with ref counting as it opens the door for things to break. @bhs mentioned that OT also supports the "manual" mode which doesn't do ref counting nor puts anything in TLS, but in most cases you definitely need to have the Span in TLS for it to properly propagate when RPC/Database/TracedWhatever happen, so, you end up needing a ActiveSpan that doesn't finish automatically. It has already been mentioned that SkyWalking and Instana do not support ref counting and we at Kamon are not supporting it neither because of the reasons exposed above. This might be a hint that something should definitely be done about it.

The ActiveSpanSource/Continuation/ActiveSpan abstractions are quite comfortable to use IMHO, they abstract away where the ActiveSpan actually is (even though most probably it is going to be in a ThreadLocal) and allows for swapping behavior when needed while still providing a simple and expressive API. In that sense, awesome.. for the sole purpose of in-process context propagation these abstractions are nice, but mixing them up with span life-cycle management is like signing up for unexpected behavior and debugging nightmares.

I already asked whether ref counting was a required behavior for tracers and the answer seemed pretty clear, although the description of this issue suggests that it is fine for a tracer not to do this. If it is fine for a tracer not to do it then it should be absolutely clear in the documentation. That being said, my suggestion is to completely remove any mention/implementation of reference counting from the opentracing-java codebase and let the tracers give this functionality to their users if/when it makes sense.

Jul 14 '17 07:07 ivantopo

@ivantopo interesting input. One thing I've to ask you is about the difference between something that looks like a good abstraction and something that is a good abstraction. In past work, I've usually found 3 implementations to be the sweet spot where you know the abstraction is starting to hold water. Meanwhile, the ActiveSpanSource, regardless of its likability is lacking in this department. Somehow it is promoted to a core api: you cannot implement a tracer without requiring it. Do you feel it is a good idea to keep such an api as a core api considering the second open source implementation resulted in quite a lot of discussion including a surprise to many that it is in fact hinting at an impl (aka leaky abstraction)?

Jul 14 '17 10:07 codefromthecrypt

specification specification copied to clipboard

understand and address concerns with ActiveSpan refcounting

specification
specification copied to clipboard