opentelemetry-ruby

rfc: Allow reading current context for another thread

Open ivoanjo opened this issue 3 years ago • 9 comments

I've been working on integrating Datadog's continuous profiler with opentelemetry traces (see Datadog/dd-trace-rb#1568).

The profiler runs on a background thread and needs to be able to access the current context for another thread, so it can read that thread's current span's trace ID and span ID (if any is active).

To do so, I was originally using

thread = ...
::OpenTelemetry::Trace.current_span(thread[::OpenTelemetry::Context::KEY])

Unfortunately, after #807, this interface was changed, and more importantly, Context::KEY was removed and replaced with Context::STACK_KEY. STACK_KEY was marked as private_constant.

With 1.0.0.rc2, the only way of getting this information is by relying on private implementation details, which isn't great.

Thus, I would like to ask if it'd be reasonable to add an optional thread parameter to Context.current. This would make it easy to access the needed information, and it would even be more future-proof as the outside code doesn't need to care anymore where and how the context is stored.
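
For illustration, the call site could then look something like this (the optional thread argument is exactly the part that does not exist today -- it's what this RFC proposes, not current behaviour):

# Hypothetical usage: Context.current(thread) is the API being proposed here.
thread = ... # the thread being profiled
context = ::OpenTelemetry::Context.current(thread)
span = ::OpenTelemetry::Trace.current_span(context)
span.context.trace_id # => binary trace id (all zeros when span is Span::INVALID)
span.context.span_id  # => binary span id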

ivoanjo avatar Jun 25 '21 10:06 ivoanjo


I'm mildly terrified by this proposal. Accessing another thread's thread locals seems fraught with problems, and it feels pretty invasive too. I need to 🤔 about this more. Some initial poking around turned up a bug in Puma earlier this year, triggering a segfault in Ruby 2.6 when accessing other threads' thread locals. That doesn't inspire confidence. Any such segfault will point to otel-ruby as the culprit.

fbogsany avatar Jun 28 '21 14:06 fbogsany

To be fair, any issues would only appear when the API is exercised, so if anything, I'd expect a "we enabled the profiler and our app broke" report rather than the reverse 🤣 . The stack trace will also be pointing at the profiler calling into otel-ruby.

Ruby being a managed VM, a segfault should never happen. The Ruby team seems to be quite responsive to those kinds of issues, so I'm sure such reports would be treated seriously.

Is there anything I can add that would help to derisk this in your view?

ivoanjo avatar Jun 28 '21 14:06 ivoanjo

@ivoanjo We talked about this today in the Ruby SIG - we'd like to continue to push back since we think this is largely only safe in MRI and we can't guarantee it wouldn't blow up in other implementations.

Given that dd-trace will need to monkey-patch BatchSpanProcessor to provide trace <-> profile correlation anyways, is it possible to also carry a monkey-patch for Context as well?

If it turns out to be impossible, we may be able to provide a "safety off" flag that still allows this, but we'd prefer not to if possible.

ahayworth avatar Jun 29 '21 16:06 ahayworth

Thanks for the feedback!

Given that dd-trace will need to monkey-patch BatchSpanProcessor to provide trace <-> profile correlation anyways, [...]

I was actually not planning on going the monkey patch route to provide the correlation. Instead, I created a "Span Processor" that users add in their configuration block and that adds the needed information in its on_start callback. Here's a direct link to the draft docs describing it.
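
In rough terms it looks like the sketch below (simplified and not the actual dd-trace-rb code; it just follows the SDK span processor duck type, where on_start receives the span and its parent context):

# Simplified sketch of the correlation span processor idea.
class ProfilerCorrelationSpanProcessor
  # on_start runs synchronously on the thread that started the span, so
  # Thread.current is the thread whose profiling samples we want to tag.
  def on_start(span, _parent_context)
    record_ids_for(Thread.current, span.context.trace_id, span.context.span_id)
  end

  def on_finish(_span); end
  # (a real processor also responds to force_flush and shutdown)

  private

  # Hypothetical helper standing in for the profiler's own bookkeeping.
  def record_ids_for(thread, trace_id, span_id)
    # ...
  end
end

Users then add it in their configuration block, e.g. OpenTelemetry::SDK.configure { |c| c.add_span_processor(ProfilerCorrelationSpanProcessor.new) }.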

is it possible to also carry a monkey-patch for Context as well?

If it turns out to be impossible, we may be able to provide a "safety off" flag that still allows this, but we'd prefer not to if possible.

There's actually no need for a monkey patch. Since this is Ruby, I can always reach in and access the STACK_KEY constant and effectively reimplement the current method I'm proposing on my side.

My concern is that, well ~~that's cheating~~ this creates a very tight coupling to the current private implementation of opentelemetry-ruby, which can break at any time, making users who combine dd-trace-rb + otel-ruby unhappy.
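
For reference, the "reach in" version looks roughly like this (it hard-codes two 1.0.0.rc2 internals: that the private STACK_KEY constant exists, and that the stack is a plain per-thread array):

# Brittle: depends on private opentelemetry-ruby implementation details.
stack_key = ::OpenTelemetry::Context.const_get(:STACK_KEY) # const_get still resolves private constants
stack = thread[stack_key] # `thread` is the Thread being profiled
context = stack && stack.last
span = context ? ::OpenTelemetry::Trace.current_span(context) : ::OpenTelemetry::Trace::Span::INVALID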


I spent some time brainstorming some alternative approaches:

  1. Adding a thread argument in calls to OpenTelemetry::Trace.current_span(thread: some_thread). This would abstract away the existence of a context entirely, and would just return the span, or Span::INVALID as usual. Because how we get the context is now abstract, the actual implementation matters less, and can be updated and maintained as needed to work on different Rubies.

    One note that I'll add is that Ruby is defined by what MRI is, so other Rubies should actually provide compatibility with the thread-local APIs, but I agree that there may be performance considerations attached to it :)

  2. Getting even more to the point: adding an OpenTelemetry::Trace.current_span_context(thread) or even OpenTelemetry::Trace.current_ids(thread) API. This skips the span and the context entirely and just exposes the ids for any debugging or profiling tools that want to show them (see the rough sketch after this list).

  3. Make STACK_KEY a non-private constant (like KEY was before it). This makes it part of the API, so any change to it becomes a breaking change that is much easier to spot and guard against.

  4. Do nothing. As I said above, I have it working, it's just that it's very brittle, and I'd like to provide a better experience to the users of both projects :)
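
To make option 2 a bit more concrete, it could look something like this (completely hypothetical, and it leans on a thread-aware Context.current, so the internals stay inside otel-ruby where they can be kept in sync):

module OpenTelemetry
  module Trace
    # Hypothetical helper: returns [trace_id, span_id] for the span currently
    # active on `thread`, or nil if that thread has no active span.
    def self.current_ids(thread = Thread.current)
      context = OpenTelemetry::Context.current(thread) # assumes the proposed thread argument
      span_context = current_span(context).context
      return nil unless span_context.valid?

      [span_context.trace_id, span_context.span_id]
    end
  end
end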

ivoanjo avatar Jun 30 '21 11:06 ivoanjo

One note that I'll add is that Ruby is defined by what MRI is, so other Rubies should actually provide compatibility with the thread-local APIs, but I agree that there may be performance considerations attached to it :)

There are differences other than just performance, specifically around threading (concurrent-ruby exists to abstract these differences) and processes (e.g. fork). While other implementations strive to match MRI's behaviour where they can, there is no formal spec to follow and none of them has achieved this goal, yet they are used in production environments anyway and we strive to support them.

There is a 5th option, which is for you to completely replace the Context implementation with one that matches the public API and gives you the hooks you need. This would give you the stability you need, and if we make improvements to our implementation, you could choose whether or not you want to pick them up. You would then not depend on our internal implementation details and your customers will have a clearer idea of where to direct support requests.

We're currently quite close to a 1.0 release for Tracing, and our highest priorities are stability, spec compliance, and minimizing the surface area of the API we have to support. Adding an optional thread (or fiber) argument is a backwards compatible change we can make later, but we'd prefer to defer that addition until we've had time to understand the design space better and the spec has evolved to address related concerns, such as metric exemplars and exposing trace context to loggers (both of which can have similar threading issues).

fbogsany avatar Jun 30 '21 14:06 fbogsany

@ivoanjo and I talked a little bit this morning and the general outcome was:

  1. We aren't sure how we'd really implement the "5th option" (replacing the Context implementation). The current system isn't designed to be pluggable, and so the only options seem to be "re-define the entire class and all methods within it" or "contribute some kind of ContextManager implementation". Neither one seems particularly appealing - "re-defining the class" is a lot of work and seems brittle because it's still relying on SDK internals; and contributing a "ContextManager" implementation seems like the kind of thing we don't want to do right before a 1.0 release. Perhaps the "ContextManager" idea would be good for post-1.0, but I also don't know if anyone else even needs such a thing.

  2. @ivoanjo's option 2 - a "debug" API of sorts - seems the most appealing option to us at the moment, and is probably minimally invasive. We discussed multiple ways that it could be implemented:

    1. A map that is updated when a span is made active, containing thread identifier => trace id, span id. The immediate complication is that the map would need to be thread-safe. The performance implications really hinge on whether or not we make the thread-safety guarantees only for writes, or for writes and reads. If the thread-safety guarantees are also made for reads, then a frequent reader (like, for example, some kind of always on distributed profiler 😉 ) could drag down the SDK's performance substantially.
    2. Some kind of debugging callback system, which would receive the thread ID => trace ID, span ID information when a span's context was made current (a minimal sketch follows this list). If no debugging handlers are attached, this has the advantage of having the least impact on performance - nothing happens, and we only pay the overhead of checking whether any debug handlers are attached. The callbacks could become a major drag on performance, however, if the author of the callbacks does something inefficient themselves. On the other hand, that's also an opportunity for performance-conscious folks like @ivoanjo to really optimize how they're tracking thread ID => span mappings in their SDKs.
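
To make the callback idea a little more concrete, here's a minimal sketch (all names are made up - this is not a proposed final API):

# Minimal sketch of the debug-callback idea; names are hypothetical.
module ContextDebugHooks
  @handlers = []
  @lock = Mutex.new

  def self.register(&block)
    @lock.synchronize { @handlers << block }
  end

  # Would be invoked by Context (or a span processor) on the thread that is
  # making a span's context current.
  def self.notify(span_context)
    return if @handlers.empty? # fast path: nothing registered, negligible overhead
    @handlers.each { |h| h.call(Thread.current, span_context.trace_id, span_context.span_id) }
  end
end

A profiler would call ContextDebugHooks.register once and maintain its own thread => ids mapping inside the handler.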

I think I'm going to take a stab at option 2 this week, just to see if I can come up with something that seems reasonable. I'd be interested if others have additional thoughts here... I'm taking this as an opportunity for us to start building out some debug helpers in the SDK for folks, which seem like something we'll want as it grows in popularity. 😄

ahayworth avatar Jul 06 '21 14:07 ahayworth

the only options seem to be "re-define the entire class and all methods within it"

That was my intent, yes. It is not a large class, and it is self-contained - nothing in the API is reaching into its internals. As long as you match the public API (which we have to commit to supporting for something like 3 years IIRC), you should be fine ™️ .
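
Very roughly, and only for the stack-management part (a real replacement has to cover the whole public surface - with_current, with_value, value, clear, create_key, and so on - and deal with fibers and dead threads), something like:

# Partial, illustrative sketch: swap the storage so another thread can read it.
# Assumes the OpenTelemetry API is already loaded; simplifies attach/detach token handling.
class OpenTelemetry::Context
  STACKS = {} # Thread => Array<Context>
  STACKS_LOCK = Mutex.new

  def self.current
    stack_for(Thread.current).last || ROOT
  end

  def self.attach(context)
    stack = stack_for(Thread.current)
    stack.push(context)
    stack.size # token handed back to detach
  end

  def self.detach(token)
    stack = stack_for(Thread.current)
    matched = stack.size == token # simplified; the real implementation reports mismatches
    stack.pop
    matched
  end

  # The extra hook a profiler needs: read another thread's current context.
  def self.current_for(thread)
    stack_for(thread).last || ROOT
  end

  def self.stack_for(thread)
    STACKS_LOCK.synchronize { STACKS[thread] ||= [] }
  end
  private_class_method :stack_for
end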

fbogsany avatar Jul 06 '21 14:07 fbogsany

👋 This pull request has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this pull request will be closed eventually by the stale bot

github-actions[bot] avatar Apr 25 '24 01:04 github-actions[bot]