opentelemetry-ruby icon indicating copy to clipboard operation
opentelemetry-ruby copied to clipboard

Context concurrency and parallelism issues

Open arielvalentin opened this issue 1 year ago • 11 comments

The OpenTelemetry::Context uses a fiber local Ruby Array to keep track of the active Context instance.

There are at least 2 instances in gems that are breaking encapsulation and copying fiber local variables to child threads:

This leads to concurrency and parallel processing issues because multi-threads are pushing and popping entries on the same Stack, resulting in OpenTelemetry::Context::DetachError as well as non-deterministic behavior where the OpenTelemetry::Context.current method returns the wrong instance of the context object.

Helpers that rely on the implicitly active Context, like OpenTelemetry::Trace.current_span, will end up receiving the wrong Span object instance.

There is a proposed fixed here that would end up using Immutable Arrays, however this will increase the number of object allocations per Context stack mutation:

https://github.com/open-telemetry/opentelemetry-ruby/pull/1760

Another option I have considered is to propose using Thread Local variables instead of Fiber local variables, however that would result in concurrency issues when using Fibers where multiple fibers would have access to the same Array instance. We would need a mechanism to seed a spawned thread with a snapshot of the stack so that there could be some consistent context propagation for the gems listed above.

I would like to get some feedback from @open-telemetry/ruby-maintainers on whether or not there are addition alternatives we can consider.

arielvalentin avatar Nov 21 '24 12:11 arielvalentin

@arielvalentin thanks for your great summary of an issue, including the links to real source code which exhibits the problem. I'm from the Ruby core team and work on concurrency and issues like this.

Copying around thread locals like this is like copying around private instance variables between objects, it's a horrible idea and should never be done.

There is now a feature in Ruby, which does this and is expected to do this: Fiber.storage

irb(main):001> Fiber[:foo] = :bar
=> :bar
irb(main):002* Thread.new do
irb(main):003*   pp Fiber[:foo]
irb(main):004> end
=> #<Thread:0x000074fcdef31970 (irb):2 run>
:bar

The fiber storage is inherited by copy so mutations in child keys isn't propagated upwards, but you can use mutable values (which must be thread safe) for this.

Using thread locals for "per request state" is an extremely bad idea.

ioquatix avatar Nov 22 '24 21:11 ioquatix

@ioquatix Thanks for sharing your perspective here.

I suspect that library maintainers are doing this out of convenience without knowing the negative impact that it has on their users.

In our case I would say we generally want implicit immutable state staring within the same thread, but likely explicit sharing when spawning fibers or threads.

As you pointed out if we did use fiber storage, we are still in need an efficient fiber local Immutable Stack or Array data structure to avoid non-deterministic behavior.

There are also other use cases that our users tend to rely on for resetting some state process forking. I personally rely on ActiveSupport::ForkTracker to address some of these, but in the case of the SDK the code checks for when process id changes to reset state in the child that was captured by the parent process.

I know it is quite risky to implement (e.g. fork bombing) but has there been any discussion to add lifecycle hooks to register custom code when processes, threads, or fibers are spawned?

arielvalentin avatar Nov 26 '24 13:11 arielvalentin

@ioquatix A related problem we have is finding a solution for re-entrant locks on our thread safe structures.

The SDK currently use mutexes to ensure thread safety when mutating a Span. We have a use case where we need to allow the thread to re-enter a mutex described here. I haven't done much in the way of exploring any additional options but your input would be helpful here:

https://github.com/open-telemetry/opentelemetry-ruby/issues/1740

arielvalentin avatar Nov 26 '24 13:11 arielvalentin

We should strongly encourage library authors to avoid this kind of code, and graphql-ruby are going to fix their implementation: https://github.com/rmosolgo/graphql-ruby/issues/5173#issuecomment-2501891466

Fiber storage supports immutable updates, e.g.

Fiber[:thing] = Fiber[:thing].with(...)

(assuming #with returns a new instance with the updated state).

I know it is quite risky to implement (e.g. fork bombing) but has there been any discussion to add lifecycle hooks to register custom code when processes, threads, or fibers are spawned?

Yes, see Process._fork. As for threads and fibers, it's generally a bad idea.

For thread and fiber locals, you may prefer to use attributes, these are not going to be copied by the code previously discussed, e.g.

Thread.attr_accessor :mylibrary_mything

Thread.current.mylibrary_mything = ... # proper thread local.

ioquatix avatar Nov 27 '24 23:11 ioquatix

For thread and fiber locals, you may prefer to use attributes, these are not going to be copied by the code previously discussed

I benchmarked that approach and several others, result summary is https://github.com/open-telemetry/opentelemetry-ruby/pull/1760#issuecomment-2492416415. For the recursive benchmark, a Fiber attribute is 20% slower than Thread.current fiber-local storage. Concurrent Ruby's FiberLocalVar is faster than a Fiber attribute, but relies on Fiber.storage for its implementation.

fbogsany avatar Nov 28 '24 02:11 fbogsany

I would try to avoid having a dependency on concurrent-ruby it is a lot to pull in.

ioquatix avatar Nov 28 '24 05:11 ioquatix

You should try Thread.attr ... or Fiber.attr ... it gets optimised because of object shapes. It will be the fastest IIRC.

ioquatix avatar Nov 28 '24 05:11 ioquatix

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

github-actions[bot] avatar Dec 29 '24 02:12 github-actions[bot]

On a tiny note, while the current storage mechanism is quite flexible internally, it does make it quite hard external tools such as profilers to get this data to include as correlation (e.g. https://github.com/DataDog/dd-trace-rb/pull/3984 is quite a maze). #842 is an earlier discussion of this too.

With opentelemetry also now having support for profiling, this use case may become more common (e.g. beyond just Datadog's Ruby profiler).

ivoanjo avatar Jan 02 '25 11:01 ivoanjo

@ioquatix @ivoanjo @zvkemp @bensheldon

Francis has a PR up for review with proposed changes to use Fiber attributes instead of Filber locals and I would appreciate your feedback/input given your interest and expertise: https://github.com/open-telemetry/opentelemetry-ruby/pull/1807

arielvalentin avatar Feb 07 '25 14:02 arielvalentin

Should this issue (and hence #1760) be closed since #1807 is already merged and released?

datbth avatar Sep 11 '25 05:09 datbth