Context concurrency and parallelism issues
The OpenTelemetry::Context uses a fiber local Ruby Array to keep track of the active Context instance.
There are at least 2 instances in gems that are breaking encapsulation and copying fiber local variables to child threads:
This leads to concurrency and parallel processing issues because multi-threads are pushing and popping entries on the same Stack, resulting in OpenTelemetry::Context::DetachError as well as non-deterministic behavior where the OpenTelemetry::Context.current method returns the wrong instance of the context object.
Helpers that rely on the implicitly active Context, like OpenTelemetry::Trace.current_span, will end up receiving the wrong Span object instance.
There is a proposed fixed here that would end up using Immutable Arrays, however this will increase the number of object allocations per Context stack mutation:
https://github.com/open-telemetry/opentelemetry-ruby/pull/1760
Another option I have considered is to propose using Thread Local variables instead of Fiber local variables, however that would result in concurrency issues when using Fibers where multiple fibers would have access to the same Array instance. We would need a mechanism to seed a spawned thread with a snapshot of the stack so that there could be some consistent context propagation for the gems listed above.
I would like to get some feedback from @open-telemetry/ruby-maintainers on whether or not there are addition alternatives we can consider.
@arielvalentin thanks for your great summary of an issue, including the links to real source code which exhibits the problem. I'm from the Ruby core team and work on concurrency and issues like this.
Copying around thread locals like this is like copying around private instance variables between objects, it's a horrible idea and should never be done.
There is now a feature in Ruby, which does this and is expected to do this: Fiber.storage
irb(main):001> Fiber[:foo] = :bar
=> :bar
irb(main):002* Thread.new do
irb(main):003* pp Fiber[:foo]
irb(main):004> end
=> #<Thread:0x000074fcdef31970 (irb):2 run>
:bar
The fiber storage is inherited by copy so mutations in child keys isn't propagated upwards, but you can use mutable values (which must be thread safe) for this.
Using thread locals for "per request state" is an extremely bad idea.
@ioquatix Thanks for sharing your perspective here.
I suspect that library maintainers are doing this out of convenience without knowing the negative impact that it has on their users.
In our case I would say we generally want implicit immutable state staring within the same thread, but likely explicit sharing when spawning fibers or threads.
As you pointed out if we did use fiber storage, we are still in need an efficient fiber local Immutable Stack or Array data structure to avoid non-deterministic behavior.
There are also other use cases that our users tend to rely on for resetting some state process forking. I personally rely on ActiveSupport::ForkTracker to address some of these, but in the case of the SDK the code checks for when process id changes to reset state in the child that was captured by the parent process.
I know it is quite risky to implement (e.g. fork bombing) but has there been any discussion to add lifecycle hooks to register custom code when processes, threads, or fibers are spawned?
@ioquatix A related problem we have is finding a solution for re-entrant locks on our thread safe structures.
The SDK currently use mutexes to ensure thread safety when mutating a Span. We have a use case where we need to allow the thread to re-enter a mutex described here. I haven't done much in the way of exploring any additional options but your input would be helpful here:
https://github.com/open-telemetry/opentelemetry-ruby/issues/1740
We should strongly encourage library authors to avoid this kind of code, and graphql-ruby are going to fix their implementation: https://github.com/rmosolgo/graphql-ruby/issues/5173#issuecomment-2501891466
Fiber storage supports immutable updates, e.g.
Fiber[:thing] = Fiber[:thing].with(...)
(assuming #with returns a new instance with the updated state).
I know it is quite risky to implement (e.g. fork bombing) but has there been any discussion to add lifecycle hooks to register custom code when processes, threads, or fibers are spawned?
Yes, see Process._fork. As for threads and fibers, it's generally a bad idea.
For thread and fiber locals, you may prefer to use attributes, these are not going to be copied by the code previously discussed, e.g.
Thread.attr_accessor :mylibrary_mything
Thread.current.mylibrary_mything = ... # proper thread local.
For thread and fiber locals, you may prefer to use attributes, these are not going to be copied by the code previously discussed
I benchmarked that approach and several others, result summary is https://github.com/open-telemetry/opentelemetry-ruby/pull/1760#issuecomment-2492416415. For the recursive benchmark, a Fiber attribute is 20% slower than Thread.current fiber-local storage. Concurrent Ruby's FiberLocalVar is faster than a Fiber attribute, but relies on Fiber.storage for its implementation.
I would try to avoid having a dependency on concurrent-ruby it is a lot to pull in.
You should try Thread.attr ... or Fiber.attr ... it gets optimised because of object shapes. It will be the fastest IIRC.
👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.
On a tiny note, while the current storage mechanism is quite flexible internally, it does make it quite hard external tools such as profilers to get this data to include as correlation (e.g. https://github.com/DataDog/dd-trace-rb/pull/3984 is quite a maze). #842 is an earlier discussion of this too.
With opentelemetry also now having support for profiling, this use case may become more common (e.g. beyond just Datadog's Ruby profiler).
@ioquatix @ivoanjo @zvkemp @bensheldon
Francis has a PR up for review with proposed changes to use Fiber attributes instead of Filber locals and I would appreciate your feedback/input given your interest and expertise: https://github.com/open-telemetry/opentelemetry-ruby/pull/1807
Should this issue (and hence #1760) be closed since #1807 is already merged and released?