OnEnding should make spans readonly, except within the processor.
The OnEnding spec mentions: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#onending
The SDK MUST guarantee that the span can no longer be modified by any other thread before invoking OnEnding of the first SpanProcessor
The current OnEnding implementation doesn't prevent the span from being modified in another thread, concurrently with the processor's action.
cc @pantuza
Two ways this can be done:
Java does it by storing the ID of the thread that runs OnEnding, and prevents any other thread from doing so. https://github.com/open-telemetry/opentelemetry-java/pull/6367/files#diff-d716139d47f651d6b6558f3e6f03aa59c98e52723f472b2c6a24bd02ec5fe2ceR338-R342
Go is going to setup a wrapper span that allows bypassing the span being read-only, and will be provided to the processor.
👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.
Should this be reported elsewhere?
@dmathieu this is the appropriate place, though I am not sure this is the bug we will run into when our users start to make changes to the span.
Span mutation methods leverage a mutex instance variable that prevents multiple threads or fibers from mutating the span at the same time.
In Span#finish, all span processors will have to have completed their on_finish blocks before the lock is released and the span may be mutated again:
https://github.com/open-telemetry/opentelemetry-ruby/blob/555b062ef9421784c132aa9b97b29ec637b13b0f/sdk/lib/opentelemetry/sdk/trace/span.rb#L268
That is where I think the real problem is. I assert that when a span processor attempts to mutate the underlying span, it will run into a thread deadlock situation.
Has anyone implemented one of these processors in the wild yet?
cc: @pantuza @mwear @kaylareopelle https://github.com/open-telemetry/opentelemetry-ruby/pull/1713
Hi folks, I agree that span.finish() method safely calls the processors on_finishing() method. Thus, prevents other threads from trying to modify the span. Although, if any user calls the processor.on_finishing() method directly inside many threads, indeed it can generate a concurrency issue.
Probably, calling this processor method directly isn't desirable, but we are never fully aware of how users will use the library. Thus, might be necessary to protect this method individually. Does it make sense?
@arielvalentin , I did not understand how the mutex.synchronize would generate the deadlock? Can you share an example, please? The only situation I can foresee this is the one where one on_finishing() implementation of any given processor get stuck and never releases the Mutex for the next thread trying to acquire it. Is it what you were saying?
It's likely not a deadlock situation as much as an error. I am concerned that mutexes are not re-entrant.
Assuming that is the case then when a span processor attempts to mutate it, then it will hit an additional synchronize block and that may result in errors.
It's likely not a deadlock situation as much as an error. I am concerned that mutexes are not re-entrant.
Assuming that is the case then when a span processor attempts to mutate it, then it will hit an additional synchronize block and that may result in errors.
You are totally right! If any given processor tries to mutate the span, for example setting attributes, it will try to acquire the same Mutex as the finish() method just did. Therefore, the thread gets blocked, cause the Ruby mutexes are not re-entrants as you mentioned before.
I see two alternatives:
- Use a separated Mutex that would be used on span Mutations such as:
a.
add_attribute,set_attribute,add_link,add_eventb. Current mutex being created - Use some other logic that do not require the same Mutex on both operations:
finishandset_attributesa. As the one@endedvariable we already use. b. As other languages did: Java cited by @dmathieu above in this conversation.
If I understand correctly, the span should store Fiber.current and then compare that when making modifications, e.g.
class Span
def initialize(fiber = Fiber.current)
@fiber = fiber
end
def update
raise RuntimeError, "Cannot modify span on non-current fiber!" unless @fiber.equal?(Fiber.current)
# ...
Not exactly. We want the ability for spans to be mutable in a fiber and thread safe way.
Spans may be writable from any thread or fiber as long as access is synchronized. When a span is "finished", then it changes to a state that is Immutable, a.k.a Read Only.
The span has an OnEnding stage where it notifies processors that it's about to transition to Read Only state. This is the final opportunity to make changes to the span before enqueing it for export.
So it's a case where a fiber would have an exclusive re-entrant lock on the span when it's in its transitioning state. If we capture the fiber that initialized it and "locked" on that then the span could not be mutated at any other time or could not be finished by a different fiber.
That's a good starting point for a workable solution, though. Similar to Go, you can pass a wrapper to the callback that has "special" access to the Span's internals, but which bypasses the mutex and instead checks that the owning Thread (or Fiber) is the current one.
Calling on_finishing from processor will cause deadlock if trying to modify any object inside span.
ERROR -- : OpenTelemetry error: deadlock; recursive locking - /usr/local/bundle/gems/opentelemetry-sdk-1.8.0/lib/opentelemetry/sdk/trace/span.rb:80:in `synchronize'
This is caused by double @mutex.synchronize on finish(end_timestamp: nil) and other function such as add_attributes, etc. from processor.on_finishing(self).
e.g.
class TestProcessor
def on_start(span, parent_context); end
def on_finishing(span)
span.set_attribute('hello', 'world')
end
def on_finish(span); end
end
The inability to edit span atributes in on_finishing is causing us a lot of pain - we've migrated from https://github.com/honeycombio/beeline-ruby/ and lost the ability to add an attribute to all parent spans from a child span which we were using in several places. I had written a SpanProcessor to do this by pulling fields from the current Context and adding them to spans in on_finishing before I realized that adding any attributes there causes the above deadlock error.