dd-trace-rb
Allow Developers to Tag Traces After Response Sent To User
Is your feature request related to a problem? Please describe.
Our application spends a considerable amount of time (p50 10ms, p95 25ms) adding tags to traces on every single request. The tags are essential to our observability, but adding them slows down requests for users.
Describe the goal of the feature
The goal of this feature is to make it easy for developers to add tags to a trace after the response has been sent to the user. This could be accomplished using rack.after_reply procs (supported by Puma and Unicorn, maybe others?). We were able to do this, but it involved monkey patching the Rack request span to defer finishing:
class << span
  alias_method :original_finish, :finish

  def finish(_end_time = nil)
    # Record the time that the span is finished so we can pass it along later
    @finished_at = Datadog::Core::Utils::Time.now.utc
  end

  # Finish the span for real, using the timestamp captured above
  def deferred_finish
    original_finish(@finished_at)
  end
end
class AfterReplyMiddleware
  ...

  def call(env)
    # Register a hook that the server (Puma, Unicorn, etc.) runs after the
    # response has been written to the client
    env['rack.after_reply'] ||= []
    env['rack.after_reply'] << -> do
      span = env['datadog.rack_request_span']
      do_tagging(span, env)
      span.deferred_finish
    end

    @app.call(env)
  end
end
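For completeness, wiring it in is just a matter of adding the middleware to the stack (a sketch assuming a Rails app; it has to sit below Datadog's Rack middleware so that env['datadog.rack_request_span'] is already populated when it runs):

  # config/application.rb (sketch)
  config.middleware.use AfterReplyMiddleware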
Ideally the developer could configure a proc (e.g. for tagging) to run after the response is sent to the user, but before the span is sent to Datadog.
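For example, something along these lines (hypothetical; the after_reply: option and its signature do not exist today and are made up purely to illustrate the request):

  Datadog.configure do |c|
    # Hypothetical option: run this proc after the response has been sent,
    # but before the rack request span is flushed to the agent.
    c.tracing.instrument :rack, after_reply: lambda { |span, env|
      span.set_tag('app.account_plan', lookup_account_plan(env)) # lookup_account_plan is illustrative
    }
  end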
Describe alternatives you've considered
If the proc could execute on the background thread that writes the traces, that would also speed up response times, but it would require the developer to consider the thread safety of the proc (I'm pretty sure our tagging code is not thread-safe).
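As a sketch of what that would ask of the developer (purely illustrative; the cache, the lambda, and the lookup_plan helper are made up), any state the proc shares with request threads would need explicit synchronization:

  PLAN_CACHE = {}
  PLAN_CACHE_LOCK = Mutex.new

  # If this ran on the trace writer's background thread, shared state like the
  # cache below would have to be guarded against concurrent request threads.
  deferred_tagging = lambda do |span, env|
    plan = PLAN_CACHE_LOCK.synchronize do
      PLAN_CACHE[env['HTTP_X_ACCOUNT_ID']] ||= lookup_plan(env) # lookup_plan is illustrative
    end
    span.set_tag('account.plan', plan)
  end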
Our application spends a considerable amount of time (p50 10ms, p95 25ms) adding tags to traces on every single request.
@sircpl Can you clarify on this? It's really adding 10ms on average to add a tag to a span? How was this measured? Adding tags should not take this long, so I'd very much like to replicate and benchmark this.
Generally speaking, the instrumentation is meant to (and built to) minimize overhead on the "critical path": the user-managed thread running operations for the user's application. To accomplish this, we try to do the bare minimum to preserve the trace API functionality, and defer heavier tasks (e.g. serialization and transmission) to the background thread.
The instrumentation must collect metadata during the "critical path": some important information mutates over the life of the request, or may become entirely unavailable (drop out of scope) beyond the lifetime of the operation being measured. Here we don't have a choice.
Where we do have a choice is in how this metadata is recorded, serialized, and transmitted. However, what we're doing today is already effectively "deferral": we don't serialize or transmit these tags in the user thread. Presumably the foreground overhead comes from the act of recording, and we can't defer that (per the prior paragraph).
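To make that split concrete, here is a simplified model of where the work happens today (a sketch for illustration only, not the library's actual internals; ToyTracer and its methods are made up):

  # A simplified model of today's behavior (illustrative only)
  class ToyTracer
    def initialize
      @queue = Queue.new
      # Background worker: serialization and transmission stay off the request thread
      @worker = Thread.new do
        loop do
          finished_span = @queue.pop
          payload = Marshal.dump(finished_span) # stand-in for MessagePack encoding
          send_to_agent(payload)                # stand-in for the HTTP call to the agent
        end
      end
    end

    # Called on the request thread: handing off a finished span is just an enqueue.
    # set_tag itself is an in-memory write, but it has to happen before this point.
    def record(finished_span)
      @queue << finished_span
    end

    def send_to_agent(payload)
      # placeholder for the actual transmission to the Datadog agent
    end
  end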
All I can constructively suggest at this point is:
- Verify that recording tags is actually the problem: measure & reproduce this in a repeatable test where we can confidently describe the inputs that reproduce the poor performance. From this we can establish clear expectations (see the benchmark sketch below).
- Based on the inputs we know cause the poor performance, look for optimizations in the existing set_tag behavior.
- If that is as lean as it can be and performance is still problematic, then look for design changes that permit more performant recording.
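A minimal starting point for that measurement might look like this (a sketch only, assuming the benchmark-ips gem; the tag keys, values, and counts are placeholders for the real workload):

  require 'benchmark/ips'
  require 'ddtrace' # 'datadog' on 2.x

  Benchmark.ips do |x|
    x.report('trace only') do
      Datadog::Tracing.trace('bench.request') { }
    end

    x.report('trace + 20 tags') do
      Datadog::Tracing.trace('bench.request') do |span|
        20.times { |i| span.set_tag("bench.key#{i}", 'value') }
      end
    end

    x.compare!
  end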
I don't think we're ready to take on a larger design change to the API until we've done the due diligence above.
Beyond that, there are improvements I've been wanting to make to the API, namely separating the instrumentation from the code under test, e.g. something like:
Datadog::Tracing.trace('http.request') do |span|
  span.instrument { client.get(request) }
  span.before { span.set_tag('http.request.path', request.path) }
  span.after { |response| span.set_tag('http.response.code', response.code) }
end
This would (1) allow for more accurate measurement of the code under test and (2) make it easier to skip instrumentation when it isn't necessary. There are challenges in implementing such changes though, mostly in closures/safety and in breaking changes to the existing API. I don't think this is actionable short term until we can take more time to scope a technical design and prototype.
@sircpl, SpanOperation#set_tag has shown up before in our benchmarks as a possible point of optimization. It is a non-trivial operation that is layered with analytics.enabled logic.
Could you provide us with the tagging pattern you are using (tag keys and values, even if redacted) so we can craft a benchmark to measure it and look for optimizations for your use case? A description of the trace graph (which spans you are seeing in the relevant trace) would also help, alongside timings of the relevant spans, so we can measure the relative impact of set_tag on the code path as a whole.
@sircpl Can you clarify on this? It's really adding 10ms on average to add a tag to a span? How was this measured? Adding tags should not take this long, so I'd very much like to replicate and benchmark this.
@delner I'm afraid my description of the problem was misleading. We don't spend 10ms adding tags to a span. We spend 10ms looking up information (from the database, cache, etc.) so that it can be added to a span.
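Roughly, the expensive part looks like this (illustrative only; the model, cache key, and tag names are made up):

  # The lookups dominate; the set_tag calls themselves are cheap in-memory writes.
  account = Account.find(env['myapp.account_id'])                            # DB round trip
  plan    = Rails.cache.fetch("account/#{account.id}/plan") { account.plan } # cache/DB lookup

  span.set_tag('account.id', account.id)
  span.set_tag('account.plan', plan)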
The API changes that you suggest (separating the instrumentation from the rest) seem like they would fit the bill: they would give the developer an easy way to tag the span off the critical path.
Nothing to announce here yet, but it's interesting to see that GitHub had a similar issue: https://github.blog/2022-04-11-performance-at-github-deferring-stats-with-rack-after_reply/