dd-trace-rb
Allow Developers to Tag Traces After Response Sent To User
Is your feature request related to a problem? Please describe.
Our application spends a considerable amount of time (p50 10ms, p95 25ms) adding tags to traces on every single request. The tags are essential to our observability, but adding them slows down requests for users.
Describe the goal of the feature
The goal of this feature is to make it easy for developers to add tags to a trace after the response has been sent to the user. This could be accomplished using rack.after_reply procs (supported by Puma and Unicorn, maybe others?). We were able to do this, but it involved monkey patching the Rack request span to defer finishing:
class << span
  alias_method :original_finish, :finish

  def finish(_end_time = nil)
    # Record the time that the span is finished so we can pass it along later
    @finished_at = Datadog::Core::Utils::Time.now.utc
  end

  # Finish the span for real, using the timestamp captured above
  def deferred_finish
    original_finish(@finished_at)
  end
end
class AfterReplyMiddleware
  ...

  def call(env)
    # Register a hook that the server (Puma, Unicorn, etc.) runs after the
    # response has been written to the client
    env['rack.after_reply'] ||= []
    env['rack.after_reply'] << -> do
      span = env['datadog.rack_request_span']
      do_tagging(span, env)
      span.deferred_finish
    end

    @app.call(env)
  end
end
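For completeness, wiring it in is just a matter of adding the middleware to the stack (a sketch assuming a Rails app; it has to sit below Datadog's Rack middleware so that env['datadog.rack_request_span'] is already populated when it runs):

  # config/application.rb (sketch)
  config.middleware.use AfterReplyMiddleware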
Ideally the developer could configure a proc (e.g. for tagging) to run after the response is sent to the user, but before the span is sent to Datadog.
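For example, something along these lines (hypothetical; the after_reply: option and its signature do not exist today and are made up purely to illustrate the request):

  Datadog.configure do |c|
    # Hypothetical option: run this proc after the response has been sent,
    # but before the rack request span is flushed to the agent.
    c.tracing.instrument :rack, after_reply: lambda { |span, env|
      span.set_tag('app.account_plan', lookup_account_plan(env)) # lookup_account_plan is illustrative
    }
  end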
Describe alternatives you've considered
If the proc could execute on the background thread that writes the traces, that would also speed up response times, but it would require the developer to consider the thread safety of the proc (I'm pretty sure our tagging code is not thread-safe).
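As a sketch of what that would ask of the developer (purely illustrative; the cache, the lambda, and the lookup_plan helper are made up), any state the proc shares with request threads would need explicit synchronization:

  PLAN_CACHE = {}
  PLAN_CACHE_LOCK = Mutex.new

  # If this ran on the trace writer's background thread, shared state like the
  # cache below would have to be guarded against concurrent request threads.
  deferred_tagging = lambda do |span, env|
    plan = PLAN_CACHE_LOCK.synchronize do
      PLAN_CACHE[env['HTTP_X_ACCOUNT_ID']] ||= lookup_plan(env) # lookup_plan is illustrative
    end
    span.set_tag('account.plan', plan)
  end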
Our application spends a considerable amount of time (p50 10ms, p95 25ms) adding tags to traces on every single request.
@sircpl Can you clarify on this? It's really adding 10ms on average to add a tag to a span? How was this measured? Adding tags should not take this long, so I'd very much like to replicate and benchmark this.
Generally speaking, the instrumentation is meant to (and built to) minimize overhead on the "critical path": the user-managed thread running operations for the user's application. To accomplish this, we try to do the bare minimum to preserve the trace API functionality, and defer heavier tasks (e.g. serialization and transmission) to the background thread.
The instrumentation must collect metadata during the "critical path": some important information mutates over the life of the request, or may become entirely unavailable (drop out of scope) beyond the lifetime of the operation being measured. Here we don't have a choice.
Where we do have a choice is in how this metadata is recorded, serialized, and transmitted. However, what we're doing today is already effectively "deferral": we don't serialize or transmit these tags in the user thread. Presumably the foreground overhead comes from the act of recording, and we can't defer that (per the prior paragraph).
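To make that split concrete, here is a simplified model of where the work happens today (a sketch for illustration only, not the library's actual internals; ToyTracer and its methods are made up):

  # A simplified model of today's behavior (illustrative only)
  class ToyTracer
    def initialize
      @queue = Queue.new
      # Background worker: serialization and transmission stay off the request thread
      @worker = Thread.new do
        loop do
          finished_span = @queue.pop
          payload = Marshal.dump(finished_span) # stand-in for MessagePack encoding
          send_to_agent(payload)                # stand-in for the HTTP call to the agent
        end
      end
    end

    # Called on the request thread: handing off a finished span is just an enqueue.
    # set_tag itself is an in-memory write, but it has to happen before this point.
    def record(finished_span)
      @queue << finished_span
    end

    def send_to_agent(payload)
      # placeholder for the actual transmission to the Datadog agent
    end
  end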
All I can constructively suggest at this point is:
- Verify that recording tags is actually the problem: measure & reproduce this in a repeatable test where we can confidently describe the inputs that reproduce the poor performance. From this we can establish clear expectations (see the benchmark sketch below).
- Based on the inputs we know cause the poor performance, look for optimizations in the existing set_tag behavior.
- If that is as lean as it can be and performance is still problematic, then look for design changes that permit more performant recording.
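A minimal starting point for that measurement might look like this (a sketch only, assuming the benchmark-ips gem; the tag keys, values, and counts are placeholders for the real workload):

  require 'benchmark/ips'
  require 'ddtrace' # 'datadog' on 2.x

  Benchmark.ips do |x|
    x.report('trace only') do
      Datadog::Tracing.trace('bench.request') { }
    end

    x.report('trace + 20 tags') do
      Datadog::Tracing.trace('bench.request') do |span|
        20.times { |i| span.set_tag("bench.key#{i}", 'value') }
      end
    end

    x.compare!
  end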
I don't think we're ready to take on a larger design change to the API until we've done the due diligence above.
Beyond that, there are improvements I've been wanting to make to the API, namely separating the instrumentation from the code under test, e.g. something like:
Datadog::Tracing.trace('http.request') do |span|
  span.instrument { client.get(request) }
  span.before { span.set_tag('http.request.path', request.path) }
  span.after { |response| span.set_tag('http.response.code', response.code) }
end
This would (1) allow for more accurate measurement of the code under test and (2) make it easier to skip instrumentation when it isn't necessary. There are challenges in implementing such changes though, mostly in closures/safety and in breaking changes to the existing API. I don't think this is actionable short term until we can take more time to scope a technical design and prototype.
@sircpl, SpanOperation#set_tag has shown up before in our benchmarks as a possible point of optimization. It is a non-trivial operation that is layered with analytics.enabled logic.
Could you provide us with the tagging pattern you are using (tag keys and values, even if redacted) so we can craft a benchmark to measure it and look for optimizations for your use case? A description of the trace graph (which spans you are seeing in the relevant trace) would also help, alongside timings of the relevant spans, so we can measure the relative impact of set_tag on the code path as a whole.
@sircpl Can you clarify on this? It's really adding 10ms on average to add a tag to a span? How was this measured? Adding tags should not take this long, so I'd very much like to replicate and benchmark this.
@delner I'm afraid my description of the problem was misleading. We don't spend 10ms adding tags to a span. We spend 10ms looking up information (from the database, cache, etc.) so that it can be added to a span.
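Roughly, the expensive part looks like this (illustrative only; the model, cache key, and tag names are made up):

  # The lookups dominate; the set_tag calls themselves are cheap in-memory writes.
  account = Account.find(env['myapp.account_id'])                            # DB round trip
  plan    = Rails.cache.fetch("account/#{account.id}/plan") { account.plan } # cache/DB lookup

  span.set_tag('account.id', account.id)
  span.set_tag('account.plan', plan)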
The API changes that you suggest (separating the instrumentation from the rest) seem like they would fit the bill: they would give the developer an easy way to tag the span off the critical path.
Nothing to announce here yet, but it's interesting to see that GitHub had a similar issue: https://github.blog/2022-04-11-performance-at-github-deferring-stats-with-rack-after_reply/