
Traces not tagged with resources when upgrading from 0.54.2 -> 1.1.0 (rails)

inverse opened this issue 2 years ago · 12 comments

I upgraded from 0.54.2 to 1.1.0 following the upgrade guide https://github.com/DataDog/dd-trace-rb/blob/master/docs/UpgradeGuide.md#from-0x-to-10

[screenshot]

[screenshot]

I noticed a significant drop in the requests being traced, and traces are no longer tagged with resources but with <VERB> <STATUS> instead. Is that expected? Here is a diff of the initializer changes. Did I forget to change something?

diff --git a/config/initializers/datadog.rb b/config/initializers/datadog.rb
index 091e2e831..9d676057d 100644
--- a/config/initializers/datadog.rb
+++ b/config/initializers/datadog.rb
@@ -13,29 +13,31 @@ if Rails.env.production? && datadog_enabled?
   end
 
   Datadog.configure do |config|
-    config.logger.instance = Svc.error_only_logger
+    config.logger = Svc.error_only_logger
     config.service = '<redacted>'
     config.env = Svc.sandbox? ? 'sandbox' : 'production'
-    config.tracing.enabled = true
-    config.tags = {
-      version: Svc.revision,
-      env: Svc.sandbox? ? 'sandbox' : 'production',
-    }
+    config.tracer(
+      enabled: true,
+      tags: {
+        version: Svc.revision,
+        env: Svc.sandbox? ? 'sandbox' : 'production',
+      }
+    )
 
     config.profiling.enabled = profiling_enabled.call
 
-    config.tracing.instrument :rails, service_name: '<redacted>'
-    config.tracing.instrument :faraday, service_name: '<redacted>-faraday'
-    config.tracing.instrument :http, service_name: '<redacted>-http'
-    config.tracing.instrument :redis, service_name: '<redacted>-redis'
-    config.tracing.instrument :aws, service_name: '<redacted>-aws'
+    config.use :rails, service_name: '<redacted>'
+    config.use :faraday, service_name: '<redacted>-faraday'
+    config.use :http, service_name: '<redacted>-http'
+    config.use :redis, service_name: '<redacted>-redis'
+    config.use :aws, service_name: '<redacted>-aws'
 
-    config.tracing.instrument :rack,
+    config.use :rack,
       request_queuing: true,
       web_service_name: '<redacted>-web-server',
       service_name: '<redacted>'
 
-    config.tracing.instrument :sidekiq,
+    config.use :sidekiq,
       service_name: '<redacted>-sidekiq',
       client_service_name: '<redacted>-sidekiq-client',
       tag_args: true

inverse · May 31 '22 13:05

The configuration file is helpful, but I don't see anything particularly concerning in it (only some unrelated suggestions).

I think I need a clearer picture of what changed in the existing traces, and which traces seemingly went missing. Can you share some screenshots? In particular, I'd like to see 1) a 0.x trace vs a 1.x trace with the list of missing resources, and 2) 0.x traces that no longer appear in 1.x.

My suspicion is that it may be related to Rack being activated, although I'm not very confident about this. It's usually that instrumentation which sets things like GET 200 for a web request. c.tracing.instrument :rack shouldn't be necessary, as c.tracing.instrument :rails should do this automatically and apply the instrumentation at the appropriate time. I would try removing that and see if anything happens.
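For reference, a rough sketch of that experiment against the initializer posted above (a sketch only, not code from this thread; note that dropping the explicit :rack block also means giving up the request_queuing and web_service_name options, which only the Rack integration accepts):

# Hypothetical simplified initializer for the experiment described above.
# Rails activates the Rack instrumentation on its own, so the explicit
# :rack block is only needed for its extra options.
Datadog.configure do |config|
  config.service = '<redacted>'
  config.tracing.instrument :rails, service_name: '<redacted>'
  # config.tracing.instrument :rack, ...  # removed for the test
end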

Beyond that, it's best to focus on having some clear comparisons of the differences, and attempt to replicate the problem via some code snippet I can run & debug with.

delner · May 31 '22 14:05

Thanks for replying!

Here is a trace from before

[screenshot]

and after

[screenshot]

It's a trace that spans multiple services, calling the same endpoint on the service that got upgraded.

inverse · May 31 '22 15:05

I see that an active_support service has appeared on the "after" trace, and that definitely seems wrong. But this indicates to me that you're using Redis via ActiveSupport's cache API?

Looking at the code, it looks like ActiveSupport instrumentation will create spans around read/write/fetch/delete operations, and set service equal to Datadog.configuration.tracing[:active_support][:cache_service]. So it looks like in this case, we may want to configure ActiveSupport to set cache_service the same as your c.service name (aka "core").

c.tracing.instrument :active_support, service_name: c.service # Or "core"

You could also associate it with your Redis service instead, but I don't think I'd recommend that, given it would add the time spent in Ruby to your Redis timings (introducing inaccuracy).

In theory, this is something we should handle automatically with our instrumentation, but it's an edge case with a slightly more sophisticated solution. Hopefully the above code snippet won't be necessary down the line.

delner · May 31 '22 18:05

Giving this another look, I think a big part of the issue may be that the resource name (provided by the Rails instrumentation) is not being set on the trace appropriately. I see the resource name being set correctly on the 2nd service, but not the first.

This may be why you're seeing GET 200 instead of <controller>#<action> as you had before. Is this accurate? Can you confirm this is what you're seeing?

If so, we need to dig a bit deeper to see why the resource name didn't get set on the trace properly. This behavior did change in 1.0. Behind the scenes, the nice-looking Rails resource name is first set on the trace, then copied onto the root span before it's sent. This is supposed to replicate the behavior from 0.x. However, if there's some kind of bug in this process, it may explain why you're seeing GET 200 instead, which is the Rack default resource.
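One way to check this from the application side (a debugging sketch, not something from dd-trace-rb's docs) is to log the active trace's resource once the controller has finished, since Datadog::Tracing.active_trace exposes it:

# Hypothetical temporary hook in ApplicationController: if the Rails
# instrumentation set the resource on the trace, this should log
# "<controller>#<action>"; seeing something else would point at the copy step.
class ApplicationController < ActionController::Base
  after_action do
    Rails.logger.info("dd trace resource: #{Datadog::Tracing.active_trace&.resource.inspect}")
  end
end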

Just to be clear, if this issue is resolved, I would expect the controller name from the first half of that distributed trace (the action_controller span) to appear as the resource name for the trace, not the second action_controller span's.

Can you provide any other code, Gemfile, application structure that may help me reproduce the issue? I haven't yet been able to recreate this behavior in my Rails applications.

delner · Jun 01 '22 17:06

Looking at the code, it looks like ActiveSupport instrumentation will create spans around read/write/fetch/delete operations, and set service equal to Datadog.configuration.tracing[:active_support][:cache_service]. So it looks like in this case, we may want to configure ActiveSupport to set cache_service the same as your c.service name (aka "core").

c.tracing.instrument :active_support, service_name: c.service # Or "core"

Yeah, right now it's not being tagged with the service name, and I guess that would make these traces clearer... core is the service that invoked the other service that was upgraded.

This may be why you're seeing GET 200 instead of <controller>#<action> as you had before. Is this accurate? Can you confirm this is what you're seeing?

Correct - what's interesting is that it's not all requests that are tagged like that. Some are being tagged with model names or other classes. It looks like it's listing those sub-traces as unique traces, especially if you look at the timings.

[screenshot]

inverse · Jun 02 '22 11:06

Generally speaking, the first span in a trace determines the resource (except in certain web frameworks, where the resource name is overridden by the controller's resource name). So if you're seeing tiny traces (microseconds), those are probably fragmented traces, which would be incorrect behavior.
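If it helps to confirm that on your side, a sketch (assumes the 1.x public API; not from this thread): log the correlation identifiers per request, then compare the logged trace_id against the trace IDs shown on the suspiciously tiny traces in the UI:

# Hypothetical before_action for debugging: fragments belonging to the same
# request should all carry the trace_id logged here; fragmented pieces will
# show up in the UI under different trace IDs.
before_action do
  correlation = Datadog::Tracing.correlation
  Rails.logger.info("dd trace_id=#{correlation.trace_id} span_id=#{correlation.span_id}")
end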

We're seeing similar behavior in #2056, so there may be a common cause... something like trace context being incorrectly resolved.

Unfortunately I'm unable to replicate this behavior locally. Can you share anything that could help? If I can manage to reproduce it, I should be able to debug and fix it.

delner · Jun 03 '22 14:06

I had the same issue with missing APM data after upgrading to v1. The quick solution is to set service_name manually for each instrumented integration.

Here is the example:

    Datadog.configure do |config|
      # Global settings
      config.service = Sidekiq.server? ? 'sidekiq' : 'api'
      # ...

      config.tracing.instrument(:sidekiq, service_name: "#{config.service}-sidekiq")
      config.tracing.instrument(:que, service_name: "#{config.service}-que") # https://docs.datadoghq.com/tracing/setup_overview/setup/ruby#que

      config.tracing.instrument(:aws, service_name: "#{config.service}-aws")

      config.tracing.instrument(:concurrent_ruby, service_name: "#{config.service}-concurrent_ruby")
      config.tracing.instrument(:graphql, service_name: "#{config.service}-graphql", schemas: [Schema1, Schema2])
      # ...
    end

So it looks like it doesn't combine config.service as a prefix with the instrument's name; it puts everything in the same bucket.

olkeene · Jun 03 '22 18:06

Giving this another look, I think a big part of the issue may be that the resource name (provided by the Rails instrumentation) is not appropriately being set upon the trace. I see the resource name being set on the 2nd service correctly, but not the first.

Unfortunately I'm unable to replicate this behavior locally. Can you share anything that could help? If I can manage to, I should be able to debug and fix it.

I have the same issue. I think the problem is somehow related to config.tracing.instrument :rack with request queuing enabled.

For example, we have 2 endpoints:

  1. The first endpoint receives X-Request-Start header and resource names are missing in its traces.
  2. The second endpoint never receives X-Request-Start header and its traces look correct

Also, after removing the Rack instrumentation completely (and leaving only the Rails instrumentation), the traces of the first endpoint were fixed.
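For anyone who wants to test that hypothesis without removing the Rack integration entirely, a sketch (the service name is a placeholder, and this gives up the queuing span):

Datadog.configure do |config|
  # Keep the explicit Rack instrumentation but turn request queuing off, to see
  # whether the X-Request-Start / queue-span handling is what loses the resource.
  config.tracing.instrument :rack,
    request_queuing: false,
    service_name: '<redacted>'
end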

llxff · Jun 13 '22 13:06

I've also encountered this when upgrading from 0.x to 1.1.0.

Could it be related to this commit removing the logic that copied the resource name onto the rack request span? https://github.com/DataDog/dd-trace-rb/commit/f7b664bf196db7197010491aaa96c2174cb773fb#diff-c7c7564cc5bbf99b570001291ddb26f2f03c6804a3727a0a721a8da8658eaeefL45

Update: I can confirm that 1.0.0 is also affected.

tomasv · Jun 15 '22 12:06

I made this monkey patch that ports the old try_setting_rack_request_resource to the 1.1.0-compatible API. As a workaround, this fixes the spans for me:

# config/initializers/datadog.rb

module RackResourceNamePatch
  def start_processing(payload)
    super
    tracing_context = payload.fetch(:tracing_context)
    span = tracing_context[:dd_request_span]
    try_setting_rack_request_resource(payload, span.resource) if span
  rescue StandardError => e
    Datadog.logger.error(e.message)
  end

  def finish_processing(payload)
    tracing_context = payload.fetch(:tracing_context)
    span = tracing_context[:dd_request_span]
    try_setting_rack_request_resource(payload, span.resource) if span && !span.finished?
    super
  rescue StandardError => e
    Datadog.logger.error(e.message)
  end

  def try_setting_rack_request_resource(payload, resource)
    unless payload.fetch(:exception_controller?)
      rack_request_span = payload.fetch(:env)[Datadog::Tracing::Contrib::Rack::Ext::RACK_ENV_REQUEST_SPAN]
      rack_request_span.resource = resource.to_s if rack_request_span
    end
  end
end

Datadog::Tracing::Contrib::ActionPack::ActionController::Instrumentation.singleton_class.prepend(RackResourceNamePatch)

Datadog.configure do |c|
  # usual stuff here
end

I don't have enough knowledge of the internals to say what the correct solution would be.

tomasv · Jun 15 '22 12:06

Similar to what @llxff said, we're seeing this, and I believe it's due to our enabling the :request_queuing configuration option in the Rack instrumentation. The queuing span becomes the top-level span (if present), and its resource_name is set to [controller]#[action], while the rack.request span's resource_name is set to the fallback of [method] [status].

Interestingly, we were actually seeing the same behavior on 0.x, I just didn't notice it until doing the upgrade and paying closer attention to the resource_names.

Update: Slight correction: on 0.x, while we still saw the [method] [status] resource name set on rack.request spans, http_server.queue spans had no resource name. After 1.x, they now get the [controller]#[action] resource name.

agrobbin · Jun 15 '22 12:06

I see that active_support service has appeared on the "after" trace, and that definitely seems wrong. But this indicates to me you're using Redis via ActiveSupport's cache API?

Looking at the code, it looks like ActiveSupport instrumentation will create spans around read/write/fetch/delete operations, and set service equal to Datadog.configuration.tracing[:active_support][:cache_service]. So it looks like in this case, we may want to configure ActiveSupport to set cache_service the same as your c.service name (aka "core").

c.tracing.instrument :active_support, service_name: c.service # Or "core"

I would add that when doing this, these resources do not show up under the service (core), because they are no longer service entry spans. Should active_support create a new service, or be embedded as a component with resources in that service? Maybe you mean setting cache_service to core-cache? (service_name => cache_service)

If it's not meant to be its own service, won't the instrumentation code need an update to include Datadog::Tracing::Contrib::Analytics.set_measured calls?

jerny-lantern · Jun 17 '22 19:06

@inverse @olkeene @tomasv @agrobbin @jerny-lantern

FWIW, I ran some more tests of my own on this: the results suggest there's no data loss between the tracer and the UI. Maybe there's something going on in the instrumentation, but I think this is more likely a case of confusion around how dimensional data appears in the UI after some resource/service names changed in 1.x.

Here's more about those tests: https://github.com/DataDog/dd-trace-rb/issues/2101#issuecomment-1219991973

Also, we've since released 1.3.0, which includes a series of fixes. Regarding this resource name behavior, 1.3.0 now names the rack.request span with the Rails controller name regardless of whether request_queuing is enabled. Although this makes the resource name a bit redundant (it appears on 3 spans), we hope it helps smooth over cases like this.

Please try 1.3.0 and let me know what you think, or if you've discovered anything else new that we should know about/fix. I would like to draw this one to a conclusion. Thanks!

delner · Aug 18 '22 21:08

@delner am I right in thinking that the change you alluded to in v1.3.0 wrt the rack.request span name is #2180? If so, that makes a lot of sense, and likely resolves the issue we were experiencing! The redundancy seems appropriate for the situation.

agrobbin · Aug 19 '22 16:08

Hey folks,

I believe this one was likely fixed in 1.3.0. Since then, we've also released 1.4.1, which has another very important fix involving metrics (see #2101). I would upgrade to that latest version.

Going to close this, but if it's unresolved, please ping/re-open. Thanks!

delner · Sep 16 '22 03:09

@delner just to confirm, upgrading to 1.4.1 fixed our issues - Thanks :pray:

inverse · Sep 21 '22 14:09