dd-trace-rb
Traces not tagged with resources when upgrading from 0.54.2 -> 1.1.0 (rails)
I upgraded from 0.54.2 to 1.1.0 following the upgrade guide https://github.com/DataDog/dd-trace-rb/blob/master/docs/UpgradeGuide.md#from-0x-to-10
I noticed that there was a significant drop in the requests being traced, and that traces are no longer tagged with resources, showing `<VERB> <STATUS>` instead.
Is that expected? Here is a diff of the initializer changes. Did I forget to change something?
diff --git a/config/initializers/datadog.rb b/config/initializers/datadog.rb
index 091e2e831..9d676057d 100644
--- a/config/initializers/datadog.rb
+++ b/config/initializers/datadog.rb
@@ -13,31 +13,29 @@ if Rails.env.production? && datadog_enabled?
 end
 Datadog.configure do |config|
-  config.logger = Svc.error_only_logger
+  config.logger.instance = Svc.error_only_logger
   config.service = '<redacted>'
   config.env = Svc.sandbox? ? 'sandbox' : 'production'
-  config.tracer(
-    enabled: true,
-    tags: {
-      version: Svc.revision,
-      env: Svc.sandbox? ? 'sandbox' : 'production',
-    }
-  )
+  config.tracing.enabled = true
+  config.tags = {
+    version: Svc.revision,
+    env: Svc.sandbox? ? 'sandbox' : 'production',
+  }
   config.profiling.enabled = profiling_enabled.call
-  config.use :rails, service_name: '<redacted>'
-  config.use :faraday, service_name: '<redacted>-faraday'
-  config.use :http, service_name: '<redacted>-http'
-  config.use :redis, service_name: '<redacted>-redis'
-  config.use :aws, service_name: '<redacted>-aws'
+  config.tracing.instrument :rails, service_name: '<redacted>'
+  config.tracing.instrument :faraday, service_name: '<redacted>-faraday'
+  config.tracing.instrument :http, service_name: '<redacted>-http'
+  config.tracing.instrument :redis, service_name: '<redacted>-redis'
+  config.tracing.instrument :aws, service_name: '<redacted>-aws'
-  config.use :rack,
+  config.tracing.instrument :rack,
     request_queuing: true,
     web_service_name: '<redacted>-web-server',
     service_name: '<redacted>'
-  config.use :sidekiq,
+  config.tracing.instrument :sidekiq,
     service_name: '<redacted>-sidekiq',
     client_service_name: '<redacted>-sidekiq-client',
     tag_args: true
The configuration file is helpful, but I don't see anything particularly concerning in it (only some unrelated suggestions).
I think I need a clearer picture of what changed in the existing traces, and which traces seemingly went missing. Can you share some screenshots? In particular, I'd like to see 1) a 0.x trace vs. a 1.x trace with the list of missing resources, and 2) 0.x traces that no longer appear in 1.x.
My suspicion is that it may be related to Rack being activated, although I'm not very confident about this. It is usually the Rack instrumentation that sets things like `GET 200` for a web request. `c.tracing.instrument :rack` shouldn't be necessary, as `c.tracing.instrument :rails` should do this automatically and apply the instrumentation at the appropriate time. I would try removing that and see if anything happens.
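Roughly, what trying that would look like (a sketch only, with a placeholder service name; your other instrument lines and options, such as `request_queuing`, are omitted here):

Datadog.configure do |config|
  config.service = 'my-service' # placeholder name
  # :rails activates the Rack middleware itself, at the appropriate point in
  # the middleware stack, so the explicit :rack line is removed in this sketch.
  config.tracing.instrument :rails, service_name: 'my-service'
end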
Beyond that, it's best to focus on having some clear comparisons of the differences, and attempt to replicate the problem via some code snippet I can run & debug with.
Thanks for replying!
Here is a trace from before, and after.
It's a trace spanning multiple services, but calling the same endpoint on the service that got upgraded.
I see that an `active_support` service has appeared on the "after" trace, and that definitely seems wrong. But this indicates to me you're using Redis via ActiveSupport's cache API?
Looking at the code, it looks like the ActiveSupport instrumentation will create spans around read/write/fetch/delete operations, and set `service` equal to `Datadog.configuration.tracing[:active_support][:cache_service]`. So it looks like in this case, we may want to configure ActiveSupport to set `cache_service` the same as your `c.service` name (aka "core").
c.tracing.instrument :active_support, service_name: c.service # Or "core"
You could also associate it with your Redis service instead, but I don't think I'd recommend that, given it will add the time spent in Ruby to your Redis timings (introducing inaccuracy).
In theory, this is something we should handle automatically with our instrumentation, but it's an edge case with a slightly more sophisticated solution. Hopefully the above code snippet won't be necessary down the line.
Giving this another look, I think a big part of the issue may be that the resource name (provided by the Rails instrumentation) is not being set on the trace appropriately. I see the resource name being set on the 2nd service correctly, but not the first.
This may be why you're seeing `GET 200` instead of `<controller>#<action>` as you had before. Is this accurate? Can you confirm this is what you're seeing?
If so, we need to dig a bit deeper to see why the resource name didn't get set on the trace properly. This behavior did change in 1.0. Behind the scenes, the nice-looking Rails resource name is first set on the trace, then copied onto the root span before it's sent. This is supposed to replicate the behavior from 0.x. However, if there's some kind of bug in this process, it may explain why you're seeing `GET 200` instead, which is the Rack default resource.
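One way to see where the resource name is landing is to log every span as traces are flushed. A sketch, assuming the 1.x processing pipeline API (`Datadog::Tracing.before_flush` runs just before each trace is sent):

Datadog::Tracing.before_flush do |trace|
  # Log each span's name/resource/service to spot which span carries the
  # `GET 200` fallback instead of the Rails controller#action resource.
  trace.spans.each do |span|
    Datadog.logger.debug("span=#{span.name} resource=#{span.resource} service=#{span.service}")
  end
  trace # return the trace so it still gets flushed
end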
Just to be clear, if this issue is resolved, I would expect the controller name from the first half of that distributed trace (the first `action_controller` span) to appear as the resource name for the trace, not the second `action_controller` span.
Can you provide any other code, Gemfile, application structure that may help me reproduce the issue? I haven't yet been able to recreate this behavior in my Rails applications.
> Looking at the code, it looks like the ActiveSupport instrumentation will create spans around read/write/fetch/delete operations, and set `service` equal to `Datadog.configuration.tracing[:active_support][:cache_service]`. So it looks like in this case, we may want to configure ActiveSupport to set `cache_service` the same as your `c.service` name (aka "core").
>
> c.tracing.instrument :active_support, service_name: c.service # Or "core"
Yeah, right now it's not being tagged with the service name, and I guess that would make these traces clearer... `core` is a service that invoked this other service that was upgraded.
> This may be why you're seeing `GET 200` instead of `<controller>#<action>` as you had before. Is this accurate? Can you confirm this is what you're seeing?
Correct - what's interesting is that not all requests are tagged like that. Some are being tagged with model names or other classes. It looks like it's listing those sub-traces as unique traces(?), especially if you look at the timings.
Generally speaking, the first span in a trace determines the resource (except in certain web frameworks, where the resource name is overridden by the controller's resource name). So if you're seeing tiny traces (microseconds), those are probably fragmented traces, which would be incorrect behavior.
We're seeing similar behavior in #2056, so there may be a common cause... something like trace context being incorrectly resolved.
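One way to check for fragmentation is to confirm a single request keeps one trace ID end to end. A sketch using the tracer's correlation API (log this from anywhere inside the request, e.g. a controller action):

correlation = Datadog::Tracing.correlation
# Fragmented traces would show different trace_ids for spans that should
# all belong to the same request.
Datadog.logger.debug("trace_id=#{correlation.trace_id} span_id=#{correlation.span_id}")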
Unfortunately I'm unable to replicate this behavior locally. Can you share anything that could help? If I can manage to, I should be able to debug and fix it.
I had the same issue with missing APM data after upgrading to v1.
The quick workaround is to set `service_name` for each instrument manually. Here is an example:
Datadog.configure do |config|
  # Global settings
  config.service = Sidekiq.server? ? 'sidekiq' : 'api'
  # .....
  config.tracing.instrument(:sidekiq, service_name: "#{config.service}-sidekiq")
  config.tracing.instrument(:que, service_name: "#{config.service}-que") # https://docs.datadoghq.com/tracing/setup_overview/setup/ruby#que
  config.tracing.instrument(:aws, service_name: "#{config.service}-aws")
  config.tracing.instrument(:concurrent_ruby, service_name: "#{config.service}-concurrent_ruby")
  config.tracing.instrument(:graphql, service_name: "#{config.service}-graphql", schemas: [Schema1, Schema2])
  # .....
end
So it looks like it doesn't combine `config.service` as a prefix with the instrument's name, and instead puts everything in the same bucket.
> Giving this another look, I think a big part of the issue may be that the resource name (provided by the Rails instrumentation) is not being set on the trace appropriately. I see the resource name being set on the 2nd service correctly, but not the first.
>
> Unfortunately I'm unable to replicate this behavior locally. Can you share anything that could help? If I can manage to, I should be able to debug and fix it.
I have the same issue. I think the problem is somehow related to `config.tracing.instrument :rack` with request queuing enabled.
For example, we have 2 endpoints (a minimal reproduction sketch follows this list):
- The first endpoint receives the `X-Request-Start` header, and resource names are missing in its traces.
- The second endpoint never receives the `X-Request-Start` header, and its traces look correct.
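To make the two cases concrete, an illustration-only sketch (the URL is hypothetical, and the header value format is an assumption; `X-Request-Start` is normally injected by a fronting proxy such as nginx):

require 'net/http'

uri = URI('http://localhost:3000/example') # hypothetical endpoint

# Case 1: simulate a proxy that sets X-Request-Start (resource name goes missing).
with_header = Net::HTTP::Get.new(uri)
with_header['X-Request-Start'] = "t=#{(Time.now.to_f * 1000).to_i}" # assumed nginx-style milliseconds
Net::HTTP.start(uri.host, uri.port) { |http| http.request(with_header) }

# Case 2: the same request without the header (traces look correct).
Net::HTTP.start(uri.host, uri.port) { |http| http.request(Net::HTTP::Get.new(uri)) }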
Also, after removing the `rack` instrumentation completely (and leaving only the Rails instrumentation), traces of the first endpoint were fixed.
I've also encountered this when upgrading from 0.x to 1.1.0.
Could it be related to this commit removing the logic that copies the resource name into the rack request span? https://github.com/DataDog/dd-trace-rb/commit/f7b664bf196db7197010491aaa96c2174cb773fb#diff-c7c7564cc5bbf99b570001291ddb26f2f03c6804a3727a0a721a8da8658eaeefL45
Update: I can confirm that 1.0.0 is also affected.
I made a monkey patch that ports the old `try_setting_rack_request_resource` to the 1.1.0-compatible API. As a workaround, this fixes the spans for me:
# config/initializers/datadog.rb
module RackResourceNamePatch
  def start_processing(payload)
    super
    # Mirror the removed 0.x behavior: propagate the resource from the tracked
    # request span onto the Rack request span as processing starts.
    tracing_context = payload.fetch(:tracing_context)
    span = tracing_context[:dd_request_span]
    try_setting_rack_request_resource(payload, span.resource) if span
  rescue StandardError => e
    Datadog.logger.error(e.message)
  end

  def finish_processing(payload)
    tracing_context = payload.fetch(:tracing_context)
    span = tracing_context[:dd_request_span]
    try_setting_rack_request_resource(payload, span.resource) if span && !span.finished?
    super
  rescue StandardError => e
    Datadog.logger.error(e.message)
  end

  def try_setting_rack_request_resource(payload, resource)
    # Don't overwrite the resource when an exception controller handled the request.
    unless payload.fetch(:exception_controller?)
      rack_request_span = payload.fetch(:env)[Datadog::Tracing::Contrib::Rack::Ext::RACK_ENV_REQUEST_SPAN]
      rack_request_span.resource = resource.to_s if rack_request_span
    end
  end
end

Datadog::Tracing::Contrib::ActionPack::ActionController::Instrumentation.singleton_class.prepend(RackResourceNamePatch)

Datadog.configure do |c|
  # usual stuff here
end
I don't have enough knowledge of the internals to say what the correct solution would be.
Similar to what @llxff said, we're seeing this, and I believe it's due to our enabling of the `:request_queuing` configuration option in the Rack instrumentation. That becomes the top-level span (if present), and its `resource_name` is set to `[controller]#[action]`, while the `rack.request` span's `resource_name` is set to the fallback of `[method] [status]`.
Interestingly, we were actually seeing the same behavior on 0.x; I just didn't notice it until doing the upgrade and paying closer attention to the `resource_name`s.
Update: Slight correction: on 0.x, while we still saw the `[method] [status]` resource name set on `rack.request` spans, `http_server.queue` spans had no resource name prior to 1.x. After 1.x, they now get the `[controller]#[action]` resource name.
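To illustrate the hierarchy being described (an approximation pieced together from this thread, not verified against any specific version), the span tree with `request_queuing` enabled looks roughly like:

# http_server.queue                resource: [controller]#[action]  (top-level span; 1.x behavior)
# └── rack.request                 resource: [method] [status]      (the fallback reported in this issue)
#     └── rails.action_controller  resource: [controller]#[action]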
> I see that an `active_support` service has appeared on the "after" trace, and that definitely seems wrong. But this indicates to me you're using Redis via ActiveSupport's cache API?
>
> Looking at the code, it looks like the ActiveSupport instrumentation will create spans around read/write/fetch/delete operations, and set `service` equal to `Datadog.configuration.tracing[:active_support][:cache_service]`. So it looks like in this case, we may want to configure ActiveSupport to set `cache_service` the same as your `c.service` name (aka "core").
>
> c.tracing.instrument :active_support, service_name: c.service # Or "core"
I would add that when doing this, these resources do not show up under the service (core), because they are no longer service entry spans. Should `active_support` be creating a new service, or be embedded as a component with resources in that service? Maybe you mean setting `cache_service` to `core-cache`, and `service_name` => `cache_service`? If it's not meant to be its own service, won't the instrumentation code need an update to include `Datadog::Tracing::Contrib::Analytics.set_measured` calls?
@inverse @olkeene @tomasv @agrobbin @jerny-lantern
FWIW, I ran some more tests of my own on this: results suggest there's no data loss between tracer and UI. Maybe there's something going on in the instrumentation, but I think this is more likely a case of confusion surrounding how dimensional data appears in the UI, after some resource/service names changed in 1.x.
Here's more about those tests: https://github.com/DataDog/dd-trace-rb/issues/2101#issuecomment-1219991973
Also, we've since released 1.3.0, which had a series of fixes. In regards to this resource name behavior, 1.3.0 now names the `rack.request` span with the Rails controller name regardless of whether `request_queuing` is enabled or not. Although this makes the resource name a bit redundant (appearing on 3 spans), we hope it might help smooth over some cases like this.
Please try 1.3.0 and let me know what you think, or if you've discovered anything else new that we should know about/fix. I would like to draw this one to a conclusion. Thanks!
@delner am I right in thinking that the change you alluded to in v1.3.0 w.r.t. the `rack.request` span name is #2180? If so, that makes a lot of sense, and likely resolves the issue we were experiencing! The redundancy seems appropriate for the situation.
Hey folks,
I believe this one was likely fixed in 1.3.0. Since then, we've also released 1.4.1, which has another very important fix involving metrics (see #2101). I would upgrade to that latest version.
Going to close this, but if it's unresolved, please ping/re-open. Thanks!
@delner just to confirm, upgrading to 1.4.1 fixed our issues - Thanks :pray: