opentelemetry-ruby-contrib
opentelemetry-ruby-contrib copied to clipboard
Mysql instrumentation creates one trace per sql statement
I've added Rack, Rails and Mysql2 instrumentation to an unremarkable Rails 6.0 project.
I get the Rack and Rails instrumentation details in one trace for the get request and then each sql statement has its own trace so they can't put on the same tree as they should be. It looks like this:
Where on the left I'm printing trace ids and every trace there except
http get
has one only span with the sql statement.
I do not see any obvious reason why this should end up like this, the mysql2 instrumentation uses the same tracer method than the rack instrumentation, both inherit from the same base that gets the tracer from the same place.
Any pointers?
Hey @jcarres-mdsol, I've tried to recreate this with a new rails 6.0
project.
My opentelemetry configuration.
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/jaeger'
require 'opentelemetry-instrumentation-mysql2'
require 'opentelemetry-instrumentation-rails'
OpenTelemetry::SDK.configure do |c|
c.use_all
c.service_name = 'sql_instrumentation_demo'
c.add_span_processor(OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
exporter: OpenTelemetry::Exporter::Jaeger::AgentExporter.new
))
end
Added a users model
class CreateUsers < ActiveRecord::Migration[6.0]
def change
create_table :users do |t|
t.string :name
t.timestamps
end
end
end
I created an index action
class UsersController < ApplicationController
def index
@users = User.all
end
end
The view just loops through the users and prints out their names.
When I hit the users#index I get the following trace.
The mysql2
instrumentation is using in_span
which calls start_span, start_span
will attempt to use the parent context if it can be found using current_span
.
Is it possible that in your application the mysql statements are happening outside the context of the rack request (background job, or deferred in some other way)?
Thank you so much for the effort!
This application indeed has background workers but I think these statements happen from the rails app because I was sending http requests to the API provided by that app and the traces from that lacked the DB spans. Those requests should be hitting the DB.
We are using delayed job and tried adding that instrumentation but does not seem to change the spans we get. I've just noticed that we are logging OpenTelemetry error: invalid attribute value type NilClass
from the worker containers so something definitely fishy is going on there. Will investigate further.
Are you using the pool
attribute in your config/db.yaml? I was thinking maybe the instrumentation does not work well across threads and the activerecod thread pool could be breaking it. If you are not using it, could you retry with pool: 3
or whatever, just to see if that is the case?
This application indeed has background workers but I think these statements happen from the rails app because I was sending http requests to the API provided by that app and the traces from that lacked the DB spans. Those requests should be hitting the DB.
Yeah I expect that most typical rails apps will perform a mysql statement within the context of the request, but it was worth checking to see if you're doing something different.
I've just noticed that we are logging
OpenTelemetry error: invalid attribute value type NilClass
from the worker containers so something definitely fishy is going on there. Will investigate further.
This message gets emitted here. I recently came across this recently for rack myself, it has been fixed but has not yet been released. If you find it is coming from the instrumentation please do open an issue (or PR) so we can fix it.
I don't believe this would explain the issue in regards to the disconnected mysql spans, but it's still worth investigating.
Are you using the pool attribute in your config/db.yaml? I was thinking maybe the instrumentation does not work well across threads and the activerecord thread pool could be breaking it. If you are not using it, could you retry with pool: 3 or whatever, just to see if that is the case?
This is the db config I was using, so it appears I had the pool defaulted to 5
default: &default
adapter: mysql2
encoding: utf8mb4
pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
username: ###
password: ###
host: ###
development:
<<: *default
database: sqlinstrumentation_development
test:
<<: *default
database: sqlinstrumentation_test
I've also ran this in a production environment with a larger pool value and have not observed this issue.
I'm happy to try and recreate this with some modifications (within reason) to the demo app I've created for this if you're able to share some more details about the general configuration of your application.
Thanks for the offer. This application has background workers checking the health of other services and an API to access the results of those health checks. Everything is a different container running on AWS ECS. It is pretty much an out-of-the-mill Rails app, can not think of anything unique about it.
As a reference below the Gemfile (removed our private gems, one of which adds otel library otherwise I think they are unrelated), maybe something there catches your attention, I do not think it is worth replicating it, I'll try a smaller app and see what's going on.
source "https://rubygems.org"
gem "rails", "~> 6.0.0"
gem "bootsnap", "~> 1.3", require: false
gem "kaminari", "~> 1.1"
gem "mauth-client", "~> 6.0"
gem "mysql2", "~> 0.5.2"
gem "puma", "~> 4.0"
gem "daemons", "~> 1.3"
gem "delayed_job_active_record", "~> 4.1.1"
gem "dice_bag", "~> 1.1"
gem "net-http-persistent", "~> 4.0"
gem "rack-accept", "~> 0.4"
gem "state_machines-activerecord", "~> 0.6.0"
gem "whenever", "~> 1.0", require: false
@robertlaurin an update on this. Indeed the spans had no parents because I added the delayed job library incorrectly. Now works well.
But it has surfaced another problem which delayed job when backed up by the DB.
The mysql instrumentation picks everything so it will pickup the calls to the DB by the job manager of delayed job. Which I think the delayed job instrumentation will not add to any span because maybe is not really part of the request but also not part of the worker. For instance this kind of update:
UPDATE `delayed_jobs` SET `delayed_jobs`.`locked_at` = '2021-01-20 00:27:53', `delayed_jobs`.`locked_by` = 'delayed_job host:19f764c6bd79 pid:1' WHERE (((run_at <= '2021-01-20 00:27:53.657744' AND (locked_at IS NULL OR locked_at < '2021-01-19 20:27:53.657770')) OR locked_by = 'delayed_job host:19f764c6bd79 pid:1') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1
@jcarres-mdsol thank you for the update and sharing your findings.
This is very similar to something I'm working on in regards to sidekiq
and the sidekiq-scheduler
gem that is producing many redis spans without any parent trace. There has been some discussion around how to handle this scenario, in some contexts the application owner may want to ignore these spans, and in others not.
We briefly discussed it on yesterdays SIG meeting, and I was leaning towards having it be configurable as to whether or not we trace "background operations" like this.
I expect there to be more instances similar to this one, so I'll be bringing this topic back up again at the next sig meeting to be decide on how to best address these scenarios.
Just as a hint to others, we ended up solving this internally by creating a custom sampler that drops root spans for database operations (db.system
attribute present)
To extend on @johnduhart 's useful suggestion, here's a POC which worked for me:
class ExcludeDatabaseSampler
def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:)
tracestate = OpenTelemetry::Trace.current_span(parent_context).context.tracestate
decision = if sample?(attributes)
OpenTelemetry::SDK::Trace::Samplers::Decision::RECORD_AND_SAMPLE
else
OpenTelemetry::SDK::Trace::Samplers::Decision::DROP
end
OpenTelemetry::SDK::Trace::Samplers::Result.new(decision: decision, tracestate: tracestate)
end
private
def sample?(attributes)
!(attributes && attributes['db.system'])
end
end
OpenTelemetry::SDK.configure do |c|
c.use_all
end
sampler = ExcludeDatabaseSampler.new
OpenTelemetry.tracer_provider.sampler = OpenTelemetry::SDK::Trace::Samplers.parent_based(root: sampler)
I think in your case you are removing all db calls. We only wanted to remove DB root spans
# Implements a custom sampler that drops root DB spans.
class YourSampler
attr_reader :description
def initialize(non_db_sampler)
@description = "YourSampler"
@non_db_sampler = non_db_sampler
end
# @api private
#
# See {Samplers}.
def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:) # rubocop:disable Metrics/ParameterLists
parent_span_context = OpenTelemetry::Trace.current_span(parent_context).context
if attributes&.key?("db.system") && !parent_span_context.valid?
OpenTelemetry::SDK::Trace::Samplers::Result.new(
decision: OpenTelemetry::SDK::Trace::Samplers::Decision::DROP,
tracestate: parent_span_context.tracestate
)
else
@non_db_sampler.should_sample?(
trace_id: trace_id, parent_context: parent_context, links: links, name: name,
kind: kind, attributes: attributes
)
end
end
end
👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep
label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.