opentelemetry-ruby-contrib icon indicating copy to clipboard operation
opentelemetry-ruby-contrib copied to clipboard

Mysql instrumentation creates one trace per sql statement

Open jcarres-mdsol opened this issue 4 years ago • 9 comments

I've added Rack, Rails and Mysql2 instrumentation to an unremarkable Rails 6.0 project.

I get the Rack and Rails instrumentation details in one trace for the get request and then each sql statement has its own trace so they can't put on the same tree as they should be. It looks like this: Screen Shot 2021-01-14 at 10 27 47 PM Where on the left I'm printing trace ids and every trace there except http get has one only span with the sql statement. I do not see any obvious reason why this should end up like this, the mysql2 instrumentation uses the same tracer method than the rack instrumentation, both inherit from the same base that gets the tracer from the same place.

Any pointers?

jcarres-mdsol avatar Jan 15 '21 06:01 jcarres-mdsol

Hey @jcarres-mdsol, I've tried to recreate this with a new rails 6.0 project.

My opentelemetry configuration.

require 'opentelemetry/sdk'
require 'opentelemetry/exporter/jaeger'
require 'opentelemetry-instrumentation-mysql2'
require 'opentelemetry-instrumentation-rails'

OpenTelemetry::SDK.configure do |c|
  c.use_all
  c.service_name = 'sql_instrumentation_demo'
  c.add_span_processor(OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
    exporter: OpenTelemetry::Exporter::Jaeger::AgentExporter.new
  ))
end

Added a users model

class CreateUsers < ActiveRecord::Migration[6.0]
  def change
    create_table :users do |t|
      t.string :name
      t.timestamps
    end
  end
end

I created an index action

class UsersController < ApplicationController
  def index
    @users = User.all
  end
end

The view just loops through the users and prints out their names.

When I hit the users#index I get the following trace. image

The mysql2 instrumentation is using in_span which calls start_span, start_span will attempt to use the parent context if it can be found using current_span.

Is it possible that in your application the mysql statements are happening outside the context of the rack request (background job, or deferred in some other way)?

robertlaurin avatar Jan 15 '21 16:01 robertlaurin

Thank you so much for the effort!

This application indeed has background workers but I think these statements happen from the rails app because I was sending http requests to the API provided by that app and the traces from that lacked the DB spans. Those requests should be hitting the DB.

We are using delayed job and tried adding that instrumentation but does not seem to change the spans we get. I've just noticed that we are logging OpenTelemetry error: invalid attribute value type NilClass from the worker containers so something definitely fishy is going on there. Will investigate further.

Are you using the pool attribute in your config/db.yaml? I was thinking maybe the instrumentation does not work well across threads and the activerecod thread pool could be breaking it. If you are not using it, could you retry with pool: 3 or whatever, just to see if that is the case?

jcarres-mdsol avatar Jan 15 '21 19:01 jcarres-mdsol

This application indeed has background workers but I think these statements happen from the rails app because I was sending http requests to the API provided by that app and the traces from that lacked the DB spans. Those requests should be hitting the DB.

Yeah I expect that most typical rails apps will perform a mysql statement within the context of the request, but it was worth checking to see if you're doing something different.

I've just noticed that we are logging OpenTelemetry error: invalid attribute value type NilClass from the worker containers so something definitely fishy is going on there. Will investigate further.

This message gets emitted here. I recently came across this recently for rack myself, it has been fixed but has not yet been released. If you find it is coming from the instrumentation please do open an issue (or PR) so we can fix it.

I don't believe this would explain the issue in regards to the disconnected mysql spans, but it's still worth investigating.

Are you using the pool attribute in your config/db.yaml? I was thinking maybe the instrumentation does not work well across threads and the activerecord thread pool could be breaking it. If you are not using it, could you retry with pool: 3 or whatever, just to see if that is the case?

This is the db config I was using, so it appears I had the pool defaulted to 5

default: &default
  adapter: mysql2
  encoding: utf8mb4
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  username: ###
  password: ###
  host: ###

development:
  <<: *default
  database: sqlinstrumentation_development

test:
  <<: *default
  database: sqlinstrumentation_test

I've also ran this in a production environment with a larger pool value and have not observed this issue.

I'm happy to try and recreate this with some modifications (within reason) to the demo app I've created for this if you're able to share some more details about the general configuration of your application.

robertlaurin avatar Jan 15 '21 19:01 robertlaurin

Thanks for the offer. This application has background workers checking the health of other services and an API to access the results of those health checks. Everything is a different container running on AWS ECS. It is pretty much an out-of-the-mill Rails app, can not think of anything unique about it.

As a reference below the Gemfile (removed our private gems, one of which adds otel library otherwise I think they are unrelated), maybe something there catches your attention, I do not think it is worth replicating it, I'll try a smaller app and see what's going on.

source "https://rubygems.org"

gem "rails", "~> 6.0.0"
gem "bootsnap", "~> 1.3", require: false
gem "kaminari", "~> 1.1"
gem "mauth-client", "~> 6.0"
gem "mysql2", "~> 0.5.2"
gem "puma", "~> 4.0"
gem "daemons", "~> 1.3"
gem "delayed_job_active_record", "~> 4.1.1"
gem "dice_bag", "~> 1.1"
gem "net-http-persistent", "~> 4.0"
gem "rack-accept", "~> 0.4"
gem "state_machines-activerecord", "~> 0.6.0"
gem "whenever", "~> 1.0", require: false

jcarres-mdsol avatar Jan 15 '21 20:01 jcarres-mdsol

@robertlaurin an update on this. Indeed the spans had no parents because I added the delayed job library incorrectly. Now works well.

But it has surfaced another problem which delayed job when backed up by the DB.

The mysql instrumentation picks everything so it will pickup the calls to the DB by the job manager of delayed job. Which I think the delayed job instrumentation will not add to any span because maybe is not really part of the request but also not part of the worker. For instance this kind of update:

UPDATE `delayed_jobs` SET `delayed_jobs`.`locked_at` = '2021-01-20 00:27:53', `delayed_jobs`.`locked_by` = 'delayed_job host:19f764c6bd79 pid:1' WHERE (((run_at <= '2021-01-20 00:27:53.657744' AND (locked_at IS NULL OR locked_at < '2021-01-19 20:27:53.657770')) OR locked_by = 'delayed_job host:19f764c6bd79 pid:1') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1

jcarres-mdsol avatar Jan 20 '21 02:01 jcarres-mdsol

@jcarres-mdsol thank you for the update and sharing your findings.

This is very similar to something I'm working on in regards to sidekiq and the sidekiq-scheduler gem that is producing many redis spans without any parent trace. There has been some discussion around how to handle this scenario, in some contexts the application owner may want to ignore these spans, and in others not.

We briefly discussed it on yesterdays SIG meeting, and I was leaning towards having it be configurable as to whether or not we trace "background operations" like this.

I expect there to be more instances similar to this one, so I'll be bringing this topic back up again at the next sig meeting to be decide on how to best address these scenarios.

robertlaurin avatar Jan 20 '21 16:01 robertlaurin

Just as a hint to others, we ended up solving this internally by creating a custom sampler that drops root spans for database operations (db.system attribute present)

johnduhart avatar Feb 03 '21 01:02 johnduhart

To extend on @johnduhart 's useful suggestion, here's a POC which worked for me:

class ExcludeDatabaseSampler
  def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:)
    tracestate = OpenTelemetry::Trace.current_span(parent_context).context.tracestate
    decision = if sample?(attributes)
                 OpenTelemetry::SDK::Trace::Samplers::Decision::RECORD_AND_SAMPLE
               else
                 OpenTelemetry::SDK::Trace::Samplers::Decision::DROP
               end

    OpenTelemetry::SDK::Trace::Samplers::Result.new(decision: decision, tracestate: tracestate)
  end

  private

  def sample?(attributes)
    !(attributes && attributes['db.system'])
  end
end

OpenTelemetry::SDK.configure do |c|
  c.use_all
end

sampler = ExcludeDatabaseSampler.new
OpenTelemetry.tracer_provider.sampler = OpenTelemetry::SDK::Trace::Samplers.parent_based(root: sampler)

knarewski avatar Dec 08 '21 08:12 knarewski

I think in your case you are removing all db calls. We only wanted to remove DB root spans

  # Implements a custom sampler that drops root DB spans.
  class YourSampler
    attr_reader :description

    def initialize(non_db_sampler)
      @description = "YourSampler"
      @non_db_sampler = non_db_sampler
    end

    # @api private
    #
    # See {Samplers}.
    def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:) # rubocop:disable Metrics/ParameterLists
      parent_span_context = OpenTelemetry::Trace.current_span(parent_context).context

      if attributes&.key?("db.system") && !parent_span_context.valid?
        OpenTelemetry::SDK::Trace::Samplers::Result.new(
          decision: OpenTelemetry::SDK::Trace::Samplers::Decision::DROP,
          tracestate: parent_span_context.tracestate
        )
      else
        @non_db_sampler.should_sample?(
          trace_id: trace_id, parent_context: parent_context, links: links, name: name,
          kind: kind, attributes: attributes
        )
      end
    end
  end

jcarres-mdsol avatar Dec 10 '21 15:12 jcarres-mdsol

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

github-actions[bot] avatar Apr 27 '23 01:04 github-actions[bot]