dd-trace-rb icon indicating copy to clipboard operation
dd-trace-rb copied to clipboard

Propagate trace through Concurrent::Promises.future

Open lloeki opened this issue 3 years ago • 12 comments

Argument passing is preserved through the composited executor.

Concurrent::Future has been moved to a prepend style to resolve an infinite call loop situation.

lloeki avatar May 26 '21 11:05 lloeki

Duh, one lousy test on my part. Will fix.

lloeki avatar May 26 '21 11:05 lloeki

avoid (potential?) infinite recursion

That triggered during a rspec run, so unless the behaviour is spec-induced it had a chance of happening for real.

Having all these changes together made it quite hard for me to review the PR, as I had to separate them out for myself to understand why each change was there and why it was needed.

It may not be worth doing the separation after the fact for this PR (up to you), but I'd really like to see such a separation in the future, as it would make non-trivial changes such as this one a lot easier to review.

I agree. It turns out the order of discovering the issues and corresponding fixes was basically reversed from what would be a sensible sequence of changes for review, and by that point it wasn't quite straightforward to retcon these into non-broken individual commits, which I feel sorry for not doing.

I tried to make the code comments relevant, but I could definitely have been more explicit in the commit message/PR description though.

lloeki avatar May 27 '21 13:05 lloeki

That triggered during a rspec run, so unless the behaviour is spec-induced it had a chance of happening for real.

I'd be curious to see what happened -- was the patch applied more than once? Just to make sure we're fixing it properly with the prepend version.

I agree. It turns out the order of discovering the issues and corresponding fixes was basically reversed from what would be a sensible sequence of changes for review, and by that point it wasn't quite straightforward to retcon these into non-broken individual commits, which I feel sorry for not doing.

No worries, there's always the ideal and then there's reality. Hope you don't mind if I poke when it made my life harder, and please do the same and get revenge on my PRs as well :)

ivoanjo avatar May 27 '21 13:05 ivoanjo

I'd be curious to see what happened -- was the patch applied more than once? Just to make sure we're fixing it properly with the prepend version.

Yup, now that you mention it, the after step wasn't cleaning up correctly (now fixed), so the replaced method was the same as the replacing, and it looped. It could also happen in the now classic case of someone else using prepend on the same method, or a variety of other situations involving inheritance, for which the solution would be to use prepend through and through anyway.

Hope you don't mind if I poke when it made my life harder

No offense taken! It was definitely called for :)

lloeki avatar May 27 '21 14:05 lloeki

The JRuby 9.2.0.0 test failure looks like it is deterministic -- https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/3520/workflows/c571f62b-cf7f-44ed-bd05-2d134b54a8a8/jobs/138556/tests#failed-test-0 .

We have 9.2.0.0 because of #1409 but to be honest I'm not entirely a fan. At most, I'd suggest just skipping support for whatever is broken on those JRuby versions, and printing out a warning "did not patch concurrent ruby... your JRuby version is too old, please update".

ivoanjo avatar May 27 '21 14:05 ivoanjo

Regarding the lint failure, rebasing with master will fix it now.

marcotc avatar May 27 '21 18:05 marcotc

Rebased!

lloeki avatar May 28 '21 14:05 lloeki

Can you look into https://github.com/DataDog/dd-trace-rb/pull/1522#issuecomment-849703453 as well?

ivoanjo avatar May 28 '21 14:05 ivoanjo

Can you look into #1522 (comment) as well?

I was just typing it out ;)

I'm not entirely a fan

I wasn't aware of this PR, and I'm not either, not the least because I had to face non-conformant breakage already on JRuby. TBH I'd rather officially support only latest for each minor, given the regular stream of "oops we were not quite conformant on that not-so-corner-case" fixes that regularly land at JRuby. Not my call though.

At most, I'd suggest just skipping support for whatever is broken on those JRuby versions, and printing out a warning "did not patch concurrent ruby... your JRuby version is too old, please update".

Is there a sanctioned DSL-ish way to do this, or is this kind of thing made of hardcoded checks?

lloeki avatar May 28 '21 14:05 lloeki

I wasn't aware of this PR, and I'm not either, not the least because I had to face non-conformant breakage already on JRuby. TBH I'd rather officially support only latest for each minor, given the regular stream of "oops we were not quite conformant on that not-so-corner-case" fixes that regularly land at JRuby. Not my call though.

Pfft I'll see your issue and raise you this one 🤣

For real though, since this looks like one of those "it breaks your app" kind of deal, I think it's worth doing the small detour, since people are known to get stuck on random versions and not update them.

(But I also don't think we should spend too much time searching for new ones -- e.g. let's not add all minor versions of JRuby and MRI to CI)

Is there a sanctioned DSL-ish way to do this, or is this kind of thing made of hardcoded checks?

#1503 had a few examples that got removed recently. My specific suggestion would be to: find out which JRuby version fixed it (binary search 9.2 releases?) and then do something like

(in concurrent_ruby/integration.rb)

        def self.compatible?
          super && version >= MINIMUM_VERSION && !broken_jruby?
        end

ivoanjo avatar May 28 '21 14:05 ivoanjo

Thanks Ivo, I'll look into these examples and attempt something.

lloeki avatar May 28 '21 15:05 lloeki

@lloeki any update on this one?

ivoanjo avatar Jul 13 '22 10:07 ivoanjo

We faced with the same issue. We use the graphql-ruby with a new usage of Concurrent::Promises.future and our spans are not showing in a span list.

Stashchenko avatar Aug 01 '23 11:08 Stashchenko

We have added a monkey patch into our project.

  • ruby 3.2.2
  • ddtrace (1.14.0)
  • concurrent-ruby (1.2.2)
  • concurrent-ruby-ext (1.2.2)
# frozen_string_literal: true

require 'datadog/tracing/contrib/concurrent_ruby/configuration/settings'
require 'concurrent/executor/executor_service'

module Datadog
  module Contrib
    module ConcurrentRuby
      # This patches the Future - to wrap executor service using ContextCompositeExecutorService
      class ContextCompositeExecutorService
        extend Forwardable
        include Concurrent::ExecutorService

        attr_accessor :composited_executor

        def initialize(composited_executor)
          @composited_executor = composited_executor
        end

        def post(*, &block)
          tracer = Tracing.send(:tracer)
          parent_context = tracer.provider.context

          executor = @composited_executor.is_a?(Symbol) ? Concurrent.executor(@composited_executor) : @composited_executor

          # Pass the original arguments to the composited executor, which
          # pushes them (possibly transformed) as block args
          executor.post(*) do |*block_args|
            begin
              original_context = tracer.provider.context
              tracer.provider.context = parent_context

              # Pass the executor-provided block args as they should have been
              # originally passed without composition, see ChainPromise#on_resolvable
              yield(*block_args)
            ensure
              # Restore context in case the current thread gets reused
              tracer.provider.context = original_context
            end
          end
        end

        delegate [:can_overflow?, :serialized?] => :composited_executor
      end

      module PromisesFuturePatch
        def future_on(default_executor, *args, &task)
          unless default_executor.is_a?(ContextCompositeExecutorService)
            default_executor = ContextCompositeExecutorService.new(default_executor)
          end

          super(default_executor, *args, &task)
        end
      end

      class Integration
        include Datadog::Tracing::Contrib::Integration

        register_as :concurrent_ruby

        def self.version
          Gem.loaded_specs['concurrent-ruby'] && Gem.loaded_specs['concurrent-ruby'].version
        end

        def self.loaded?
          # Concurrent::Future is deprecated in favour of Concurrent::Promises::Future
          !defined?(::Concurrent::Promises::Future).nil? || !defined?(::Concurrent::Future).nil?
        end

        def self.compatible?
          true
        end

        def default_configuration
          Datadog::Tracing::Contrib::ConcurrentRuby::Configuration::Settings.new
        end

        def patcher
          Patcher
        end
      end

      module FuturePatch
        def ns_initialize(value, opts)
          super(value, opts)

          @executor = ContextCompositeExecutorService.new(@executor)
        end
      end

      module Patcher
        include Datadog::Tracing::Contrib::Patcher

        module_function

        def target_version
          Integration.version
        end

        def patch
          patch_future
          patch_promises_future
        end

        # Propagate tracing context in Concurrent::Future
        def patch_future
          ::Concurrent::Future.prepend(FuturePatch) if defined?(::Concurrent::Future)
        end

        # Propagate tracing context in Concurrent::Promises::Future
        def patch_promises_future
          ::Concurrent::Promises.singleton_class.prepend(PromisesFuturePatch) if defined?(::Concurrent::Promises::Future)
        end
      end
    end
  end
end

Works ok (we can see all spans): Example how to test with graphql-ruby:

   # created a Promise.future 
   Concurrent::Promises.future do
         # ex. faraday call    
   end

   # define schema lazy_resolve for Future
   lazy_resolve(Concurrent::Promises::Future, :value!)

image

Stashchenko avatar Aug 30 '23 13:08 Stashchenko

@ivoanjo could you apply this monkey patch to the official release ? We tested on a project for a while and it works nice.

Stashchenko avatar Sep 25 '23 10:09 Stashchenko

@Stashchenko do you mind opening an issue for it? I can use it to chase down a few people to look at it :)

ivoanjo avatar Sep 26 '23 13:09 ivoanjo

@Stashchenko do you mind opening an issue for it? I can use it to chase down a few people to look at it :)

We have already issue https://github.com/DataDog/dd-trace-rb/issues/1316

Stashchenko avatar Sep 26 '23 15:09 Stashchenko

I've requested the team take another look at this! Hopefully they'll be able to reply soon :)

ivoanjo avatar Sep 28 '23 12:09 ivoanjo

The JRuby 9.2.0.0 test failure looks like it is deterministic -- https://app.circleci.com/pipelines/github/DataDg/dd-trace-rb/3520/workflows/c571f62b-cf7f-44ed-bd05-2d134b54a8a8/jobs/138556/tests#failed-test-0 .

We have 9.2.0.0 because of #1409 but to be honest I'm not entirely a fan. At most, I'd suggest just skipping support for whatever is broken on those JRuby versions, and printing out a warning "did not patch concurrent ruby... your JRuby version is too old, please update".

@ivoanjo @lloeki JRuby CI is currently passing (JRuby 9.2.0 is no longer tested), I think this is good to merge?

ekump avatar Oct 12 '23 16:10 ekump

@ekump LGTM, thanks for picking it up!

lloeki avatar Oct 13 '23 06:10 lloeki

Hey @Stashchenko, this change has just been released in 1.16.0. Let us know if this addresses your issue!

marcotc avatar Nov 03 '23 21:11 marcotc

Hey @Stashchenko, this change has just been released in 1.16.0. Let us know if this addresses your issue!

Just checked, Unfortunately, doesn't work :(
With monkey patch: image

With new 1.16 release: image

In the second screenshot there is not faraday trace :(

Our setup:

# frozen_string_literal: true

require 'datadog/statsd'
require 'ddtrace'
require 'datadog/appsec'
require 'faraday'

Datadog.configure do |c|
  service_name = 'my-service'
  c.service = service_name

  c.tracing.analytics.enabled = true
  c.tracing.instrument(:concurrent_ruby)
  c.tracing.instrument(:rails)
  c.tracing.instrument(:redis, service_name: "#{service_name}-redis")
  c.tracing.instrument(:active_support, cache_service: service_name)
  c.tracing.instrument(:active_record, service_name: "#{service_name}-postgres")
  c.tracing.instrument(:http, split_by_domain: true)
  c.tracing.instrument(:faraday, service_name:)

  c.runtime_metrics.enabled = true
  c.runtime_metrics.statsd = Datadog::Statsd.new('localhost', (ENV['STATSD_PORT'] || '8120').to_i)

  c.ci.enabled = ENV['CI'].present?
  c.ci.instrument :rspec

  c.appsec.instrument :rails
end

Stashchenko avatar Nov 06 '23 08:11 Stashchenko

I tried to experiment with our monkey patch, and what I've found: I've made only changes to the ContextCompositeExecutorService#post In initialize folder: ddtrace_concurrent_ruby_patch.rb

# frozen_string_literal: true

require 'concurrent/executor/executor_service'

module Datadog
  module Tracing
    module Contrib
      module ConcurrentRuby
        # This patches the Future - to wrap executor service using ContextCompositeExecutorService
        class ContextCompositeExecutorService
          extend Forwardable
          include Concurrent::ExecutorService

          attr_accessor :composited_executor

          def initialize(composited_executor)
            @composited_executor = composited_executor
          end

          # The post method runs the task within composited executor - in a
          # different thread. The original arguments are captured to be
          # propagated to the composited executor post method
          def post(*)
            tracer = Tracing.send(:tracer)
            parent_context = tracer.provider.context
            executor = @composited_executor
            executor = Concurrent.executor(@composited_executor) if @composited_executor.is_a?(Symbol)

            # Pass the original arguments to the composited executor, which
            # pushes them (possibly transformed) as block args
            executor.post(*) do |*block_args|
              original_context = tracer.provider.context
              tracer.provider.context = parent_context

              # Pass the executor-provided block args as they should have been
              # originally passed without composition, see ChainPromise#on_resolvable
              yield(*block_args)
            ensure
              # Restore context in case the current thread gets reused
              tracer.provider.context = original_context
            end
          end

          delegate %i[can_overflow? serialized?] => :composited_executor
        end
      end
    end
  end
end

And it started working: image

Without patch (not showing): image

Looks like the issue for us is somewhere in the def post(*)method when we get the tracer. I'm not aware how it works yet maybe it will be helpful for you. Otherwise, we have somewhere incompatible versions

Stashchenko avatar Nov 06 '23 09:11 Stashchenko

👋 @Stashchenko, with the 1.16 change, you can see that there's a "floating" block of tracing (the one with a big execute.graphql on the bottom right). This is because the instrumentation of Future is now thread safe.

With new 1.16 release: image

The original patch, where tracer.provider.context is manually passed to another concurrent-ruby context is not thread-safe. It's not possible for two threads to modify the same tracer.provider.context in a sensible way. We added a test that demonstrate the issues caused by sharing a tracer.provider.context amongst different Futures (you get arbitrary threads creating spans that parent another thread's spans): https://github.com/DataDog/dd-trace-rb/pull/1522/files#diff-c44ebabaf45a6be710afd7ccb798cb12e505e0f0cbd30293c980b52cd5c12121R105-R136

The new implementation actually represents the accurate trace flow in a parallel application, and won't cause span parenting issues.

In the second screenshot there is not faraday trace :(

Now this is unexpected, given this change has nothing to do with Faraday. Could you let us know how Faraday is related to the rest of the trace you provided? It's very intriguing that a separate integration stopped working with this change.

marcotc avatar Nov 07 '23 17:11 marcotc

👋 @Stashchenko, I was not able to reproduce your case.

Could you provide me something more to further investigate with the missing faraday span

The snippet I used to mimic your behavior:

require 'ddtrace'
require 'graphql'
require 'faraday'
require 'concurrent-ruby'
require 'ostruct'
require 'pry'

class GraphQL::Types::BaseObject < GraphQL::Schema::Object
end

class PersonType < GraphQL::Types::BaseObject
  field :name, String, null: false
end

class PostType < GraphQL::Types::BaseObject
  field :id, ID, null: false
  field :author, PersonType

  def author
    Concurrent::Promises.future do
      Faraday.get 'https://www.google.com'

      OpenStruct.new(id: object.author_id, name: object.author_id)
    end.value
  end
end

class QueryType < GraphQL::Schema::Object
  field :post, PostType do
    argument :id, ID
  end

  # Then provide an implementation:
  def post(id:)
    OpenStruct.new(id: id, author_id: id)
  end
end

class Schema < GraphQL::Schema
  query QueryType

  # define schema lazy_resolve for Future
  lazy_resolve(Concurrent::Promises::Future, :value)
end

Datadog.configure do |c|
  c.service = 'Bug Reproduction'
  c.diagnostics.debug = true
  c.tracing.instrument :faraday
  c.tracing.instrument :concurrent_ruby
  c.tracing.instrument :graphql, schemas: [Schema]
end

query_string = <<-GraphqlQueryString
{
  p1: post(id: 1) { author { name } }
  p2: post(id: 2) { author { name } }
  p3: post(id: 3) { author { name } }
}
GraphqlQueryString

result_hash = Datadog::Tracing.trace("outer_span") do
  Schema.execute(query_string)
end


TonyCTHsu avatar Nov 07 '23 17:11 TonyCTHsu

Hi @TonyCTHsu , you should not get the value just in the same method call.

Concurrent::Promises.future do
      Faraday.get 'https://www.google.com'

      OpenStruct.new(id: object.author_id, name: object.author_id)
end.value # This piece 

It's probably, immediately runs the query and nothing to do for lazy resolver.

One more step to reproduce. Try to get list of products with author's name:

query_string = <<-GraphqlQueryString
{
   posts { author { name } }
}
GraphqlQueryString

That what we have in our code. One more step I don't know what exactly does this line Faraday.get 'https://www.google.com' . Does it execute faraday get or wait on real method call? (aka proxy object).

Smt like that should call the Faraday

Concurrent::Promises.future do
      response = Faraday.get 'https://api.publicapis.org/entries'
      value = JSON.parse(response.body).sample 
      OpenStruct.new(id: object.author_id, name: value['API'])
end

Stashchenko avatar Nov 07 '23 22:11 Stashchenko

Hi @marcotc ,

Could you let us know how Faraday is related to the rest of the trace you provided?

It's not Faraday related :( Basically, any span I would like to create inside the Promise.future is not showing in a trace for us :(

Stashchenko avatar Nov 08 '23 10:11 Stashchenko

Hi guys , tried to reproduce @TonyCTHsu example on a clean project.

class PersonType < BaseObject
      field :name, String, null: false
    end

  class PostType <  BaseObject
    field :author, PersonType
    field :id, ID, null: false

    def author
      Concurrent::Promises.future do
        Datadog::Tracing.trace('example.trace') do
          response = Faraday.get 'https://api.publicapis.org/entries'
          value = JSON.parse(response.body)['entries'].sample
          OpenStruct.new(id: object.author_id, name: value['API'])
        end
      end
    end
  end

  field :posts, [PostType], null: false

  def posts
    Array.new(10) { OpenStruct.new(id: SecureRandom.uuid, author_id: SecureRandom.uuid) }
  end

And result is: 🔥 image

Looks like something incompatible inside our project. Will dig deeper to compare what's the diff. Sorry for the confusion. And thanks a lot for support!

Stashchenko avatar Nov 08 '23 11:11 Stashchenko