dd-trace-rb
[BUG]: Updating to 2.14.0 led to high memory usage and termination of the karafka worker
Tracer Version(s)
2.14.0
Ruby Version(s)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Relevant Library and Version(s)
No response
Bug Report
We have a Rails app that uses the karafka gem for Kafka message consumption.
We have been using the built-in karafka tracing described here:
# Initialize the listener
dd_logger_listener = Karafka::Instrumentation::Vendors::Datadog::LoggerListener.new do |config|
  config.client = Datadog::Tracing
end
Karafka.monitor.subscribe(dd_logger_listener)
We recently upgraded our datadog tracing version to 2.14.0. We did NOT enable the karafka tracing that this provides (i.e. we did NOT add c.tracing.instrument :karafka to our datadog initializer).
However, after deploying the app we started to see very high memory usage of the container running karafka, which ultimately led to the container continuously getting killed due to out-of-memory issues. We also detected traces recorded by the worker of upwards of 90 minutes.
Once we disabled the karafka trace instrumentation shown above, memory usage returned to normal. However, we now have no traces, and we were not expecting this Datadog upgrade to be a breaking change for our application.
Is there a way we can disable any instrumentation that the datadog gem is providing, and just use the tracing packaged with the karafka gem?
Thanks
Reproduction Code
Add this to karafka.rb, without adding any karafka tracing in Datadog.configure
# Initialize the listener
dd_logger_listener = Karafka::Instrumentation::Vendors::Datadog::LoggerListener.new do |config|
  config.client = Datadog::Tracing
end
Karafka.monitor.subscribe(dd_logger_listener)
Configuration Block
Datadog.configure do |c|
  c.profiling.enabled = true
  c.service = 'service'
  c.version = ENV.fetch('GIT_SHA', 'none')
  c.env = Rails.env
  c.diagnostics.debug = false
  c.diagnostics.startup_logs.enabled = false
  c.telemetry.enabled = true
  c.tracing.log_injection = true
  c.tracing.enabled = true
  c.tracing.instrument :rails
  c.tracing.instrument :concurrent_ruby
  c.tracing.instrument :faraday, split_by_domain: true
  c.tracing.instrument :graphql, service_name: 'ruby-graphql'
  c.tracing.instrument :grpc, { distributed_tracing: true }
  c.tracing.instrument :pg, {
    comment_propagation: 'full',
  }
  c.tracing.instrument :rack
  c.tracing.instrument :ethon, split_by_domain: true
  c.logger.instance = Rails.logger
end
Error Logs
No response
Operating System
No response
How does Datadog help you?
No response
Putting the bloat issue itself aside, it seems, based on what @wahlg told me, that the tracing patches are applied even when Datadog tracing is not in use (and only the internal karafka DD tracing is):
def inspect_class(klass)
  puts "Class: #{klass}"

  # Get included modules
  puts "\nIncluded modules:"
  klass.included_modules.each { |m| puts "  #{m}" }

  # Get ancestors (includes superclasses and included modules)
  puts "\nAncestors (superclasses and included modules):"
  klass.ancestors.each { |a| puts "  #{a}" }

  # Get the singleton class (eigenclass)
  singleton = klass.singleton_class

  # Get modules extended onto the class (included in its singleton class)
  puts "\nExtended modules (included in singleton class):"
  singleton.included_modules.each { |m| puts "  #{m}" unless m == Class }

  # For prepended modules, we need to check the ancestors
  puts "\nPrepended modules (appear before the class in ancestors):"
  prepended = []
  klass.ancestors.each_with_index do |ancestor, i|
    if ancestor == klass && i > 0
      # All ancestors before this index are prepended
      prepended = klass.ancestors[0...i]
      break
    end
  end
  prepended.each { |m| puts "  #{m}" if m.is_a?(Module) && !m.is_a?(Class) }
end

inspect_class(Karafka::Messages::Messages)
inspect_class(Karafka::Instrumentation::Monitor)
inspect_class(Karafka::Messages::Messages)
inspect_class(Karafka::Instrumentation::Monitor)
results in:
Class: Karafka::Messages::Messages
Included modules:
Karafka::Pro::Cleaner::Messages::Messages
Datadog::Tracing::Contrib::Karafka::MessagesPatch
Enumerable
ActiveSupport::Dependencies::RequireDependency
PP::ObjectMixin
ActiveSupport::ToJsonWithActiveSupportEncoder
ActiveSupport::Tryable
JSON::Ext::Generator::GeneratorMethods::Object
Kernel
Ancestors (superclasses and included modules):
Karafka::Pro::Cleaner::Messages::Messages
Datadog::Tracing::Contrib::Karafka::MessagesPatch
Karafka::Messages::Messages
Enumerable
ActiveSupport::Dependencies::RequireDependency
Object
PP::ObjectMixin
ActiveSupport::ToJsonWithActiveSupportEncoder
ActiveSupport::Tryable
JSON::Ext::Generator::GeneratorMethods::Object
Kernel
BasicObject
Extended modules (included in singleton class):
ActiveSupport::DescendantsTracker::ReloadedClassesFiltering
Zeitwerk::ConstAdded
Module::Concerning
ActiveSupport::Dependencies::RequireDependency
PP::ObjectMixin
ActiveSupport::ToJsonWithActiveSupportEncoder
ActiveSupport::Tryable
JSON::Ext::Generator::GeneratorMethods::Object
Kernel
Prepended modules (appear before the class in ancestors):
Karafka::Pro::Cleaner::Messages::Messages
Datadog::Tracing::Contrib::Karafka::MessagesPatch
Class: Karafka::Instrumentation::Monitor
Included modules:
Datadog::Tracing::Contrib::Karafka::Monitor
ActiveSupport::Dependencies::RequireDependency
PP::ObjectMixin
ActiveSupport::ToJsonWithActiveSupportEncoder
ActiveSupport::Tryable
JSON::Ext::Generator::GeneratorMethods::Object
Kernel
Ancestors (superclasses and included modules):
Datadog::Tracing::Contrib::Karafka::Monitor
Karafka::Instrumentation::Monitor
Karafka::Core::Monitoring::Monitor
ActiveSupport::Dependencies::RequireDependency
Object
PP::ObjectMixin
ActiveSupport::ToJsonWithActiveSupportEncoder
ActiveSupport::Tryable
JSON::Ext::Generator::GeneratorMethods::Object
Kernel
BasicObject
Extended modules (included in singleton class):
ActiveSupport::DescendantsTracker::ReloadedClassesFiltering
Zeitwerk::ConstAdded
Module::Concerning
ActiveSupport::Dependencies::RequireDependency
PP::ObjectMixin
ActiveSupport::ToJsonWithActiveSupportEncoder
ActiveSupport::Tryable
JSON::Ext::Generator::GeneratorMethods::Object
Kernel
Prepended modules (appear before the class in ancestors):
Datadog::Tracing::Contrib::Karafka::Monitor
with both Datadog::Tracing::Contrib::Karafka::MessagesPatch and Datadog::Tracing::Contrib::Karafka::Monitor being present.
Hey @wahlg thanks for reporting this.
I would like to ask you about the auto-instrumentation: do you have something like this in your Gemfile
gem 'datadog', require: 'datadog/auto_instrument'
or just in the code
require 'datadog/auto_instrument'
Hi @Strech this is the current entry in the Gemfile:
gem 'datadog', '~> 2.0', require: 'datadog/auto_instrument'
I added the configuration described in this issue to the test app in https://github.com/p-datadog/karafka-test and I do see what appears to be constant growth in memory consumed by the server process, however:
Datadog.configure do |c|
  c.profiling.enabled = true
  c.service = 'service'
  c.version = ENV.fetch('GIT_SHA', 'none')
  c.env = Rails.env
  c.diagnostics.debug = false
  c.diagnostics.startup_logs.enabled = false
  c.telemetry.enabled = true
  c.tracing.log_injection = true
  c.tracing.enabled = true
  c.tracing.instrument :rails
  c.tracing.instrument :concurrent_ruby
  c.tracing.instrument :faraday, split_by_domain: true
  c.tracing.instrument :graphql, service_name: 'ruby-graphql'
  c.tracing.instrument :grpc, { distributed_tracing: true }
  c.tracing.instrument :pg, {
    comment_propagation: 'full',
  }
  c.tracing.instrument :rack
  c.tracing.instrument :ethon, split_by_domain: true
  c.logger.instance = Rails.logger
end
That turns on a lot of functionality, and if I, for example, remove profiling from this list, the growth in process memory size appears to slow down.
With the upstream instrumentation removed I think there is still continuous growth in process memory size but it is very slow, possibly just due to heap fragmentation?
With the upstream instrumentation removed I think there is still continuous growth in process memory size but it is very slow, possibly just due to heap fragmentation?
Karafka has no known memory leaks, and I've been running 20-30k msg/s processing to debug other stuff lately. Heap fragmentation may occur, but it is very slow and stabilizes after a while (depending on the nature of processing, payloads, etc).
That said, as far as I can tell, things are much worse with the updated (patched?) dd gem.
My recent run of 10-20k msg/s increases memory usage by 9.5MB over a period of 5 hours.
P.S. By no means am I saying this is not Karafka-related, of course, and if it is, I will fix it ASAP.
I do not work on the tracing part of the library and could be wrong, but perhaps the issue is this: https://github.com/DataDog/dd-trace-rb/blob/master/lib/datadog/tracing/contrib/karafka/monitor.rb#L21 uses the block form of trace, which I assume is guaranteed to clean up (finish) every span that is started. However, https://github.com/karafka/karafka/blob/master/lib/karafka/instrumentation/vendors/datadog/logger_listener.rb#L51 uses the non-block form, and https://github.com/karafka/karafka/blob/master/lib/karafka/instrumentation/vendors/datadog/logger_listener.rb#L79 may or may not finish the trace that is started on line 51. If both instrumentations are active (which I guess is expected when auto_instrument is required), perhaps the non-block span gets orphaned, which would then cause a memory leak.
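The hypothesized leak pattern can be illustrated with a toy tracer (a sketch of the general pattern only, not dd-trace-rb's actual internals): the block form finishes the span in an ensure clause, while the non-block form leaves the caller responsible for finishing it, so any skipped finish call leaves the span alive.

```ruby
# Illustrative sketch (NOT dd-trace-rb's real implementation): why the block
# form of trace is safe while the non-block form can leak open spans.
class ToyTracer
  attr_reader :open_spans

  def initialize
    @open_spans = []
  end

  # Non-block form: the caller is responsible for calling #finish.
  # Block form: the span is always finished, even if the block raises.
  def trace(name)
    span = { name: name, finished: false }
    @open_spans << span
    return span unless block_given?

    begin
      yield span
    ensure
      finish(span)
    end
  end

  def finish(span)
    span[:finished] = true
    @open_spans.delete(span)
  end
end

tracer = ToyTracer.new
tracer.trace('block-form') { }          # finished automatically via ensure
span = tracer.trace('non-block-form')   # stays open until finish is called
# If an error path skips tracer.finish(span), the span is never reclaimed:
tracer.open_spans.size # => 1
```

With two instrumentations active, the one using the non-block form is the one at risk of accumulating unfinished spans in exactly this way.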
For some more context, the task where we are running karafka has 2GB of available memory. Prior to upgrading the datadog gem, when running karafka in production with all the other datadog instrumentation enabled (profiling, tracing of other dependencies, etc) the memory usage hovers around 40%.
On the datadog gem version 2.14.0, if I enable karafka tracing through karafka.rb, memory grows linearly from the moment the task starts until it reaches 100% and is ultimately killed due to out-of-memory.
gem 'datadog', '~> 2.0', require: 'datadog/auto_instrument'
This requests that datadog instrument everything it knows how to instrument, which includes karafka.
If you do not wish to instrument everything by default, you can remove the auto_instrument require and configure integrations manually via Datadog.configure.
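A sketch of what that manual setup could look like (integration names are taken from the configuration block earlier in this issue; adjust to your app):

```ruby
# Gemfile — plain require, no auto_instrument:
#   gem 'datadog', '~> 2.0'

# Initializer: enable only the integrations you actually want.
# :karafka is simply omitted, so dd-trace-rb's karafka contrib never patches.
Datadog.configure do |c|
  c.tracing.enabled = true
  c.tracing.instrument :rails
  c.tracing.instrument :faraday, split_by_domain: true
  c.tracing.instrument :pg, comment_propagation: 'full'
end
```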
I ran the above code inspecting included modules and it does not show Datadog::* with
ruby -Ilib -rbundler -rkarafka -rdatadog test.rb
And shows the output provided above with Datadog modules added to Karafka with
ruby -Ilib -rbundler -rkarafka -rdatadog/auto_instrument test.rb
(The bundler require is needed due to https://github.com/karafka/karafka/issues/2590 and I had to manually bypass the Rails-delayed activation with
diff --git a/lib/datadog/tracing/contrib/auto_instrument.rb b/lib/datadog/tracing/contrib/auto_instrument.rb
index 23988c2e2f..76a4e74412 100644
--- a/lib/datadog/tracing/contrib/auto_instrument.rb
+++ b/lib/datadog/tracing/contrib/auto_instrument.rb
@@ -12,7 +12,7 @@ module Datadog
       require_relative '../../core/contrib/rails/utils'

       # Defer to Rails if this is a Rails application
-      if Datadog::Core::Contrib::Rails::Utils.railtie_supported?
+      if Datadog::Core::Contrib::Rails::Utils.railtie_supported? && false
         require_relative 'rails/auto_instrument_railtie'
       else
         AutoInstrument.patch_all!
since it looks like karafka is bringing in Rails.)
@p-datadog that makes sense. Is there a way to enable auto-instrumentation but exclude specific items? Or is there a way to list everything that is enabled via auto-instrumentation that might not be explicitly configured in the datadog initializer? I just want to make sure I don't inadvertently disable something that was previously enabled.
Currently with the datadog auto-instrumentation enabled, and the karafka.rb config for datadog traces disabled, we are seeing normal memory usage and traces being recorded for karafka consumer processes. So I think we are in good shape now.
Thanks all for the help on this issue
@p-datadog in that case, WDYT about helping me figure out how to detect that the DD Karafka instrumentation is in use, so I can issue a warning to users when they use both?
@mensfeld You can use the Karafka patcher interface to check whether it was patched or not:
Datadog::Tracing::Contrib::Karafka::Patcher.patched? # => true/false
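A warning check along those lines could look like this (a sketch; the defined? guard is my addition so the check is a safe no-op when the datadog gem or its karafka contrib is not loaded):

```ruby
# Sketch: warn when the dd-trace-rb :karafka contrib has patched Karafka,
# which would conflict with the Karafka-native LoggerListener instrumentation.
# Returns true when a warning was emitted, false otherwise.
def warn_if_double_karafka_instrumentation
  # defined? on the nested constant returns nil (not an error) when the
  # datadog gem is absent, so this is safe to call unconditionally.
  dd_patched = defined?(Datadog::Tracing::Contrib::Karafka::Patcher) &&
               Datadog::Tracing::Contrib::Karafka::Patcher.patched?
  return false unless dd_patched

  warn 'Both the Karafka-native Datadog instrumentation and the dd-trace-rb ' \
       ':karafka integration appear to be active; pick one to avoid ' \
       'duplicated or leaked spans.'
  true
end
```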
@wahlg We already have some discussion around auto-instrument and configuration compatibility. But technically you should not experience high memory usage in the first place, no matter how you instrument. I think we will launch an investigation into this contrib (but I can't guarantee its priority).
Glad that the issue is resolved, but I would not close the ticket, so we can keep tracking it.
@mensfeld What about storing the span created in https://github.com/karafka/karafka/blob/master/lib/karafka/instrumentation/vendors/datadog/logger_listener.rb#L51 in thread/fiber-local storage so that it can be finished later? Would that work?
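One possible shape for that suggestion, sketched with a generic span object rather than the actual LoggerListener code (note that Thread.current[] keys in Ruby are in fact fiber-local, which matches the thread/fiber-local storage idea):

```ruby
# Sketch: stash the span started when work begins in fiber-local storage so
# the matching completion (or error) handler can reliably finish it later.
SPAN_KEY = :karafka_dd_active_span

def store_active_span(span)
  # Thread.current[] is fiber-local storage in Ruby.
  Thread.current[SPAN_KEY] = span
end

def finish_active_span
  span = Thread.current[SPAN_KEY]
  return unless span

  span.finish
  Thread.current[SPAN_KEY] = nil
end
```

Calling finish_active_span from both the success and error paths would guarantee the stored span is finished exactly once, instead of relying on the two events pairing up.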
FYI I added docs to ensure that users do not use the karafka native instrumentation alongside the DD-provided one. I will add patch detection as well, as users should anyhow pick one and not mix them.
Ah @p-datadog, to answer your question: I would not mix those capabilities (Karafka native vs DD native) and would rather raise an error/notice for people doing that than try to glue them together. I see a lot of value in having them independent, as I do not control DD release cycles and you do not control mine ;)
I see a lot of value in having them independent, as I do not control DD release cycles and you do not control mine ;)
...+ often folks lag... "a bit"... on upgrading the datadog gem ;)
Thanks for looking into this @mensfeld 👍 🙇