Memory leak and memory bloat in resque processes when profiling is enabled (suspected to impact at least 0.54.2 and above)

Open jorgemanrubia opened this issue 3 years ago • 23 comments

When we upgraded to version 1.0 our resque processes started to leak memory and, sporadically, we registered some sudden multi-GB memory bloats:

CleanShot 2022-05-25 at 12 53 20@2x

I disabled the resque and active_job instrumentation integrations, but it was still leaking, so I'm not really sure it's related to those specifically. The processes leaking memory were definitely the resque workers. Once we reverted to the previous gem version, things went back to normal.

CleanShot 2022-05-25 at 12 55 37@2x

This was in a beta server that had barely any activity.

jorgemanrubia avatar May 25 '22 11:05 jorgemanrubia

Thanks for the report. Is there anything you can share to help us replicate this? (Config, Gemfile etc?)

delner avatar May 25 '22 13:05 delner

@delner 👍 I’ll prepare something I can share here.

jorgemanrubia avatar May 25 '22 14:05 jorgemanrubia

I found something interesting when trying to come up with a minimal setup to share: what causes the leak is to enable profiling. So I can reproduce with this setup:

Datadog.configure do |config|
  config.service = "bc3"
  config.env = Rails.env

  config.tracing.instrument :rails
  config.tracing.instrument :resque

  config.profiling.enabled = true
  config.tracing.enabled = true
end

But the problem gets solved if I remove this line:

config.profiling.enabled = true
CleanShot 2022-05-26 at 11 27 07@2x

As mentioned, we don't have the same problem with v0.54.2 with profiling enabled.

I can't share our full Gemfile but here are some relevant gems and versions we use related to Rails, resque, elasticsearch and AWS.

jorgemanrubia avatar May 26 '22 09:05 jorgemanrubia

Hey 👋 . I work on the profiler bits so it seemed relevant to step in.

Between 0.54.2 and 1.0.0 there were very few changes to the profiler, so perhaps that will make it easier to track down. 😅

~~Could you try adding config.profiling.advanced.code_provenance_enabled = false to your Datadog.configure block?~~ (Update, late 2022: This was actually not the source of the issue, it was a "red herring")

That will disable a new feature we added in 1.0.0 (#1813), so we can check if that's the culprit.
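
For reference, dropping that into the configure block you shared above would look something like this (a sketch; everything except the last setting is from your snippet):

Datadog.configure do |config|
  config.tracing.instrument :rails
  config.tracing.instrument :resque

  config.tracing.enabled = true
  config.profiling.enabled = true
  # New: disables the code provenance feature added in 1.0.0
  config.profiling.advanced.code_provenance_enabled = false
end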

ivoanjo avatar May 26 '22 09:05 ivoanjo

Hey @ivoanjo, it totally looks like that's it. After setting config.profiling.advanced.code_provenance_enabled = false, the memory leak problem seems solved. I'll leave this running for some time to be sure and get back to you, but I'm almost sure that's it based on the first few minutes 👍.

jorgemanrubia avatar May 26 '22 10:05 jorgemanrubia

Confirmed it's that feature @ivoanjo:

CleanShot 2022-05-26 at 12 44 57@2x

jorgemanrubia avatar May 26 '22 10:05 jorgemanrubia

Cool, glad we could narrow this down so quickly. This will help a lot in getting out a fix. Thanks again for the report and your help @jorgemanrubia!

delner avatar May 26 '22 13:05 delner

+1 thanks @jorgemanrubia for the fast feedback on the experiment! It's good that we seem to have found the culprit, and hopefully this unblocks you to run dd-trace-rb 1.0.0 (and 1.1.0, just released!).

I'd like to investigate this further. Could I ask you to open a ticket with support via https://docs.datadoghq.com/help/ linking to this issue and include the following (if possible):

  • The Ruby version you're using (ruby -v)
  • The rubygems and bundler version you're using (bundle -v and bundle exec gem -v)
  • If possible a link to the affected service's profile

ivoanjo avatar May 26 '22 13:05 ivoanjo

Yep - production affecting incident for us too.

rgrey avatar May 26 '22 15:05 rgrey

Could I ask you to open a ticket with support via https://docs.datadoghq.com/help/ (...)

@ivoanjo I just opened it (#821036)

jorgemanrubia avatar May 26 '22 15:05 jorgemanrubia

@rgrey thanks for the report. Are you seeing this issue in your resque workers as well, or is it unrelated to resque?

~~Edit: And does adding config.profiling.advanced.code_provenance_enabled = false to your Datadog.configure block solve the issue?~~ (Update, late 2022: This was actually not the source of the issue, it was a "red herring")

ivoanjo avatar May 26 '22 16:05 ivoanjo

@rgrey thanks for the report. Are you seeing this issue in your resque workers as well, or is it unrelated to resque?

Edit: And does adding config.profiling.advanced.code_provenance_enabled = false to your Datadog.configure block solve the issue?

Currently in the process of checking whether this workaround works. We don't use resque. Our rake tasks are failing, but those with more experience of our codebase are looking further to try to understand/isolate.

rgrey avatar May 27 '22 09:05 rgrey

Thanks for the extra info @rgrey. I'm investigating and attempting to reproduce the issue, but have not had luck so far. Any information you can share is very helpful.

I would also like to know if for you the issue also showed up on a 0.54 to 1.0/1.1 migration.

If possible, I'll also request that you open a ticket with support and mention this github issue, so we can look into your profiles in more detail.

ivoanjo avatar May 27 '22 11:05 ivoanjo

Thanks for the extra info @rgrey. I'm investigating and attempting to reproduce the issue, but have not had luck so far. Any information you can share is very helpful.

I would also like to know if for you the issue also showed up on a 0.54 to 1.0/1.1 migration.

If possible, I'll also request that you open a ticket with support and mention this github issue, so we can look into your profiles in more detail.

We’ve reproduced the memory issue on a staging console running an internal rake task with ddtrace 1.0.0. Memory usage slowly increased and it died after half an hour or so. It still failed after adding config.profiling.advanced.code_provenance_enabled = false.

Also noticed that ddtrace 1.0.0 adds a dependency on libddwaf gem. The changelog describes all releases since 2020 as “unstable”, which doesn’t really inspire confidence.

A new rabbit hole on 1.1.0:

WARN -- ddtrace: [ddtrace] Profiling was requested but is not supported, profiling disabled: Your ddtrace installation is missing support for the Continuous Profiler because there was a problem in setting up the libddprof dependency. For help solving this issue, please contact Datadog support at https://docs.datadoghq.com/help/.

I can’t see any documentation about this; it just tells you to contact support. Memory usage is growing again as well, so it looks like the issue isn’t fixed. Removing the profiler doesn't help. Still leaking.

However, having removed the rake integration from config/initializers/datadog.rb:

    # Having rake enabled for the job workers means that datadog will not report job traces
    # because it has an open trace for the jobs:work rake task and it won't report until that is
    # closed by shutting down the job worker.
    c.tracing.instrument :rake,
      service_name: "freeagent-rake",
      enabled: !Rake.application.top_level_tasks.include?("jobs:work"),
      analytics_enabled: nil

This does seem to fix the memory issue, but of course, it isn't the solution!

rgrey avatar Jun 06 '22 11:06 rgrey

We’ve reproduced the memory issue on a staging console running an internal rake task with ddtrace 1.0.0. Memory usage slowly increased and it died after half an hour or so.

Great, having a way to reproduce the issue really helps.

To clarify:

  1. Is the task you mentioned using rake to trigger the processing of resque tasks?
  2. When the failure occurred, was it during a single very long-running task, or was the Ruby app working on multiple smaller tasks for the half-hour period you mention?

It still failed after adding config.profiling.advanced.code_provenance_enabled = false. Removing the profiler doesn't help. Still leaking.

Ok, this is quite interesting! This makes me think that your issue, @rgrey, may not be the same one as @jorgemanrubia's, since disabling profiling seemed to help his setup.

A new rabbit hole on 1.1.0: WARN -- ddtrace: [ddtrace] Profiling was requested but is not supported, profiling disabled: (...)

Interesting! This will not contribute to the memory leak, but is annoying since it blocks you from using the profiler. I've created a separate issue to track and investigate this: https://github.com/DataDog/dd-trace-rb/issues/2068

Also noticed that ddtrace 1.0.0 adds a dependency on libddwaf gem. The changelog describes all releases since 2020 as “unstable”, which doesn’t really inspire confidence.

I believe still having libddwaf marked as unstable is an oversight. I'll discuss this with my colleagues and I'll get back to you on this.

Note that this dependency is lazy loaded and thus not even required unless the security-related features of ddtrace are turned on.

ivoanjo avatar Jun 06 '22 14:06 ivoanjo

Thanks @ivoanjo - will confirm details with an engineer tomorrow re tasks, but no resque for us.

We will be looking at the ASM component, as we are heavily dependent on Sqreen and need to maintain continuity of protection by migrating to ASM while keeping Sqreen in play until 31 Mar 2023.

rgrey avatar Jun 06 '22 15:06 rgrey

@rgrey Looking at the above, my first serious suspect is the Rake instrumentation.

If a long-running Rake task is instrumented (especially one that starts a worker process), it can incorrectly create one trace for the entire application. This is because traces are thread-bound: there's up to one per thread, and they only release memory after the original span finishes.

If a Rake span opens on a Rake task that runs indefinitely, and the main thread of that Rake task continuously enters/exits instrumented code, there's a high likelihood a trace will grow indefinitely until the max span limit is reached (100K spans), thereby leaking memory.
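
To make that concrete, here's a simplified sketch of the pattern (the span names are illustrative, not your actual instrumentation):

# The root span for the long-running task never finishes, so every child
# span created on the same thread is retained in the same in-memory trace.
Datadog::Tracing.trace('rake.invoke', resource: 'jobs:work') do
  loop do
    Datadog::Tracing.trace('worker.iteration') do
      # instrumented work happens here
    end
    # child spans accumulate until the 100K span cap is reached
  end
end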

The best workaround is to disable Rake instrumentation on any long-running Rake tasks (I see you had something like this in your configuration). We do have some mechanisms that could be employed to manage trace context or trace individual tasks, but I wouldn't recommend them until I understand whether they would address your issue.

@rgrey Is there a simple scenario I can run on my side to reproduce this? Will help give me an idea on what to recommend or how we could change behavior. Thanks!

delner avatar Jun 06 '22 18:06 delner

Also noticed that ddtrace 1.0.0 adds a dependency on libddwaf gem. The changelog describes all releases since 2020 as “unstable”, which doesn’t really inspire confidence.

I believe still having libddwaf marked as unstable is an oversight. I'll discuss this with my colleagues and I'll get back to you on this.

libddwaf's "unstable" marker here means the API may change and have breaking changes on minor versions, while its codebase and resulting binaries are definitely production-ready. To fully clarify, libddwaf is a heir to libsqreen, the latter which was semantically versioned using 0.x. To mark the break between libsqreen and libddwaf (which involved a lot of renaming and changes), we decided to bump the major version, but we needed some time nonetheless to stabilise the public API. So In essence libddwaf's 1.x is operating following semver's usual 0.x (where minor means "breaking change" and patch means "bugfix").

Since libddwaf should not be used directly and is wrapped by bindings such as libddwaf-rb, any such low-level C API change is handled by Datadog internally and isolated by the higher-level binding code, which aims to provide a much more stable, Ruby-oriented API. In any case, the libddwaf-rb dependency is consumed directly by the ddtrace gem, and should there be a breaking change in libddwaf-rb's API we would handle it as gracefully as technically possible at the ddtrace level, using ddtrace's gemspec dependency version constraints so that only compatible versions of libddwaf-rb are picked up.
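
As a purely hypothetical illustration (the gem name in the constraint and the version numbers are made up for the example), a pessimistic gemspec constraint lets bugfix patches through while excluding minors that could be breaking under this scheme:

Gem::Specification.new do |spec|
  spec.name    = 'example-gem'   # hypothetical gem declaring the dependency
  spec.version = '0.1.0'
  # '~> 1.3.0' allows 1.3.x bugfix releases but excludes 1.4.0, which under
  # the versioning scheme described above could contain breaking API changes.
  spec.add_dependency 'libddwaf', '~> 1.3.0'
end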

For additional clarity, any pre-release grade version of libddwaf-rb would be marked using pre-release gem version components such as .beta1.

I will create a PR to propose some changes to libddwaf's README to clarify the versioning scheme and API stability expectations.

I hope that makes things clearer!

lloeki avatar Jun 07 '22 14:06 lloeki

I've been running some more tests and found some interesting things:

  • I confirm that the problem happens when profiling is enabled.
  • The problem happens when instrumenting resque. I thought that instrumenting rake could be a factor, because resque-pool uses a rake task to spawn the main process, resulting in the long-living process @delner mentioned above. But I can reproduce the problem even with rake monitoring disabled.
  • I can reproduce the problem with the latest gem version AND also with 0.54.2.
  • With 1.1.0, I can reproduce the problem with and without code_provenance enabled, as long as profiling is enabled.

Sorry I initially associated the problem with upgrading to version 1.0 of the gem; I have updated the title to remove that. I think I observed inconsistent results because the problem is associated with exercising jobs in the queue. This time I used a script to keep enqueuing jobs throughout all my tests.

CleanShot 2022-06-10 at 12 20 24@2x

If a long-running Rake task is instrumented (especially one that starts a worker process), it can incorrectly create one trace for the entire application. This is because traces are thread-bound: there's up to one per thread, and they only release memory after the original span finishes.

In production, with 1.1.0 and without profiling, we have registered these long-running traces associated with the resque-pool rake task. We haven't registered these clear memory leaks, but we have observed large memory bloats. We've removed rake instrumentation to see if that helps with those:

CleanShot 2022-06-10 at 12 24 48@2x

Even if there is a technical explanation, I think Datadog should prevent the problem in these long-running processes. Or, if not-too-long-lived processes are required for ddtrace not to leak memory, that should be documented and actively warned about, IMO.

Thanks for the help with this one.

jorgemanrubia avatar Jun 10 '22 10:06 jorgemanrubia

I would like to confirm: is the metric shown on the graph you shared the total memory usage across all resque processes/workers, or just for a single one?

Since you were able to confirm that the issue is not a regression, but something that is triggered on 0.54.2 as well, the next step would be to understand what exactly is getting leaked.

Unfortunately dd-trace-rb can not (yet!) help there since our profiler only supports CPU and Wall-clock data, and does not have any heap profiling capabilities (yet!).

Would you be able to share the analysis results of a Ruby heap dump of one such process? The heap-profiler gem can be used as a nice shorthand for dumping the heap and showing a report explaining where the memory is being used.

(One other alternative is the newly-released ruby_memprofiler_pprof gem which produces a pprof file tracking where the objects being retained were allocated).
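
If it helps, a heap-profiler run could look roughly like this (a sketch based on the gem's README-style usage; the directory path and the workload call are placeholders, so double-check against its documentation):

require 'heap-profiler'

# Wraps the suspect workload and writes heap/allocation dumps into the
# given directory for later analysis.
HeapProfiler.report('tmp/heap_report') do
  process_some_resque_jobs # placeholder for the leaking workload
end

Afterwards, running heap-profiler tmp/heap_report from the command line should print the retained-memory breakdown.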

I'm also available to jump on a live call so we can debug this together, if you're up for it.

Note that we don't recommend using either heap-profiler or ruby_memprofiler_pprof in production, but since you mentioned this is reproducible in a beta/staging environment, the added overhead and potential impact of using these gems should not be an issue.

ivoanjo avatar Jun 10 '22 12:06 ivoanjo

@jorgemanrubia This detail and comparison is helpful, thank you.

We do have hard caps on trace & buffer size (100K spans) which are supposed to mitigate the impact of large traces. However, they may be set too high, or the criteria for triggering the cap may be too coarse-grained. We may want to re-evaluate this behavior.

As an experiment, you could tweak the default cap with Datadog::Tracing::TraceOperation.const_set(:DEFAULT_MAX_LENGTH, 1000), or set it anywhere below 100K, to see if it has any meaningful impact. That would tell us if this particular control is effective, or whether we need to go further.
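
For example, a minimal sketch of that experiment (assuming it goes in your existing config/initializers/datadog.rb, right after the configure call):

# config/initializers/datadog.rb
Datadog.configure do |c|
  c.tracing.instrument :rails
end

# Experiment: lower the per-trace span cap from the default 100_000 to 1_000.
# (Ruby may warn about redefining an initialized constant; that's expected here.)
Datadog::Tracing::TraceOperation.const_set(:DEFAULT_MAX_LENGTH, 1_000)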

delner avatar Jun 10 '22 14:06 delner

Hey folks, I'll get more info about this one this week. It's on my radar, I just need to allocate time for it.

jorgemanrubia avatar Jun 14 '22 10:06 jorgemanrubia

👋 @jorgemanrubia and @rgrey, thank you for your patience here!

I was trying to find any possible points where ddtrace holds on to memory by simply running large Ruby applications (I tried GitLab and Discourse) with ddtrace enabled: an ObjectSpace.dump on both did not reveal any ddtrace objects being kept alive for longer than the duration of their processing. The memory usage also eventually stabilized.

This leads me back to your situation: the issue is likely a long-running Rake task. To be clear, any never-ending trace started by ddtrace will cause large memory accumulation: we have a hard limit of 100k spans, but in my very simple local tests a trace at that limit occupied 460MiB of memory, and you can have one active trace per Ruby thread. I don't see such a large trace being sustainable on a production deployment.

I've talked to our team, and we'll make Rake instrumentation selective by default: you'll have to explicitly declare which tasks need to be monitored, instead of all tasks being instrumented by default. Because instrumenting a never-ending Rake task can cause critical issues, we are not comfortable having instrumentation that, by default, can cause your application to run out of memory. The explicit declaration will definitely be more cumbersome, but it's the safe thing to do with the tools we have today.

Until we have that in place, I recommend disabling the Rake instrumentation, if you haven't already. In the meantime, you can manually instrument any important Rake tasks you have today that you wish to maintain visibility on:

# Ensure tracer is configured for Rake tasks, if not configured elsewhere
require 'ddtrace'
Datadog.configure{...}

task :my_task do |t|
  # Using `rake.invoke` and task name as the `resource` matches the current instrumentation.
  # Your monitors and alerts will work unmodified.
  Datadog::Tracing.trace('rake.invoke', resource: t.name) do
    # Task work goes here
  end
end

I'll also look into reducing the hard limit of 100k spans: it's practically useless to have such a large trace today. The amount of data on each Datadog App page is overwhelming and the flame graph becomes illegible. I'll try to find a realistic hard limit based on our current client usage.

Overall, I'd like us to find an approach that is proactive in preventing never-ending traces without affecting otherwise correctly instrumented long traces.

marcotc avatar Jul 15 '22 22:07 marcotc

The same thing happens for our Sidekiq workers. A single long-running worker that did a lot of queries and iterated over a lot of database objects caused this behaviour. After disabling profiling, the memory consumption curve stays flat. The memory graph illustrates what happens when the worker is started, restarted, then restarted with profiling disabled.

Screenshot 2023-02-11 at 22 09 49

I spent a lot of time today trying to pin down this issue, and it seems that somehow, with this setting enabled, datadog holds on to a bunch of ActiveRecord objects.

   3.07 kB  connection_pool-2.3.0/lib/connection_pool.rb:61
   2.88 kB  connection_pool-2.3.0/lib/connection_pool.rb:64
   2.18 kB  sidekiq-7.0.0/lib/sidekiq/fetch.rb:49
   1.60 kB  redis-client-0.12.1/lib/redis_client/command_builder.rb:9
   1.34 kB  connection_pool-2.3.0/lib/connection_pool.rb:68
   1.34 kB  connection_pool-2.3.0/lib/connection_pool.rb:71
  928.00 B  mysql2-0.5.4/lib/mysql2/client.rb:148
  640.00 B  activesupport-7.0.4/lib/active_support/core_ext/numeric/deprecated_conversions.rb:6
  320.00 B  connection_pool-2.3.0/lib/connection_pool.rb:60
  320.00 B  redis-client-0.12.1/lib/redis_client/decorator.rb:27
   80.00 B  sidekiq-7.0.0/lib/sidekiq/launcher.rb:105
   40.00 B  activemodel-7.0.4/lib/active_model/attribute_set.rb:93
   40.00 B  connection_pool-2.3.0/lib/connection_pool/timed_stack.rb:63
   40.00 B  sidekiq-7.0.0/lib/sidekiq/launcher.rb:228

(snippet from a memory dump using https://github.com/SamSaffron/memory_profiler)

Running:

Datadog::Tracing::TraceOperation.const_set(:DEFAULT_MAX_LENGTH, 1000)

Made no difference.

erkie avatar Feb 11 '23 21:02 erkie

Hey @erkie, sorry that you're running into problems with memory usage.

May I ask you to open your report as a separate issue, so that we can investigate there? Unfortunately, since GitHub has no threading, it becomes harder to discuss potentially-different problems that may have the same outcome (memory usage).

So that we can investigate, do you mind also including the following:

  • Your version of dd-trace-rb
  • How you are disabling/enabling dd-trace-rb

ivoanjo avatar Feb 14 '23 13:02 ivoanjo

I'm experiencing what looks like a similar problem, but with sidekiq. Opened #2930 for better threading of the conversation, but am linking here in case some other traveler finds it useful.

cluesque avatar Jun 23 '23 22:06 cluesque

We started experiencing similar issues after upgrading from Datadog 0.54.2 to 1.0.0. It was primarily affecting a long-running data synchronisation process (implemented as a Rake task) that could allocate up to 8 GB (the maximum we allowed) in 6-8 hours. There were also OOM errors in our Que workers that started happening, but I can't prove they were related.

After we upgraded to Datadog 1.12.1 the issue was gone 🎉

I find it really funny that I spent the last week using Datadog to troubleshoot memory issues that were caused by Datadog 😆

adamniedzielski avatar Jul 07 '23 11:07 adamniedzielski

Happy to hear that you're not impacted anymore :tada:

Please do open a ticket if you ever run into any issue -- we're glad to look into it :)

ivoanjo avatar Jul 07 '23 14:07 ivoanjo

I'm experiencing what looks like a similar problem, but with sidekiq. Opened #2930 for better threading of the conversation, but am linking here in case some other traveler finds it useful.

Thanks a lot for linking the issue here. This helped me as well (tracing.partial_flush.enabled = true).

peterfication avatar Jul 13 '23 06:07 peterfication

Hey! We were reviewing old tickets and bumped into this one.

There have been quite a number of releases and fixes to ddtrace since this was discussed, and for a few folks it looks like partial_flush is the way to go to make sure huge traces get broken up into smaller chunks that can be evicted.
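
For anyone landing here later, enabling that looks something like this (a sketch using the setting name quoted above):

Datadog.configure do |c|
  # Flush completed portions of long-running traces instead of holding
  # every span in memory until the root span finishes.
  c.tracing.partial_flush.enabled = true
end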

@jorgemanrubia and any folks that run into any issue -- please do let us know by opening new issues if you see anything suspicious -- we definitely want to look into them :pray: :smile:

ivoanjo avatar Jan 03 '24 14:01 ivoanjo