
Possible Memory Issue

Open bmalinconico opened this issue 5 months ago • 5 comments

Current behaviour

I have found that enabling auto-compaction in the Ruby GC causes what appear to be random memory-related bugs. The bug manifests in the following ways:

Manifestation 1

TypeError: wrong argument type XXX (expected PG::Connection)

usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/contrib/pg/instrumentation.rb:27:in `exec': wrong argument type Set (expected PG::Connection) (TypeError)
	from /usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/contrib/pg/instrumentation.rb:27:in `block in exec'
	from /usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/contrib/pg/instrumentation.rb:145:in `block in trace'
	from /usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/trace_operation.rb:206:in `block in measure'

Here XXX can be any built-in or app-specific class; it varies from one occurrence to the next.

This is raised on a call to exec or exec_params in the DD Postgres instrumentation.

Manifestation 2

This error is produced by the reproduction steps I will provide later.

PG::InvalidDatetimeFormat: ERROR:  invalid input syntax for type date: ""
CONTEXT:  unnamed portal parameter $1 = ''

I patched the DD PG instrumentation for exec_params and rescued the error with a pry session. The params array contained no empty values, and a retry of the block succeeded.
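For anyone who wants to repeat that check without pry, here is a minimal sketch of the idea. `retry_once_with_param_check` is a hypothetical helper name, not part of datadog or pg; it assumes you wrap your own call site (for example a `conn.exec_params(sql, params)` call) in the block rather than patching the instrumentation.

```ruby
# Hypothetical helper (not part of datadog or pg): runs the block and, if it
# raises, reports which parameters are nil or empty before retrying once.
def retry_once_with_param_check(params)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    # Indexes of parameters that are nil or stringify to "".
    blank = params.each_index.select { |i| params[i].nil? || params[i].to_s.empty? }
    warn "#{e.class}: #{e.message.lines.first&.chomp} (blank param indexes: #{blank.inspect})"
    retry if attempts < 2 # re-run the block a single time
    raise                 # second failure: propagate the error
  end
end
```

If the reported list of blank indexes is empty and the retry succeeds, that matches what I saw: the Ruby-side array was intact, so the empty value was introduced somewhere below it.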

Manifestation 3

Occasional segfaults.

All of these errors feel like something is holding a memory reference that is being moved, resulting in random garbage getting passed down the stack and occasionally referencing a freed memory location.

Expected behaviour

Not an error!

Steps to reproduce

I was unable to reproduce this on my local machine, but a local containerized env may be able to reproduce it. I was only able to reproduce it in a container running on EC2; that machine is Linux x86_64.

Dockerfile to reproduce this image

FROM ruby:3.3.4
RUN apt-get update && apt-get install -y libjemalloc2 && rm -rf /var/lib/apt/lists/*
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

My running application reproduces this error easily when I have the compacting garbage collector enabled, due to the volume of activity. Reproducing this in a shell is much more time-consuming, as you need to (presumably) wait for a compaction to occur.
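To tell whether a compaction actually ran during a given attempt, something like the following may help. It is only a sketch: `compactions_during` is a name I made up, and it assumes a Ruby version whose `GC.stat` exposes the `:compact_count` key.

```ruby
# Hypothetical helper: returns how many GC compaction passes ran while the
# block executed, using the :compact_count counter from GC.stat.
def compactions_during
  before = GC.stat(:compact_count)
  yield
  GC.stat(:compact_count) - before
end
```

Wrapping a batch of queries in `compactions_during { ... }` and logging the result alongside any error should show whether failures line up with compaction passes.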

I'll also acknowledge this may not be datadog, but I've tried to narrow it down as much as I can.

ENV['DATABASE_URL'] = 'setme' # replace with a real connection string
require 'date' # Date.today is used below
require 'pg'
require 'datadog'
GC.auto_compact = true

Datadog.configure do |c|
  c.tracing.instrument :pg
end

conn = PG.connect(ENV.fetch('DATABASE_URL', nil))
loop do
  conn.exec_params("SELECT #{1_664.times.map { |i| "$#{i + 1}::date as f_#{i}" }.join(',')}", 1_664.times.map { Date.today })
  print '.'
end

I'm going to reiterate that reproducing this is annoying, since it takes no small amount of luck to get a compacting GC run to trigger at the right time. Doing the above in concurrent fibers increased the odds of it happening (probably due to increased memory churn); however, I am providing the smallest repro I can.
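One way to lean on luck a little less might be to force compaction passes while the workload runs. The sketch below is an assumption on my part, not something I have confirmed triggers the bug: `churn` is a hypothetical helper, it uses plain threads rather than fibers (to avoid needing a fiber scheduler), and it calls `GC.compact` explicitly instead of waiting for auto-compaction.

```ruby
# Hypothetical helper: runs the block `iterations` times in each of `threads`
# threads, forcing a compaction pass every `compact_every` iterations.
def churn(threads: 4, iterations: 100, compact_every: 10, &block)
  # GC.compact is not implemented on every platform.
  can_compact = GC.respond_to?(:compact)
  workers = Array.new(threads) do
    Thread.new do
      iterations.times do |i|
        block.call
        GC.compact if can_compact && ((i + 1) % compact_every).zero?
      end
    end
  end
  workers.each(&:join)
  nil
end
```

Note that a PG::Connection should not be shared across threads, so each thread would need its own connection inside the block, for example by opening one per thread with `PG.connect`.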

Environment

  • datadog version: Currently 2.3.0; I was upgrading from 1.2.3 and enabling the profiler. When I downgraded to 1.2.3, the issue was still present if I turned on auto_compact.
  • Configuration block (Datadog.configure ...):
  • Ruby version: ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [x86_64-linux]
  • Operating system: Linux
  • Relevant library versions: PG - 1.5.6

bmalinconico, Aug 30 '24 15:08