
tracer: too much memory consumption

Open prithvihv opened this issue 3 years ago • 6 comments

(pprof) top
Showing nodes accounting for 106.21MB, 88.51% of 120MB total
Dropped 40 nodes (cum <= 0.60MB)
Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
   21.50MB 17.92% 17.92%    21.50MB 17.92%  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).setMeta (inline)
   19.50MB 16.25% 34.18%    21.50MB 17.92%  runtime/pprof.WithLabels
      15MB 12.50% 46.68%       15MB 12.50%  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).setMetric
      12MB 10.00% 56.68%    77.79MB 64.82%  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*tracer).StartSpan
   10.36MB  8.64% 65.32%    10.86MB  9.05%  github.com/gamezop/***/pkg/***
       8MB  6.67% 71.98%    18.86MB 15.72%  github.com/gamezop/***/pkg/***
    6.01MB  5.01% 76.99%     6.01MB  5.01%  github.com/DataDog/datadog-go/v5/statsd.newStatsdBuffer
    5.82MB  4.85% 81.84%     5.82MB  4.85%  github.com/gamezop/***/pkg/***
    4.50MB  3.75% 85.59%     5.27MB  4.39%  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newSpanContext
    3.50MB  2.92% 88.51%     3.50MB  2.92%  runtime.allocm

Datadog tracing is consuming more memory than the application itself.

Relevant code in question:

	txn, err := db.Begin()
	if errr.IfErrAndLog(err, l, "failed to create transaction") {
		return err
	}

	stmt, err := txn.Prepare(pq.CopyIn("bid", bidFields()...))
	if errr.IfErrAndLog(err, l, "failed to prepare stmt") {
		return err
	}

	for _, bid := range bids {
		_, err = stmt.ExecContext(ctx,
			bid.BidOrderId,
			bid.IsBuy,
			IntToDecimal(bid.Price),
			bid.Qty,
			bid.QuestionId,
			"unmatched",
		)
		if errr.IfErrAndLog(err, l, "failed to exec context") {
			return err
		}
	}
	_, err = stmt.Exec()
	if errr.IfErrAndLog(err, l, "failed to exec to push all data") {
		return err
	}

	err = stmt.Close()
	if errr.IfErrAndLog(err, l, "failed to close stmt") {
		return err
	}

	err = txn.Commit()
	if errr.IfErrAndLog(err, l, "failed to commit") {
		return err
	}

	return err

prithvihv avatar Jun 27 '22 15:06 prithvihv

Hi, I also noticed the same thing with a SQL transaction of ~9M rows: the traces were using 600MB+ of RAM (and still climbing until my server ran out of memory). The only way to get the transaction to work was to remove sqltrace. Is there any way to send partial spans to help with this problem, or some other workaround?

eleboucher avatar Jul 08 '22 07:07 eleboucher

@prithvihv Sorry for the late reply, and sorry to hear about the issue you're experiencing.

Would it be possible for you (or @eleboucher) to share the pprof file for this? You can send it via https://help.datadoghq.com/hc/en-us/requests/new (mention my name and this issue), and then reply here once you've done so.

My first guess is that this could be a goroutine leak. So a goroutine profile would also be useful.

19.50MB 16.25% 34.18% 21.50MB 17.92% runtime/pprof.WithLabels

A potential workaround you could try to get rid of this one is:

tracer.Start(
  // your options ...,
  tracer.WithProfilerCodeHotspots(false),
  tracer.WithProfilerEndpoints(false),
)
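
For context, a fuller sketch of that workaround (the service and environment values are placeholders, not from this thread). Those two options disable the Code Hotspots and endpoint profiling integrations, which are what attach pprof labels to goroutines and account for the runtime/pprof.WithLabels allocations in the profile above:

package main

import (
	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func main() {
	tracer.Start(
		tracer.WithService("my-service"), // placeholder
		tracer.WithEnv("staging"),        // placeholder
		// Disable the profiler integrations that label goroutines with
		// span/endpoint information via runtime/pprof.
		tracer.WithProfilerCodeHotspots(false),
		tracer.WithProfilerEndpoints(false),
	)
	defer tracer.Stop()

	// ... run the application ...
}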

felixge avatar Jul 08 '22 17:07 felixge

Hey @felixge,

I sent you the requested file. You can see in my traces that sqlx.Rebind takes most of the memory; however, when I disable dd-trace, memory usage is fine. I hope that helps.

Edit: your workaround doesn't affect the memory usage.

eleboucher avatar Jul 11 '22 13:07 eleboucher

@eleboucher thanks, I received the data! All profiles were generated while applying the same workload?

Edit: your workaround doesn't affect the memory usage.

Looking at your data, I'm not surprised by this. Your issue might be different from @prithvihv's; there is no runtime/pprof.WithLabels in your profiles.

Anyway, I'll ping you again once I know more.

felixge avatar Jul 11 '22 14:07 felixge

@eleboucher thanks, I received the data! All profiles were generated while applying the same workload?

Yes, everything was generated using the same workload.

eleboucher avatar Jul 11 '22 14:07 eleboucher

@eleboucher I followed up with our support team and they should reach out to you shortly. But the gist is that tracing long-running transactions with many queries creates a large number of spans that can't be flushed to the agent until the transaction has completed, which drives up memory usage.

I've given some specific advice for your situation in the ticket (avoid sqlx.Rebind()), but the general problem is a limitation of any tracing implementation that attempts to send complete traces.
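
For illustration, a minimal sketch of that kind of mitigation, assuming lib/pq and the contrib/database/sql integration (the dsn and handle names are placeholders): keep COPY-style bulk loads on a plain, untraced handle so they don't generate one span per statement, and reserve the traced handle for regular request-scoped queries.

package main

import (
	"database/sql"

	"github.com/lib/pq"
	sqltrace "gopkg.in/DataDog/dd-trace-go.v1/contrib/database/sql"
)

// openHandles returns a traced handle for regular queries and a plain,
// untraced handle for bulk loads. The untraced handle creates no spans, so a
// multi-million-row transaction does not buffer an equally large trace.
func openHandles(dsn string) (traced, bulk *sql.DB, err error) {
	// Register the postgres driver with the tracing integration (once per process).
	sqltrace.Register("postgres", &pq.Driver{})

	// Traced handle: each Exec/Query gets its own span.
	if traced, err = sqltrace.Open("postgres", dsn); err != nil {
		return nil, nil, err
	}

	// Untraced handle: lib/pq's init registers "postgres" with database/sql,
	// so this handle bypasses the tracer entirely.
	if bulk, err = sql.Open("postgres", dsn); err != nil {
		return nil, nil, err
	}
	return traced, bulk, nil
}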

Anyway, I work on profiling; I only got involved here because the runtime/pprof.WithLabels call in the original report is part of the profiler.

I'll leave it to the @DataDog/tracing-go team to comment/confirm.

felixge avatar Jul 12 '22 08:07 felixge

Hello! 👋 I think @felixge has given some great pointers here. I'll just add that one area that may help is "Partial Flushing" #1428 (TL;DR: send completed spans before the entire trace is finished). This can relieve the memory pressure caused by large, long-running traces with many shorter spans that currently hang around in memory. This feature is on our rough roadmap, but we don't have a timeline for it just yet. We've added this issue to our internal feature request tracking, which we'll use to prioritize the work.
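
Until partial flushing is available, one manual mitigation (a sketch only; the Bid type, insertBatch helper, batch size, and operation name are hypothetical) is to give each batch of work its own short root span, so each batch's trace completes and can be flushed while the job is still running:

package main

import (
	"context"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

// Bid and insertBatch are hypothetical stand-ins for application code.
type Bid struct{ ID string }

func insertBatch(ctx context.Context, batch []Bid) error { return nil }

// processBids traces each batch as its own root span instead of parenting
// every statement under one giant trace, so completed traces are flushed to
// the agent during the job rather than piling up in memory until the end.
func processBids(bids []Bid, batchSize int) error {
	for start := 0; start < len(bids); start += batchSize {
		end := start + batchSize
		if end > len(bids) {
			end = len(bids)
		}
		// Start from a fresh context so the batch becomes a new trace.
		span, ctx := tracer.StartSpanFromContext(context.Background(), "bids.batch")
		err := insertBatch(ctx, bids[start:end])
		span.Finish(tracer.WithError(err))
		if err != nil {
			return err
		}
	}
	return nil
}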

If you are still experiencing very high memory usage and would like other possible workarounds or more direct eyes on the problem, feel free to open a ticket with Datadog Support. Thanks!

ajgajg1134 avatar Oct 04 '22 18:10 ajgajg1134

Related to this, does dd-trace-go support a TraceInterceptor, as dd-trace-java does?

We want to be able to filter/drop spans in long-lived queue jobs.

NicklasWallgren avatar Oct 20 '22 10:10 NicklasWallgren

I'm facing a similar issue while tracing bidirectional gRPC streams. Details here: https://github.com/DataDog/dd-trace-go/issues/1428#issuecomment-1321157069

Abhishekvrshny avatar Nov 20 '22 15:11 Abhishekvrshny

Related to this, does dd-trace-go support a TraceInterceptor, as dd-trace-java does? We want to be able to filter/drop spans in long-lived queue jobs.

Can you open a separate GitHub issue or reach out to Datadog Support for this? I think we'll need some more details about your setup and use case to better understand the ask.

Support for "Partial Flushing" #1428 will be started soon. Once that's released, let's check whether it improves your memory usage here.

ajgajg1134 avatar Mar 21 '23 14:03 ajgajg1134