
DataDog integration (dd-trace-go)

Open wijayaerick opened this issue 1 year ago • 8 comments

I've been using this library and wrote a custom DataDog integration (observability service) for River. It would be nice if we could introduce a DataDog integration directly in dd-trace-go (contrib packages). I wrote the integration because I need to be able to debug easily once it's released in production, and I think having such an integration will help people adopt this library if they happen to use DataDog.

If you don't mind, I'm willing to help add the integration to the dd-trace-go repo.

Thank you for writing such an amazing library!

wijayaerick avatar Dec 27 '24 11:12 wijayaerick

This is a draft of the integration. The actual implementation I have in production predates the job middleware, so I have yet to test the following code. It will take a while for me to test it, as I need to migrate the database before upgrading River.

Insert:

// RiverInsertMiddleware injects the active DataDog trace context into job
// metadata at insert time so that workers can continue the trace.
type RiverInsertMiddleware struct {
	river.JobInsertMiddlewareDefaults
	logger *slog.Logger
}

func (m *RiverInsertMiddleware) InsertMany(
	ctx context.Context,
	manyParams []*rivertype.JobInsertParams,
	doInner func(ctx context.Context) ([]*rivertype.JobInsertResult, error),
) ([]*rivertype.JobInsertResult, error) {
	if len(manyParams) == 0 {
		return doInner(ctx)
	}

	span, ok := tracer.SpanFromContext(ctx)
	if !ok {
		return doInner(ctx)
	}
	spanCtx := span.Context()

	for _, param := range manyParams {
		metadataWithTrace, err := injectTraceToRiverMetadata(spanCtx, param.Metadata)
		if err != nil {
			m.logger.ErrorContext(ctx, "failed to inject trace to job metadata", "err", err)
			continue
		}

		param.Metadata = metadataWithTrace
	}

	return doInner(ctx)
}
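
The injectTraceToRiverMetadata helper and MapCarrier type aren't shown above; here's a minimal sketch of one way they could look, assuming a map[string]string-backed carrier satisfying dd-trace-go v1's tracer.TextMapWriter and tracer.TextMapReader interfaces (names are illustrative, not the exact production code):

// MapCarrier is a map-backed carrier that dd-trace-go can inject into and
// extract from.
type MapCarrier map[string]string

func (c MapCarrier) Set(key, val string) { c[key] = val }

func (c MapCarrier) ForeachKey(handler func(key, val string) error) error {
	for k, v := range c {
		if err := handler(k, v); err != nil {
			return err
		}
	}
	return nil
}

// injectTraceToRiverMetadata merges the span context's propagation keys into
// the job's existing metadata JSON.
func injectTraceToRiverMetadata(spanCtx ddtrace.SpanContext, metadata []byte) ([]byte, error) {
	carrier := MapCarrier{}
	if err := tracer.Inject(spanCtx, carrier); err != nil {
		return nil, err
	}

	merged := map[string]any{}
	if len(metadata) > 0 {
		if err := json.Unmarshal(metadata, &merged); err != nil {
			return nil, err
		}
	}
	for k, v := range carrier {
		merged[k] = v
	}
	return json.Marshal(merged)
}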

Worker:

// RiverWorkerMiddleware starts a consumer span around each job execution,
// resuming the trace injected at insert time when one is present.
type RiverWorkerMiddleware struct {
	river.WorkerMiddlewareDefaults
}

func (m *RiverWorkerMiddleware) Work(ctx context.Context, job *rivertype.JobRow, doInner func(ctx context.Context) error) (err error) {
	opts := []ddtrace.StartSpanOption{
		tracer.ResourceName(job.Kind),
		tracer.SpanType(ext.SpanTypeMessageConsumer),
		tracer.Tag("river_job.id", job.ID),
		tracer.Tag("river_job.scheduled_at", job.ScheduledAt),
		tracer.Tag("river_job.queue", job.Queue),
		tracer.Tag("river_job.kind", job.Kind),
		tracer.Tag("river_job.attempt", job.Attempt),
		tracer.Tag(ext.Component, componentName),
		tracer.Tag(ext.SpanKind, ext.SpanKindConsumer),
		tracer.Tag(ext.MessagingSystem, "river"),
		tracer.Measured(),
	}

	if parentSpanCtx, err := extractRiverJobSpanCtx(job); err == nil {
		opts = append(opts, tracer.ChildOf(parentSpanCtx))
	}

	span, ctx := tracer.StartSpanFromContext(ctx, "river.work", opts...)
	// Record any error returned by the job (via the TagError helper, not
	// shown here) and finish the span.
	defer func() {
		if err != nil {
			TagError(ctx, err)
		}
		span.Finish()
	}()

	return doInner(ctx)
}

// extractRiverJobSpanCtx restores the span context that the insert middleware
// injected into the job's metadata.
func extractRiverJobSpanCtx(job *rivertype.JobRow) (ddtrace.SpanContext, error) {
	var carrier MapCarrier
	if err := json.Unmarshal(job.Metadata, &carrier); err != nil {
		return nil, err
	}
	return tracer.Extract(carrier)
}
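
Registering both middlewares would look roughly like this (a sketch using the riverpgxv5 driver; the config field names follow the River version this draft targets, and newer releases also accept a unified Middleware slice):

// Sketch: wiring the middlewares into a River client.
func newRiverClient(dbPool *pgxpool.Pool) (*river.Client[pgx.Tx], error) {
	workers := river.NewWorkers()
	// river.AddWorker(workers, &SomeWorker{}) // register workers here

	return river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues: map[string]river.QueueConfig{
			river.QueueDefault: {MaxWorkers: 100},
		},
		Workers:             workers,
		JobInsertMiddleware: []rivertype.JobInsertMiddleware{&RiverInsertMiddleware{logger: slog.Default()}},
		WorkerMiddleware:    []rivertype.WorkerMiddleware{&RiverWorkerMiddleware{}},
	})
}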

wijayaerick avatar Dec 27 '24 11:12 wijayaerick

Hey @wijayaerick, this looks great. Thank you!

Would love to help get this into DataDog's contrib! It might not be the worst idea to see if you can get River upgraded just so you could try the new middleware pattern, and make sure that any code we write plays nicely in a realistic stack (yours) before we go too much further. Other than that, looks pretty doable. From the other contrib folders in that repo, it looks like they'll want a basic set of tests, but that shouldn't be difficult.

brandur avatar Dec 27 '24 16:12 brandur

@wijayaerick we do something very similar to what you have here at my day job. Main things we do that I don't see here:

  • We use tracer.WithError to capture errors from the work function, and we also recover panics to add them to the trace before repanicking (not sure if that's best practice).
  • We set the dd.trace_id field in our logger so that we get log correlation. Probably doable in a vendor-agnostic way using slog (rough sketch after this list).
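
For the second point, a rough sketch of the vendor-agnostic slog route (the handler wrapper is made up here, and it uses the 64-bit IDs from dd-trace-go v1's SpanContext):

// traceHandler wraps a slog.Handler and attaches DataDog correlation IDs to
// every record whose context carries an active span.
type traceHandler struct {
	slog.Handler
}

func (h traceHandler) Handle(ctx context.Context, rec slog.Record) error {
	if span, ok := tracer.SpanFromContext(ctx); ok {
		rec.AddAttrs(
			slog.Uint64("dd.trace_id", span.Context().TraceID()),
			slog.Uint64("dd.span_id", span.Context().SpanID()),
		)
	}
	return h.Handler.Handle(ctx, rec)
}

// Usage: slog.New(traceHandler{Handler: slog.NewJSONHandler(os.Stdout, nil)})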

dgunay avatar Feb 20 '25 03:02 dgunay

Thanks @dgunay. You're correct that we should use tracer.WithError. Regarding your other points:

  • I won't be adding panic recovery in the ddtrace integration, as clients will have different ways of handling (or not handling) panics.
  • For debug logging, the contrib package uses gopkg.in/DataDog/dd-trace-go.v1/internal/log, since that's what the other contrib packages currently use as well.

I'm currently working on the tests, and I'm looking to open a PR soon.

wijayaerick avatar Feb 21 '25 09:02 wijayaerick

@wijayaerick Hey Erick, I'm thinking of tackling an officially supported DataDog middleware next seeing as we've gotten a lot of requests for it.

Do you want to dump what you have so far up somewhere? Optional of course, but if you do, I'll try to work what you put together into my change, or if yours is close to finished, we could work towards refining it to become the officially supported version instead.

Edit: Or I guess you already wrote most of it in your comment above. Feel free to write another comment if that's changed significantly, or I can just go off that.

brandur avatar Feb 27 '25 14:02 brandur

Hi @brandur, the code I've worked on so far is available at https://github.com/wijayaerick/dd-trace-go/tree/contrib/riverqueue/river. It's quite close to finished; I just need to add a few more test cases.

wijayaerick avatar Feb 28 '25 08:02 wijayaerick

Hi @brandur, I've created a PR at https://github.com/DataDog/dd-trace-go/pull/3245. Feel free to let me know if there's anything in the code you'd like to add or change (especially the naming of the span tags).

wijayaerick avatar Mar 03 '25 15:03 wijayaerick

@wijayaerick Thank you! Will do.

brandur avatar Mar 03 '25 16:03 brandur

We released an official OpenTelemetry package which should resolve this:

https://github.com/riverqueue/rivercontrib/tree/master/otelriver

Example use with DataDog:

https://github.com/riverqueue/rivercontrib/tree/master/datadogriver

brandur avatar May 10 '25 18:05 brandur

Hi @brandur,

I've recently tried using otelriver and it works out of the box 🎉. I can see the traces and spans in DataDog APM.

I have yet to try using a meter provider (because I haven't figured out how to initialize metric.MeterProvider), but looking at the code I should be able to migrate the metrics as well once I do (currently I implement metrics using client.Subscribe).

Now that the otelriver package exists, what would be the status of the DataDog contrib package? I have some unpushed changes addressing your PR comment (disabling distributed tracing by default and adding an option to enable it), but apart from that I'd like you to decide whether we should still move forward with it or close the PR.

Apologies for the late reply 🙏.

wijayaerick avatar May 11 '25 14:05 wijayaerick

Hey @wijayaerick, no worries!

I have yet to try using a meter provider (because I haven't figured out how to initialize metric.MeterProvider), but looking at the code I should be able to migrate the metrics as well once I do (currently I implement metrics using client.Subscribe).

Did you see these couple lines of code?

https://github.com/riverqueue/rivercontrib/blob/741fb02bfb2c94b922eccec208acd8fd36980f99/datadogriver/example_global_provider_test.go#L17-L19

It should be that easy to instantiate a global DataDog provider, and from there the River middleware will automatically use it.
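
Not necessarily what that example does internally, but if you'd rather wire a metric.MeterProvider yourself, the generic OTel SDK route looks roughly like this (a sketch; it assumes a Datadog Agent or collector with OTLP intake enabled at the usual endpoint, and it uses go.opentelemetry.io/otel, go.opentelemetry.io/otel/sdk/metric as sdkmetric, and the otlpmetricgrpc exporter):

// Sketch: install a global MeterProvider that exports over OTLP gRPC.
func setupGlobalMeterProvider(ctx context.Context) error {
	// Endpoint is an assumption; point it at wherever your Agent (or
	// collector) listens for OTLP.
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		return err
	}

	otel.SetMeterProvider(sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
	))
	return nil
}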

Now that the otelriver package exists, what would be the status of the DataDog contrib package? I have some unpushed changes addressing your PR comment (disabling distributed tracing by default and adding an option to enable it), but apart from that I'd like you to decide whether we should still move forward with it or close the PR.

Personally, I'd close it. It was a good idea, but if we can get away with OpenTelemetry instead, it's less overall to maintain because we get one package for many providers. Also, in case we need to make changes it's wayyyyyy easier to land them in our own contrib repositories rather than try to send them upstream to DataDog. We've already seen based on your PR that they're not particularly responsive anyway.

brandur avatar May 11 '25 16:05 brandur