dd-trace-go icon indicating copy to clipboard operation
dd-trace-go copied to clipboard

[BUG]: opentelemetry span.End() overrides user-set error fields (error.message, error.type, error.stack)

Open kmrgirish opened this issue 6 months ago • 5 comments

Tracer Version(s)

all versions <= v2.0.1

Go Version(s)

go1.24.3 darwin/arm64

Bug Report

We’re using the Datadog OpenTelemetry bridge with the OpenTelemetry Go SDK. We noticed that when we manually set error.message, error.type, and error.stack, these fields are overwritten during span.End() if the span status is set to Error.

https://github.com/DataDog/dd-trace-go/blob/fe9272dcb82745b2fd352f7bb54dd0db4d96d1c1/ddtrace/opentelemetry/span.go#L85-L88

https://github.com/DataDog/dd-trace-go/blob/fe9272dcb82745b2fd352f7bb54dd0db4d96d1c1/ddtrace/tracer/span.go#L424-L432

This behavior causes our manually attached error metadata to be lost — most notably error.stack, which gets replaced by a stack trace pointing to internal middleware like interceptors or gRPC handlers, rather than our application’s actual error context.

Stack trace example Here is a sample where you can see the overwritten stack trace originating from middleware:

gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).setTagError
	/go/pkg/mod/gopkg.in/!data!dog/[email protected]/ddtrace/tracer/span.go:332
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).Finish
	/go/pkg/mod/gopkg.in/!data!dog/[email protected]/ddtrace/tracer/span.go:477
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/opentelemetry.(*span).End
	/go/pkg/mod/gopkg.in/!data!dog/[email protected]/ddtrace/opentelemetry/span.go:103
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*config).handleRPC
	/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/stats_handler.go:204
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*serverHandler).HandleRPC
	/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/stats_handler.go:81
google.golang.org/grpc.(*Server).processUnaryRPC.func1
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1262
google.golang.org/grpc.(*Server).processUnaryRPC
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1427
google.golang.org/grpc.(*Server).handleStream
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1802
google.golang.org/grpc.(*Server).serveStreams.func2.1
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1030
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1700

This is not helpful when trying to debug real issues from user code.


Expected behavior

If a user has already set error.message, error.type, or error.stack, the tracer should not overwrite them. These fields should be treated as user-defined and preserved.

Actual behavior

When a span ends with status code = Error, the ddotel implementation internally creates a new error using the statusInfo.description, and calls tracer.WithError(...). This leads to:

  • Overwriting error.message with the statusInfo.description
  • Overwriting error.type using reflect.TypeOf(error)
  • Overwriting error.stack with a new stack trace — which, in our case, only includes internal tracing layers, not the actual error origin

We are currently using [email protected]. I also checked the [email protected] — while we haven’t tested it in our setup yet, the codebase appears to follow the same logic, so the issue likely persists in v2 as well.

This issue appears to be related to #3424, which highlights a similar problem regarding error field overwriting.

Reproduction Code

No response

Error Logs

No response

Go Env Output

No response

kmrgirish avatar Jun 30 '25 05:06 kmrgirish

Hi, I’ve opened a PR to address this issue: https://github.com/DataDog/dd-trace-go/pull/3702.

The goal of this change is to prevent (*span).End() from overriding user-defined error fields like error.message, error.type, and error.stack, which currently get replaced if the span status is Error.

Summary of the approach:

  • The fix relies on the user explicitly setting the error attribute on the span. If error is present and of type error, then (*span).End() does not fall back to statusInfo.description to create a new error.
  • If the user only sets error.message, error.type, or error.stack but not the error attribute, those will still be overwritten — this is a limitation of the current approach.
  • Additionally, setTagError now detects whether the error implements optional methods:
    • ErrorType() string → used for error.type
    • ErrorStack() string → used for error.stack

This gives users more control over how their custom errors are rendered in the trace.

I’m not sure if this is the direction the maintainers would prefer for solving this. If there’s a more idiomatic or internal way to handle this in the library, please feel free to guide me on how to improve it. Or, if you think this shouldn’t be fixed this way, I completely understand if the PR is closed.

Looking forward to your thoughts.

kmrgirish avatar Jun 30 '25 06:06 kmrgirish

Adding a bit more context on why the PR gates on the presence of the error attribute instead of checking error.message / error.type / error.stack individually:

  • Those three fields should be treated as an atomic set.
    If we let the library “fill in the blanks” field-by-field, we can end up with a half-user / half-auto-generated error, e.g.:
Field Value provided by user Value finally sent to Datadog
error.message custom kept (user value)
error.type — (omitted) auto-generated (reflect.TypeOf(statusErr))
error.stack — (omitted) auto-generated (middleware stack)
  • This mixed state is confusing in the UI and makes root-cause analysis harder.

  • OpenTelemetry’s Go API doesn’t enforce setting them together.
    Nothing stops a caller from setting only error.message. Relying on individual checks would therefore let inconsistent combinations leak through.

  • Using a single error attribute is the clearest signal that the caller intends to supply the full error context. If error is missing (or nil), falling back to statusInfo.description still makes sense.


That’s the reasoning behind the current approach, but I’m open to alternatives if there’s a better way to guarantee consistency.

kmrgirish avatar Jun 30 '25 06:06 kmrgirish

Hey @kmrgirish ,

Thanks very much for your detailed contribution!

I tested setting error attributes on a custom span, and these were not overwritten by (*span).End() — presumably because the if s.statusInfo.code == otelcodes.Error condition (ref) is not hit.

package main

import (
	"context"

	ddotel "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/opentelemetry"
	ddtracer "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func main() {
	// register tracer
	provider := ddotel.NewTracerProvider(ddtracer.WithDebugMode(true))
	defer provider.Shutdown()
	otel.SetTracerProvider(provider)
	t := otel.Tracer("")
	
	// start span with custom error attributes
	_, sp := t.Start(context.Background(), "start")
	sp.SetAttributes(attribute.String("error.message", "custom error"))
	sp.SetAttributes(attribute.String("error.type", "custom error type"))
	sp.SetAttributes(attribute.String("error.stack", "custom error stack"))
	sp.End()
}

Debug logs (Note the error.message:custom error error.stack:custom error stack error.type:custom error type):

2025/06/30 15:16:03 Datadog Tracer v1.74.2 DEBUG: Finished Span: dd.trace_id="6862e27300000000091f3a599d0377dd" dd.span_id="657308227180132317" dd.parent_id="0", Operation: internal, Resource: start, Tags: map[error.message:custom error error.stack:custom error stack error.type:custom error type language:go runtime-id:2e80882f-a1db-4636-af58-faee99ac4bfa span.kind:internal], map[_dd.agent_psr:1 _dd.profiling.enabled:0 _dd.top_level:1 _dd.trace_span_attribute_schema:0 _sampling_priority_v1:1 process_id:46276]

Trace in UI: Image

Can you share how your environment differs?

Thank you!

mtoffl01 avatar Jun 30 '25 19:06 mtoffl01

@mtoffl01 We’re seeing this behavior only when a span is explicitly marked as failed using SetStatus, which is commonly done by most tracing interceptors.

For example: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/instrumentation/google.golang.org/grpc/otelgrpc/v0.62.0/instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go#L339

In cases where SetStatus is not used, the error stack is preserved correctly. However, when the span status is set to Error, the Datadog library seems to override any previously set error stack. Since most open-source interceptors do set the span status to Error, this results in user-defined error stacks being unintentionally overwritten.

kmrgirish avatar Jul 07 '25 06:07 kmrgirish

Just to add more context

Most OpenTelemetry interceptors, including the gRPC ones, set the span status to Error when the operation fails — for example, when the gRPC response code is non-zero. This behavior causes the Datadog library to override any previously set error.stack, which results in the loss of meaningful stack traces, even if they were explicitly recorded earlier. In production, this often leaves us with spans that indicate failure but contain no useful stack information for debugging. While it’s technically possible to work around this by forking the interceptors and removing the SetStatus calls, this feels quite hacky to me.

Additionally, AFAIK, Datadog relies on the span status for intelligent sampling — for instance, retaining all spans marked as failed — which makes this interaction even more critical to handle correctly. Ideally, the library should respect explicitly set error fields regardless of the span status.

kmrgirish avatar Jul 07 '25 06:07 kmrgirish