opentelemetry-dotnet icon indicating copy to clipboard operation
opentelemetry-dotnet copied to clipboard

A span with a new span ID MUST be created even for a sampling decision of DROP.

Open Oberon00 opened this issue 3 years ago • 13 comments

Bug Report

This bug should be present in master and at least since 0.5.0-beta (probably since the beginning)

Symptom

If the sampler returns a decision of DROP, StartActivity returns null.

What is the expected behavior?

An unsampled activity with a new span ID is created and returned (and set as current which is not spec-conformant but expected in .NET). This allows correlating logs also to unsampled operations.

Spec references:

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#shouldsample

DROP - IsRecording() == false, span will not be recorded and all events and attributes will be dropped.

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#sdk-span-creation

  1. Generate a new span ID for the Span, independently of the sampling decision. This is done so other components (such as logs or exception handling) can rely on a unique span ID, even if the Span is a non-recording instance.
  2. [...] A non-recording span MAY be implemented using the same mechanism as when a Span is created without an SDK installed or as described in wrapping a SpanContext in a Span.

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#wrapping-a-spancontext-in-a-span

The behavior is defined as follows:

  • GetContext MUST return the wrapped SpanContext.

What is the actual behavior?

Null is returned for the activity, no span ID is available.

Reproduce

There is a unit test for the wrong behavior here: https://github.com/open-telemetry/opentelemetry-dotnet/blob/641b2f76f1e3b9b1aa1697ab47ec24aed31a45c3/test/OpenTelemetry.Tests/Trace/TracerProviderSdkTest.cs#L274-L277

Additional Context

This is a pretty big deviation from the spec that will inhibit some use cases. Creating a new span ID even for unsampled spans was a very deliberate decision. There were even two competing PRs https://github.com/open-telemetry/opentelemetry-specification/pull/1225 and https://github.com/open-telemetry/opentelemetry-specification/pull/998 and it was explicitly decided on always creating a span ID. See especially the thread ending at https://github.com/open-telemetry/opentelemetry-specification/pull/998#discussion_r522458365

Oberon00 avatar May 19 '22 15:05 Oberon00

Note that in the case I observed, I even got a null activity despite there being no current span (the parent context was extracted from HTTP headers, having IsRemote=true, TraceFlags=None)

Oberon00 avatar May 19 '22 15:05 Oberon00

Up voting this issue. Furthermore, the behavior of StartActivity should be such that an activity is created with or without the sampled flag turned on, depending on the sampler settings, and the sampler should be such that an exception can change the DROP decision of the Activity and All its parents to record failed Spans created locally.

There should be at least the option to be able to record the root activity resulting of a transaction with no parent, like an API entry point which should return the traceID in the response (https://github.com/w3c/trace-context/blob/main/spec/21-http_response_header_format.md) . The returned TraceID (with possibly an orphan flag) then can be used to query the DB for spans with exceptions coming from an unsampled chain of Spans that where not sampled.

The main objective is to sample a percentage of the healthy transactions, while recording 100% of the local spans that resulted in an error/exception. We must be able to answer the question as why a transaction 00-ttt-xxx-00 failed. Without this distributed tracing becomes no better than debug logs.

bledogit avatar Aug 15 '22 23:08 bledogit

Can someone comment on any performance implications of creating activity object always even when sampling decision was to not collect current span ?

erchirag avatar Nov 01 '22 20:11 erchirag

https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/Benchmarks/Trace/TraceBenchmarks.cs#L38 Creating an activity, and feeding it to a no-op processor, is approximately of this (see above ^) much cost.

cijothomas avatar Nov 01 '22 20:11 cijothomas

https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/Benchmarks/Trace/TraceBenchmarks.cs#L38 Creating an activity, and feeding it to a no-op processor, is approximately of this (see above ^) much cost.

https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/Benchmarks/Trace/OpenTelemetrySdkBenchmarksActivity.cs#L38

And can go up, if tags/attributes are added. (most likely avoidable, by checking IsAllDataRequested before populating tags)

cijothomas avatar Nov 01 '22 20:11 cijothomas

What about any GC impact if lots of these objects are created in tight loop ?

erchirag avatar Nov 01 '22 21:11 erchirag

The benchmarks shows the heap allocation (416B), so yes GC has to clean it up.

cijothomas avatar Nov 01 '22 21:11 cijothomas

If performance is an issue, can't these objects be pooled?

bledogit avatar Nov 02 '22 13:11 bledogit

Tagging for consideration in 1.5.0

cijothomas avatar Feb 14 '23 21:02 cijothomas

+1. I think this is the root cause of #4087.

TimothyMothra avatar Feb 16 '23 01:02 TimothyMothra

The PR to address got stale and closed. I suggest to move this out of 1.5 timeline, as this likely requires more discussions.

cijothomas avatar Mar 21 '23 18:03 cijothomas