(AspNet.TelemetryHttpModule) Activity.Current not restored before all IIS LifeCycle events
Bug Report
Activity.Current lost in some IIS lifecycle events
We have a custom HttpApplication with handlers for most IIS lifecycle events. These events can start child spans (mostly sql) that should be attached to the current request trace. For a percentage of the IIS lifecycle events, we are seeing traces that don't appear to be properly restoring Activity.Current, causing child spans to start their own trace.
OpenTelemetry = 1.2.0
OpenTelemetry.Api = 1.2.0
OpenTelemetry.Exporter.Zipkin = 1.2.0
OpenTelemetry.Instrumentation.AspNet = 1.0.0-rc9.2
OpenTelemetry.Instrumentation.AspNet.TelemetryHttpModule = 1.0.0-rc9.2
OpenTelemetry.Instrumentation.SqlClient = 1.0.0-rc9.2
Runtime version net48
Symptom
I believe Activity.Current is null
What is the expected behavior?
Activities started in IIS lifecycle methods should be attached to the request trace to which they belong
What is the actual behavior?
The Activities are orphaned, creating single span traces.
Additional Context
We aren't seeing this in all lifecycle events, only some of them and only a certain percentage of the time.
Here is the approximate percentage of time we're seeing this occur by lifecycle method. It happens most frequently in PreSendRequestHeaders (nearly every time)
Unaffected
IIS Life Cycle Method % parent span exists
PostAuthenticateRequest 1
PostMapRequestHandler 1
OnAuthenticate 1
BeginRequest 1
RequestHandlerExecute 0.99999 (assuming this is probably not a real issue)
Affected by the bug
PreRequestHandlerExecute 0.98452
PostRequestHandlerExecute 0.98104
PostLogRequest 0.83814
PreSendRequestHeaders 0.0085
How are we tracking this? We have a custom span processor that adds the current lifecycle method to each span by reading it from HttpContext.Items
I believe this could be resolved by changing this line https://github.com/open-telemetry/opentelemetry-dotnet/blob/55c5dd18e9bcb8f9f728aa1a0557bd5157c88362/src/OpenTelemetry.Instrumentation.AspNet.TelemetryHttpModule/TelemetryHttpModule.cs#L88-L91
to
if (!context.IsPostNotification)
{
ActivityHelper.RestoreContextIfNeeded(context.ApplicationInstance.Context);
}
@chris-smith-zocdoc
So that rule is just something we carried over from the original version here: https://github.com/aspnet/Microsoft.AspNet.TelemetryCorrelation/blob/50f0acb7a0cb164d834eb91b839959a841577302/src/Microsoft.AspNet.TelemetryCorrelation/TelemetryCorrelationHttpModule.cs#L84
I don't think anyone is opposed to changing it.
Could you possibly share a small repo that reproduces the problem so I can take it for a spin?
Thanks @CodeBlanch I've setup a minimum reproduction here https://github.com/chris-smith-zocdoc/OtelSample
You should be able to open the solution and hit F5 to run under IIS Express to see this easily.
Set a breakpoint on https://github.com/chris-smith-zocdoc/OtelSample/blob/master/Global.asax.cs#L39
@chris-smith-zocdoc Thanks for that code, very helpful! I took it for a spin, here's my findings...
It is a bit more complicated than simply removing context.CurrentNotification == RequestNotification.ExecuteRequestHandler.
The order of events firing in OnExecuteRequestStep looks like this:
- BeginRequest (Activity started)
- AuthenticateRequest
- AuthorizeRequest
- ResolveRequestCache
- MapRequestHandler
- AcquireRequestState
- PreExecuteRequestHandler
- ExecuteRequestHandler
- ReleaseRequestState
- UpdateRequestCache
- LogRequest
- EndRequest (Activity is stopped)
- SendResponse (Application_PreSendRequestHeaders/PreSendRequestContent called at this point)
That last line is the important one. Even if we call the restore code, the Activity has been stopped so there is nothing to restore 🤣
We could move the stop to the SendResponse event. I need to test this IIS URL Rewrite logic in there to make sure that works. BUT, let me ask first, @cijothomas @alanwest @reyang how do we feel about changing how this instrumentation works so that some of the time spent sending the response starts to be included in the span duration?
@CodeBlanch that may be true for PreSendRequestHeaders but not the other events which were showing smaller amounts of loss (~2% to 16%)
PreRequestHandlerExecute, PostRequestHandlerExecute, PostLogRequest
So I think there are two decisions here:
- can we restore before all events? I'd be happy to send a pr for this
- should we move the Activity stop call
@CodeBlanch I opened https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/761 to restore the context before all events