ApplicationInsights-dotnet
TelemetryClient.Flush deadlocks
Repro Steps
- netcoreapp2.2 console app
- telemetry configuration:
  - ServerTelemetryChannel
  - DependencyTrackingTelemetryModule and UnobservedExceptionTelemetryModule
- call TelemetryClient.Flush on workload completion (async Task Main); a minimal sketch of this setup follows below
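For context, a minimal sketch of the setup described above, assuming a manually built TelemetryConfiguration, a placeholder instrumentation key, and a stand-in workload (the real app's wiring may differ):

```csharp
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DependencyCollector;
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.ApplicationInsights.WindowsServer;
using Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel;

class Program
{
    static async Task Main()
    {
        var config = TelemetryConfiguration.CreateDefault();
        config.InstrumentationKey = "00000000-0000-0000-0000-000000000000"; // placeholder

        // ServerTelemetryChannel instead of the default InMemoryChannel.
        var channel = new ServerTelemetryChannel();
        channel.Initialize(config);
        config.TelemetryChannel = channel;

        // Auto-collection modules mentioned in the repro.
        var dependencyModule = new DependencyTrackingTelemetryModule();
        dependencyModule.Initialize(config);
        var unobservedExceptionModule = new UnobservedExceptionTelemetryModule();
        unobservedExceptionModule.Initialize(config);

        var client = new TelemetryClient(config);

        // Stand-in workload using the GetMetric()/TrackValue() pre-aggregation API.
        for (int i = 0; i < 10_000; i++)
        {
            client.GetMetric("WorkItemDuration").TrackValue(i % 50);
        }
        await Task.Delay(100); // stand-in for real async work

        client.Flush();         // <-- occasionally never returns (the reported deadlock)
        await Task.Delay(5000); // give the channel a chance to transmit
    }
}
```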
Actual Behavior
The app deadlocks during the Flush operation. On consistent hardware and in a consistent deployment environment, this occurs reliably in about 1 in 8000 runs.
Expected Behavior
The app should not deadlock.
Version Info
SDK Version: appinsights 2.10.0
.NET Version: netcore 2.2
How Application was onboarded with SDK (VisualStudio/StatusMonitor/Azure Extension):
OS: win10-x64
Hosting Info (IIS/Azure WebApps/etc.): console
Two stacks seem relevant for this:
Thread 1
ntdll.dll!NtWaitForMultipleObjects() Unknown
KERNELBASE.dll!WaitForMultipleObjectsEx() Unknown
[Managed to Native Transition]
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.UpdateAggregate(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.CompleteAggregation(System.DateTimeOffset periodEnd) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.GetNonpersistentAggregations(System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.CycleAggregators(ref Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter, bool stopAggregators) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.StartOrCycleAggregators(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricAggregationCycleKind aggregationCycleKind, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricManager.Flush(bool flushDownstreamPipeline) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.TelemetryClient.Flush() Unknown
Thread 2
ntdll.dll!NtDelayExecution() Unknown
KERNELBASE.dll!SleepEx() Unknown
[Managed to Native Transition]
System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnce(int sleep1Threshold) Line 169 C#
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double>.GetAndResetValue(int index) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MeasurementAggregator.UpdateAggregate_Stage1(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer, int minFlushIndex, int maxFlushIndex) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.UpdateAggregate(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.CompleteAggregation(System.DateTimeOffset periodEnd) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.GetNonpersistentAggregations(System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.CycleAggregators(ref Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter, bool stopAggregators) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.StartOrCycleAggregators(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricAggregationCycleKind aggregationCycleKind, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter) Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.DefaultAggregationPeriodCycle.FetchAndTrackMetrics() Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.DefaultAggregationPeriodCycle.Run() Unknown
System.Threading.Thread.dll!System.Threading.Thread.ThreadMain_ThreadStart() Line 93 C#
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 167 C#
[Native to Managed Transition]
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
I'm encountering this issue on a consistent basis. Can anyone advise on a solution or workaround?
@macrogreg Can you take a look at this?
I think it's related to the lock in MetricSeriesAggregatorBase.UpdateAggregate(): https://github.com/microsoft/ApplicationInsights-dotnet/blob/f181abbcdf6a6a2a4aea821ce1d72e98b0c68fbe/src/Microsoft.ApplicationInsights/Metrics/Extensibility/MetricSeriesAggregatorBase.cs#L392-L394
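For readers trying to follow the two stack traces, here is a toy illustration of the general class of hazard they hint at: one thread spin-waits for a value while holding a lock that the thread responsible for publishing that value also needs. This is not the SDK's code and may not be the exact root cause here; it only shows why a spin inside a locked region can wedge a Flush-style operation indefinitely.

```csharp
using System;
using System.Threading;

class SpinUnderLockDemo
{
    static readonly object AggregationLock = new object();
    static int pendingValue; // 0 = nothing written yet, 1 = value available

    static void Main()
    {
        var flusher = new Thread(() =>
        {
            lock (AggregationLock) // "complete the aggregate" under the lock...
            {
                var spin = new SpinWait();
                // ...and spin until a writer publishes its value.
                while (Interlocked.CompareExchange(ref pendingValue, 0, 1) != 1)
                {
                    spin.SpinOnce();
                }
            }
        });

        var writer = new Thread(() =>
        {
            lock (AggregationLock) // the writer needs the same lock to publish, so it blocks
            {
                Volatile.Write(ref pendingValue, 1);
            }
        });

        flusher.Start();
        Thread.Sleep(100); // let the flusher grab the lock first
        writer.Start();

        flusher.Join();    // never returns: the flusher spins forever while holding the lock
        Console.WriteLine("unreachable");
    }
}
```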
As a workaround, if you're not using the GetMetric().Track() API, you can remove the metrics pre-aggregator processor from your configuration.
As per the recommendation in the API documentation, which discourages the old metric patterns, we're using GetMetric() and TrackValue(). Hence, unfortunately, the above workaround would not help.
This is one of the problems with the Flush API. While this particular issue is pretty bad and we want to fix it along with Flush, we won't have the bandwidth for it this cycle, so moving it to the next milestone.
I think I see the same issue with AI 2.4. Since I am not using GetMetric().Track(), how can I remove the metrics pre-aggregator processor from my configuration, as suggested above?
@rafist Are you on ASP.NET Core? Then do the following to remove the metric pre-aggregator:

    public void ConfigureServices(IServiceCollection services)
    {
        Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions aiOptions =
            new Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions();

        // disable autocollected metric extractor
        aiOptions.AddAutoCollectedMetricExtractor = false;

        services.AddApplicationInsightsTelemetry(aiOptions);
    }
If you are on ASP.NET, then comment out the following processor from ApplicationInsights.config:

    <Add Type="Microsoft.ApplicationInsights.Extensibility.AutocollectedMetricsExtractor, Microsoft.ApplicationInsights"/>
Moving to the next milestone, as there are no cycles to investigate this.
Thanks @cijothomas, my issue was not related to the Flush after all.
@rafist, could you please share more details on the actual issue, and whether it has been resolved for you?
@cijothomas / @lmolkova / @TimothyMothra, we are using the Log4Net Application Insights appender to log to App Insights. TelemetryClient.Flush is called after every TelemetryClient.TrackTrace call. After a few hours, or sometimes immediately, all other API calls in my application start failing with TaskCanceledException. When we disable the Application Insights appender in the Log4net.config file, all my API calls succeed. We are using .NET Framework 4.6.2 and Application Insights appender v2.11.0.0. It seems Application Insights is blocking the other API calls. My application is an Azure Worker Role.
Could you please let us know if this is a known issue in the Flush method. If yes, please share any workaround and possible fix timelines.
Thanks Anjan
Moving this further out, past the current milestones.
This is consistently happening to us in production also. We are calling Flush from two places, and in some scenarios that seems to cause the deadlock (it may actually be a livelock). I am attaching stack traces from the debugger; they start at the bottom after the calls from our code into the SDK. When I resume and pause the debugger again, the code doesn't seem to move (that, however, doesn't tell us much, since it may just be improbable for me to hit anything but the spin-wait).
This manifests when the application is shutting down, as both calls to Flush happen only at that point. In most cases the deadlock doesn't occur and shutdown completes normally. When it does occur, metrics about CPU and memory usage are still sent.
We are using only services.AddApplicationInsightsTelemetryWorkerService() with no default overrides.
(sorry for these being screenshots, but VS Code ignores me when I ask it to "copy the stack trace")

+1. A similar (probably the same) issue was reported outside of this repo.
The issue is already tagged p1, but not yet assigned a milestone. Will mark for the next release.
The 2.17 release date is not yet announced. Will update here once a firm date is set.
@cijothomas are you sure this will be resolved in 2.17? I've noticed the milestone for this issue keeps getting pushed out. As a potential workaround, do you think we could try invoking Flush() from a background thread and, if it hasn't completed in a reasonable time, attempt to Interrupt() that thread (sketched below)? This occurs for us when a service is shutting down; the deadlock or livelock is actually preventing the service from shutting down. So while dispatching on a background thread and interrupting if needed is hacky, it would potentially be a solution until there is a more concrete fix. The reason we call Flush() on shutdown is to ensure any buffered telemetry is sent to AI before the process exits.
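A rough sketch of that hacky workaround, assuming it is acceptable to abandon the flush (and lose any still-buffered telemetry) on timeout; FlushWithTimeout is a hypothetical helper name, not an SDK API:

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationInsights;

static class TelemetryShutdown
{
    // Runs Flush() on a dedicated thread and interrupts it if it does not finish in time.
    public static bool FlushWithTimeout(TelemetryClient client, TimeSpan timeout)
    {
        var flushThread = new Thread(() =>
        {
            try
            {
                client.Flush();
            }
            catch (ThreadInterruptedException)
            {
                // Flush was abandoned; any telemetry still buffered may be lost.
            }
        })
        {
            IsBackground = true,
            Name = "AppInsightsFlush"
        };

        flushThread.Start();

        if (flushThread.Join(timeout))
        {
            return true; // Flush completed normally.
        }

        // Interrupt only takes effect while the thread is in a wait/sleep/join state
        // (e.g. the SleepEx seen under SpinWait above); a pure busy-spin is not interrupted.
        flushThread.Interrupt();
        return false;
    }
}
```

For example, the shutdown path could call TelemetryShutdown.FlushWithTimeout(telemetryClient, TimeSpan.FromSeconds(10)) and proceed with process exit regardless of the result.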
Something similar is happening in my code as well. I have used the settings below:
    var aiOptions = new Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions
    {
        InstrumentationKey = instrumentationKey,
        EnableAdaptiveSampling = false,
        EnableQuickPulseMetricStream = false,
        EnablePerformanceCounterCollectionModule = false,
    };
but I am still getting a high number of threads, which results in 100% CPU. See below for thread details.

Any updates would be helpful. In the meantime, I will try setting 'AddAutoCollectedMetricExtractor' to false and monitor.
@shoguns6 It looks like you are reporting a different issue than the Flush deadlocks. The metrics aggregator (DefaultAggregationPeriodCycle) is only expected to create one thread. If you are seeing more than one (461 in your screenshot), it likely means the TelemetryConfiguration/TelemetryClient is being recreated instead of reused. Could you check whether this is possible in your application?
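A minimal sketch of the reuse pattern being suggested, assuming the configuration is built once at startup (the class and member names here are illustrative, not from the SDK). Each TelemetryConfiguration that is used with GetMetric() gets its own aggregation-cycle thread, so recreating the configuration or client per operation can multiply those threads:

```csharp
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Extensibility;

static class Telemetry
{
    // Built once for the lifetime of the process.
    private static readonly TelemetryConfiguration Configuration =
        TelemetryConfiguration.CreateDefault();

    // Shared client: do NOT construct a new TelemetryClient/TelemetryConfiguration
    // per request or per operation.
    public static readonly TelemetryClient Client = new TelemetryClient(Configuration);
}

// Usage anywhere in the app:
//   Telemetry.Client.GetMetric("QueueLength").TrackValue(42);
//   Telemetry.Client.TrackTrace("processed item");
```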
One more +1 as this was reported by another user.
A tactical fix can be made to prevent Flush from waiting (spinning) indefinitely. It should avoid the currently reported deadlock situations. As 2.18 is nearing its stable release, adding this to the next milestone, 2.19. The ETA for the stable release of 2.19 is not finalized, but it will be before Nov 2021. The fix is expected to ship in the very first beta of 2.19.
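For illustration only, the general shape of such a tactical fix (not the SDK's actual change) is to put a deadline on the wait instead of spinning indefinitely, e.g. via SpinWait.SpinUntil with a timeout:

```csharp
using System;
using System.Threading;

static class BoundedWaitExample
{
    // slotReady is a hypothetical predicate standing in for "the buffer slot has been written".
    public static bool WaitForSlot(Func<bool> slotReady)
    {
        // Returns false if the condition is still not met after 30 seconds,
        // letting the caller bail out instead of hanging forever.
        return SpinWait.SpinUntil(slotReady, TimeSpan.FromSeconds(30));
    }
}
```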
Just another data point, we also have this problem. We have experienced it randomly since at least early 2021. Quite frustrating that it has not been possible to fix it even after 20 point releases of AI.
Customer request to add this item to 2.21 release if possible.
Same problem here. We had it two weeks ago for the first time, using 2.18. We had it consistently, multiple times during the day on different servers, so we upgraded to 2.20, and yesterday the issue occurred again. We have to leave Application Insights disabled for now, which is not ideal. Please include this in 2.21.
Hello All, We released v2.21-beta2 today which contains a fix for this issue. If anyone can test this new beta, please write back and share if the issue is resolved for you.
Hello All, We released v2.21.0 today, which contains a fix for this issue. We're going to leave this issue open until we get confirmation that this issue has been resolved.
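For anyone wanting to verify, the stable package can be pulled in with the usual NuGet command (substitute the specific Application Insights package your project actually references, e.g. Microsoft.ApplicationInsights.AspNetCore or Microsoft.ApplicationInsights.WorkerService):

```
dotnet add package Microsoft.ApplicationInsights --version 2.21.0
```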
We have created a backlog item to upgrade the SDK and test the fix.
If anyone who experienced this issue can confirm that the latest version fixes the issue, it would be greatly appreciated.
Apologies for the delay. We have this scheduled for the upcoming sprint, since we saw the deadlock again in integration. So we will upgrade to the new library.
We have now had the latest version running for 5 months, and the issue has not occurred anymore. Thanks for looking into it.