
TelemetryClient.Flush deadlocks

Open mlugosan opened this issue 6 years ago • 30 comments


Repro Steps

  1. netcoreapp2.2 console app
  2. Telemetry configuration:
  • ServerTelemetryChannel
  • DependencyTrackingTelemetryModule and UnobservedExceptionTelemetryModule
  3. Call TelemetryClient.Flush on workload completion (async Task Main); see the sketch below.
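
For illustration, a minimal sketch of the setup described above, wired up manually. The instrumentation key, metric name, and placeholder workload are assumptions and not part of the original report:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DependencyCollector;
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.ApplicationInsights.WindowsServer;
using Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel;

class Program
{
    static async Task Main()
    {
        // ServerTelemetryChannel plus the two modules named in the repro steps.
        var config = TelemetryConfiguration.CreateDefault();
        config.InstrumentationKey = "00000000-0000-0000-0000-000000000000"; // placeholder

        var channel = new ServerTelemetryChannel();
        config.TelemetryChannel = channel;
        channel.Initialize(config);

        var dependencyModule = new DependencyTrackingTelemetryModule();
        dependencyModule.Initialize(config);
        var unobservedExceptionModule = new UnobservedExceptionTelemetryModule();
        unobservedExceptionModule.Initialize(config);

        var client = new TelemetryClient(config);

        // Stand-in for the real workload; metrics are recorded via GetMetric().TrackValue().
        client.GetMetric("ItemsProcessed").TrackValue(1);
        await Task.Delay(100);

        // Flush on workload completion -- the call that occasionally never returns.
        client.Flush();
    }
}
```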

Actual Behavior

The app deadlocks during the Flush operation. The failure occurs reliably, about 1 in 8,000 runs, on consistent hardware and in a consistent deployment environment.

Expected Behavior

Flush should complete without deadlocking.

Version Info

SDK Version: Application Insights 2.10.0
.NET Version: netcoreapp2.2
How Application was onboarded with SDK (VisualStudio/StatusMonitor/Azure Extension):
OS: win10-x64
Hosting Info (IIS/Azure WebApps/etc.): console

Two stacks seem relevant here:

Thread 1

ntdll.dll!NtWaitForMultipleObjects()	Unknown
KERNELBASE.dll!WaitForMultipleObjectsEx()	Unknown
[Managed to Native Transition]	
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.UpdateAggregate(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.CompleteAggregation(System.DateTimeOffset periodEnd)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.GetNonpersistentAggregations(System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.CycleAggregators(ref Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter, bool stopAggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.StartOrCycleAggregators(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricAggregationCycleKind aggregationCycleKind, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricManager.Flush(bool flushDownstreamPipeline)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.TelemetryClient.Flush()	Unknown

Thread 2

ntdll.dll!NtDelayExecution()	Unknown
KERNELBASE.dll!SleepEx()	Unknown
[Managed to Native Transition]	
System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnce(int sleep1Threshold) Line 169	C#
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double>.GetAndResetValue(int index)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MeasurementAggregator.UpdateAggregate_Stage1(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer, int minFlushIndex, int maxFlushIndex)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.UpdateAggregate(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.CompleteAggregation(System.DateTimeOffset periodEnd)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.GetNonpersistentAggregations(System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.CycleAggregators(ref Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter, bool stopAggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.StartOrCycleAggregators(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricAggregationCycleKind aggregationCycleKind, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.DefaultAggregationPeriodCycle.FetchAndTrackMetrics()	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.DefaultAggregationPeriodCycle.Run()	Unknown
System.Threading.Thread.dll!System.Threading.Thread.ThreadMain_ThreadStart() Line 93	C#
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 167	C#
[Native to Managed Transition]	
kernel32.dll!BaseThreadInitThunk()	Unknown
ntdll.dll!RtlUserThreadStart()	Unknown

mlugosan avatar Aug 01 '19 16:08 mlugosan

I'm encountering this issue on a consistent basis. Can anyone advise on a solution or workaround?

mlugosan avatar Aug 12 '19 15:08 mlugosan

@macrogreg Can you take a look at this?

I think it's related to the lock in MetricSeriesAggregatorBase.UpdateAggregate(): https://github.com/microsoft/ApplicationInsights-dotnet/blob/f181abbcdf6a6a2a4aea821ce1d72e98b0c68fbe/src/Microsoft.ApplicationInsights/Metrics/Extensibility/MetricSeriesAggregatorBase.cs#L392-L394

TimothyMothra avatar Aug 14 '19 19:08 TimothyMothra

As a workaround, if you're not using the GetMetric().TrackValue() API, you can remove the metrics pre-aggregator processor from your configuration.

TimothyMothra avatar Aug 15 '19 18:08 TimothyMothra

Per the recommendation in the API documentation, which discourages the old metric patterns, we're using GetMetric() and TrackValue(). Unfortunately, the above workaround therefore would not help us.

mlugosan avatar Aug 15 '19 19:08 mlugosan

This is one of the problems with the Flush API. This particular issue is pretty bad and we want to fix it along with Flush, but this cycle we won't have the bandwidth for it, so moving it to the next milestone.

lmolkova avatar Aug 26 '19 18:08 lmolkova

I think I see the same issue with AI 2.4. Since I am not using GetMetric().TrackValue(), how can I remove the metrics pre-aggregator processor from my configuration, as suggested above?

rafist avatar Oct 11 '19 21:10 rafist

@rafist Are you on ASP.NET Core? Then do the following to remove the metric pre-aggregator:

    public void ConfigureServices(IServiceCollection services)
    {
        Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions aiOptions =
            new Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions();

        // disable autocollected metric extractor
        aiOptions.AddAutoCollectedMetricExtractor = false;

        services.AddApplicationInsightsTelemetry(aiOptions);
    }

If you are on ASP.NET, then comment out the following processor in ApplicationInsights.config:

    <Add Type="Microsoft.ApplicationInsights.Extensibility.AutocollectedMetricsExtractor, Microsoft.ApplicationInsights"/>

cijothomas avatar Dec 06 '19 03:12 cijothomas

Moving to the next milestone, as there are no cycles to investigate this.

cijothomas avatar Dec 06 '19 03:12 cijothomas

Thanks @cijothomas, my issue was not related to the Flush after all.

rafist avatar Dec 31 '19 13:12 rafist

@rafist, could you please share more details on the actual issue, and let us know whether it is resolved for you?

anjanchidige avatar Jan 16 '20 10:01 anjanchidige

@cijothomas / @lmolkova / @TimothyMothra, we are using Log4Net with the Application Insights appender to log to App Insights. TelemetryClient.Flush is called after every TelemetryClient.TrackTrace call. After a few hours, and sometimes immediately, all other API calls in my application start failing with Task Cancelled exceptions. When we disable the Application Insights appender in the log4net.config file, all my API calls succeed. We are using .NET Framework 4.6.2 and Application Insights appender v2.11.0.0. It seems Application Insights is blocking the other API calls. My application is an Azure Worker Role.

Could you please let us know whether this is a known issue in the Flush method? If so, please share any workaround and possible fix timelines.

Thanks Anjan

anjanchidige avatar Jan 16 '20 13:01 anjanchidige

Moving further from current milestones.

cijothomas avatar Sep 01 '20 17:09 cijothomas

This is consistently happening to us in production as well. We call Flush from two places, and in some scenarios that seems to cause the deadlock (it may actually be a livelock). I am attaching stack traces from the debugger; they start at the bottom, after the calls from our code into the SDK. When I resume and pause the debugger again, the code does not seem to move (that by itself doesn't tell us much, since it may simply be improbable to hit anything but the spin-wait).

This manifests when the application is shutting down, as both calls to Flush are made only at that point. In most cases this doesn't happen and shutdown completes normally. When the deadlock occurs, metrics about CPU and memory usage are still being sent.

We are using only services.AddApplicationInsightsTelemetryWorkerService() with no default overrides.

(Sorry these are screenshots, but VS Code ignores me when I ask it to "copy the stack trace".) [two screenshots of the stack traces attached]

mladedav avatar Nov 23 '20 11:11 mladedav

+1. A similar, probably the same, issue was reported outside of this repo.

The issue is already tagged p1, but not yet assigned a milestone. Will mark for the next release.

cijothomas avatar Nov 23 '20 15:11 cijothomas

The 2.17 release date is not yet announced. Will update here once a firm date is set.

cijothomas avatar Nov 23 '20 15:11 cijothomas

@cijothomas are you sure this will be resolved in 2.17? I've noticed the milestone for this issue keeps getting pushed out. As a potential workaround, do you think we could invoke Flush() from a background thread and, if it hasn't completed in a reasonable time, attempt to Interrupt() the thread? This occurs for us when a service is shutting down, and the deadlock or livelock actually prevents the service from shutting down. So while dispatching on a background thread and interrupting if needed is hacky, it could be a stopgap until there is a more concrete fix. The reason we call Flush() on shutdown is to ensure any buffered telemetry is sent to AI before the process exits.
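
For what it's worth, a rough sketch of that workaround. The helper name and the timeout are illustrative, and Interrupt() only wakes the thread from a blocking wait, so buffered telemetry may still be lost:

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationInsights;

static class TelemetryShutdown
{
    // Run Flush on a dedicated thread so a hung Flush cannot block process shutdown.
    public static void FlushWithTimeout(TelemetryClient client, TimeSpan timeout)
    {
        var flushThread = new Thread(() =>
        {
            try
            {
                client.Flush();
            }
            catch (ThreadInterruptedException)
            {
                // We gave up waiting; some buffered telemetry may be lost.
            }
        })
        { IsBackground = true };

        flushThread.Start();

        if (!flushThread.Join(timeout))
        {
            // Flush appears stuck (the deadlock/livelock described in this issue);
            // interrupt it and let shutdown proceed.
            flushThread.Interrupt();
        }
    }
}
```

For example, TelemetryShutdown.FlushWithTimeout(telemetryClient, TimeSpan.FromSeconds(10)) during shutdown.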

ninlar avatar Dec 02 '20 19:12 ninlar


Something similar is happening with my code as well (the same symptoms @mladedav describes above). I have used the settings below:

    var aiOptions = new Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions
    {
        InstrumentationKey = instrumentationKey,
        EnableAdaptiveSampling = false,
        EnableQuickPulseMetricStream = false,
        EnablePerformanceCounterCollectionModule = false,
    };

but I am still seeing a high number of threads, which results in 100% CPU. See the thread details below. [screenshot of the thread list attached]

Any updates would be helpful. In the meantime, I will try setting 'AddAutoCollectedMetricExtractor' to false and monitor.
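
For reference, a sketch of that change applied to the same options object shown above (assuming the instrumentationKey and services variables from that snippet are in scope):

```csharp
var aiOptions = new Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions
{
    InstrumentationKey = instrumentationKey,
    EnableAdaptiveSampling = false,
    EnableQuickPulseMetricStream = false,
    EnablePerformanceCounterCollectionModule = false,
    // Removes the metrics pre-aggregator discussed earlier in this thread.
    AddAutoCollectedMetricExtractor = false,
};
services.AddApplicationInsightsTelemetry(aiOptions);
```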

shoguns6 avatar Feb 22 '21 13:02 shoguns6

@shoguns6 It looks like you are reporting a different issue than the Flush deadlock. The metrics aggregator (DefaultAggregationPeriodCycle) is only expected to create one thread. If you are seeing more than one (461 in your screenshot), it likely means the TelemetryConfiguration/TelemetryClient is being recreated instead of reused. Could you check whether this is happening in your application?
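
A small sketch of the reuse pattern being suggested (names are illustrative): build the configuration and client once and share them, rather than constructing a new TelemetryConfiguration/TelemetryClient per operation, since each recreated configuration brings its own aggregation thread.

```csharp
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Extensibility;

public static class Telemetry
{
    // Created once for the lifetime of the process and shared everywhere.
    private static readonly TelemetryConfiguration Config = TelemetryConfiguration.CreateDefault();

    public static readonly TelemetryClient Client = new TelemetryClient(Config);
}
```

Callers would then use Telemetry.Client.TrackTrace(...) or Telemetry.Client.GetMetric(...) instead of newing up a client each time.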

cijothomas avatar Feb 25 '21 18:02 cijothomas

One more +1 as this was reported by another user.

cijothomas avatar Jul 09 '21 17:07 cijothomas

A tactical fix can be made to prevent Flush from waiting (spinning) indefinitely; it should avoid the currently reported deadlock situations. As 2.18 is nearing its stable release, adding this to the next milestone, 2.19. The ETA for the stable release of 2.19 is not finalized, but it'll be before Nov 2021. The fix is expected to ship in the very first beta of 2.19.
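
A generic illustration, not the SDK's actual code, of what a bounded wait like this looks like: the spin loop gets a deadline and gives up instead of spinning forever. The BCL's SpinWait.SpinUntil(Func&lt;bool&gt;, TimeSpan) provides equivalent behavior out of the box.

```csharp
using System;
using System.Threading;

static class BoundedSpin
{
    // Spin until the condition holds or the timeout elapses; returns false on
    // timeout so the caller can give up rather than hang indefinitely.
    public static bool SpinUntil(Func<bool> condition, TimeSpan timeout)
    {
        var spin = new SpinWait();
        var deadline = DateTime.UtcNow + timeout;
        while (!condition())
        {
            if (DateTime.UtcNow >= deadline)
            {
                return false;
            }
            spin.SpinOnce();
        }
        return true;
    }
}
```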

cijothomas avatar Jul 14 '21 14:07 cijothomas

Just another data point, we also have this problem. We have experienced it randomly since at least early 2021. Quite frustrating that it has not been possible to fix it even after 20 point releases of AI.

viblo avatar Mar 15 '22 08:03 viblo

Customer request to add this item to 2.21 release if possible.

toddfoust avatar Mar 16 '22 20:03 toddfoust

Same problem here. We hit it two weeks ago for the first time, using 2.18. We had it consistently, multiple times during the day on different servers, so we upgraded to 2.20, and yesterday the issue occurred again. We have to leave Application Insights disabled for now, which is not ideal. Please include this in 2.21.

BertscheS avatar Mar 23 '22 08:03 BertscheS

Hello all, we released v2.21-beta2 today, which contains a fix for this issue. If anyone can test this new beta, please write back and share whether it resolves the issue for you.

TimothyMothra avatar Jun 21 '22 16:06 TimothyMothra

Hello all, we released v2.21.0 today, which contains a fix for this issue. We're going to leave this issue open until we get confirmation that it has been resolved.

TimothyMothra avatar Jul 20 '22 17:07 TimothyMothra

We have created a backlog item to upgrade the SDK and test the fix.

ninlar avatar Jul 26 '22 20:07 ninlar

If anyone who experienced this issue can confirm that the latest version fixes the issue, it would be greatly appreciated.

cijothomas avatar Sep 16 '22 21:09 cijothomas

Apologies for the delay. We have this scheduled for the upcoming sprint, since we saw the deadlock again in integration, so we will upgrade to the new library.



ninlar avatar Sep 16 '22 21:09 ninlar

We have now been running the latest version for five months, and the issue has not occurred again. Thanks for looking into it.

BertscheS avatar Jan 11 '23 07:01 BertscheS