
Flush doesn't block until messages are actually flushed

Open DamianEdwards opened this issue 8 years ago • 47 comments

TelemetryClient.Flush() doesn't actually block returning until the messages are flushed. This means if you want to flush messages during application shutdown (like in Microsoft/ApplicationInsights-aspnetcore#304), you have to add a Thread.Sleep(1000), which isn't ideal.

It would be nice if there was a blocking version of this API that took an optional timeout, such that it blocks only until the events are flushed or the timeout is hit, so that application shutdown isn't blocked any longer than necessary.

DamianEdwards avatar Jan 10 '17 23:01 DamianEdwards

If the channel used is InMemoryChannel, then Flush() is blocking. But the AI documentation suggests invoking Thread.Sleep() after Flush() during app shutdown.

We can modify the API to have both sync and async versions of Flush(), and update the documentation accordingly.

cijothomas avatar Jan 18 '17 17:01 cijothomas

@DamianEdwards Did you use InMemoryChannel? If yes, Flush() should be blocking and there would be no need of waiting afterwards. But for ServerTelemetryChannel, Flush() is async, and you need to sleep after calling it.
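For reference, a minimal sketch of the InMemoryChannel setup being described, so that Flush() blocks (the instrumentation key is a placeholder; the channel and client types are from the AI SDK):

```csharp
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.Extensibility;

var config = new TelemetryConfiguration
{
    InstrumentationKey = "<your-ikey>",       // placeholder
    TelemetryChannel = new InMemoryChannel()  // Flush() on this channel is synchronous
};
var client = new TelemetryClient(config);

client.TrackTrace("shutting down");
client.Flush(); // blocks until the in-memory buffer is sent
```

The trade-off is that InMemoryChannel lacks ServerTelemetryChannel's retry and disk-persistence features, so this is mainly useful for the shutdown path.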

cijothomas avatar Jan 19 '17 01:01 cijothomas

We're seeing the same thing with our WebJobs/Functions implementation. We're looking for a reliable way to flush the client when our host exits.

brettsam avatar Apr 25 '17 16:04 brettsam

This is really important IMHO. Exceptions that are about to crash your process (and hence require flushing) are arguably the most important ones to track.

In addition, it seems like the current implementation falls between two stools - on one hand it's synchronous so you block your thread (https://github.com/Microsoft/ApplicationInsights-dotnet/issues/482), but on the other hand delivery isn't guaranteed so it might as well have been asynchronous (ideally returning a Task you could wait on).

Like the OP said, as things stand now you have to add some random Thread.Sleep(N) where of course there is really no way to know what N should be...

ohadschn avatar Jul 26 '17 21:07 ohadschn

@cijothomas @SergeyKanzhelev I wonder if something like this could be used as a workaround for the case of an exception about to crash the process?

try
{
    // something important
}
catch (Exception fatal)
{
    // retain original channel
    var serverChannel = TelemetryConfiguration.Active.TelemetryChannel;

    // switch to InMemoryChannel for guaranteed delivery of the offending exception
    TelemetryConfiguration.Active.TelemetryChannel = new InMemoryChannel();
    telemetryClient.TrackException(fatal);
    telemetryClient.Flush();

    // now that the most important part was tracked, optimistically try and drain the queue
    serverChannel.Flush();
    Thread.Sleep(1000);

    // shutdown/crash process
    throw;
}

ohadschn avatar Jul 26 '17 23:07 ohadschn

@ohadschn this is what we are doing here: https://github.com/Microsoft/ApplicationInsights-dotnet-server/blob/75373e57dcf8d646c54ee188461c373f2cc98939/Src/WindowsServer/WindowsServer.Shared/UnhandledExceptionTelemetryModule.cs#L71-L86 So you can just use the unhandled exception telemetry module.

SergeyKanzhelev avatar Jul 26 '17 23:07 SergeyKanzhelev

@SergeyKanzhelev This is great stuff, I'll make sure we use it. However, in the Service Fabric (SF) case I believe the module won't be enough as SF catches RunAsync exceptions and calls Environment.FailFast which won't trigger the unhandled exception handlers. So I basically copied the handler code to my top-most try/catch.

ohadschn avatar Jul 27 '17 20:07 ohadschn

We also just ran into this - we lost some application crash exceptions. 😢

The current behavior doesn't seem intuitive/expected to me so I'd say this should be treated as a "bug" instead of an "enhancement".

Also, it's the most wanted issue in this repo, so it would be great if this could get some love soon!

cwe1ss avatar Jul 28 '17 18:07 cwe1ss

@cwe1ss note that specifically for application crashes there is a pretty simple workaround, see Sergey's last comment for the code. Basically initialize a new TelemetryClient with an InMemoryChannel, use it for your TrackException, and finally Flush it. Delivery should be guaranteed (as long as the process didn't crash before flushing was complete of course).

ohadschn avatar Jul 28 '17 19:07 ohadschn

@ohadschn yes, we will use this workaround for now.

But still, I think this is a very important issue as it's just frustrating for everyone who doesn't know about this behavior - which probably will be almost everyone.

I think it would be even better if TelemetryClient implemented IDisposable and automatically flushed on Dispose (with some timeout, of course). No one likes lost exceptions. This would also play nicer with dependency injection tools like Microsoft.Extensions.DependencyInjection that automatically dispose their services.
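A disposable wrapper along these lines could be sketched today in user code (the FlushingTelemetryClient type is hypothetical, and the sleep after Flush() is still needed with ServerTelemetryChannel, which is exactly the problem this issue is about):

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationInsights;

// Hypothetical wrapper: flushes on dispose, so DI containers
// (e.g. Microsoft.Extensions.DependencyInjection) flush automatically.
public sealed class FlushingTelemetryClient : IDisposable
{
    private readonly TimeSpan _flushDelay;

    public FlushingTelemetryClient(TelemetryClient client, TimeSpan? flushDelay = null)
    {
        Client = client;
        _flushDelay = flushDelay ?? TimeSpan.FromSeconds(1);
    }

    public TelemetryClient Client { get; }

    public void Dispose()
    {
        Client.Flush();
        Thread.Sleep(_flushDelay); // still a guess until a truly blocking Flush exists
    }
}
```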

cwe1ss avatar Jul 28 '17 19:07 cwe1ss

@cwe1ss In AI's defense, almost everyone will be covered by the unhandled exception module which is enabled by default in Microsoft.ApplicationInsights.WindowsServer and Microsoft.ApplicationInsights.Web (which depends on the former), by far the most prevalent AI packages (according to nuget.org statistics). Of course there are still many users of other packages, such as ASP.NET Core where this module won't be enabled by default. And there are cases (like Service Fabric) where unhandled exceptions won't reach the top stack frame and thus won't be caught by the module (plus there's no special exception handling in the AI SF package). In such cases, it's the developer's responsibility to implement this workaround.

As for TelemetryClient : IDisposable it makes sense for sure, though keep in mind that many developers probably have static telemetry clients (indeed, the official docs recommend "an instance of TelemetryClient for each module of your app") where it won't help much. Maybe AI should hook AppDomain.ProcessExit...
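A ProcessExit hook as suggested might look like the sketch below. Note this is an assumption-laden workaround, not SDK behavior: the runtime limits how long ProcessExit handlers may run (historically a few seconds), so a long sleep here can be cut short.

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationInsights;

static class TelemetryShutdown
{
    public static void Register(TelemetryClient client)
    {
        AppDomain.CurrentDomain.ProcessExit += (sender, e) =>
        {
            client.Flush();
            Thread.Sleep(1000); // give ServerTelemetryChannel a chance to transmit
        };
    }
}
```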

ohadschn avatar Jul 29 '17 01:07 ohadschn

True. I'm in the latter group (ASP.NET Core, Service Fabric) so I'll probably just get hit by a lot more of these issues :(

cwe1ss avatar Jul 29 '17 08:07 cwe1ss

@cwe1ss I am in the same group, so let's keep in touch if we run into related issues. In the meantime, might I suggest you upvote some of the issues here (and create new ones as applicable): https://github.com/Microsoft/ApplicationInsights-servicefabric/issues, https://github.com/Microsoft/ApplicationInsights-aspnetcore/issues.

ohadschn avatar Jul 29 '17 12:07 ohadschn

I really want to see a blocking flush as well. We are using MSAI in a Win32 application via a C++/CLR library and would like to ensure that the buffered and persisted events are sent when the application exits.

Laibalion avatar Dec 21 '17 13:12 Laibalion

Stumbled upon this thread while considering adding AI to our Windows Service app. A Flush() that may return before messages are actually flushed to AI is misleading to me. In considering AI, my thoughts were:

  • I'd like to use AI; I wonder if it'll block for network on every Track* call. That would be bad.
  • Oh, there's a Flush() method, so it must be batching stuff. Good!
  • That means I'll have to call Flush() myself right before the app closes.

This thread seems to say that calling Flush() probably won't accomplish that goal. @ohadschn mentioned, "In such cases, it's the developer's responsibility to implement this workaround," and my response to that is: "The developer's responsibility would be to call Flush(). Upon successful return of Flush() they have a reasonable expectation that they're not going to lose any AI messages."

That's my perspective on it, anyway. I agree with @cwe1ss that this is more of a bug than an enhancement, and wanted to add a comment because the thread seems to be going the way of, "There's maybe some value here, but it's not really broken, and devs are responsible to use a [non-obvious] workaround." I hope the perspective of someone new to AI helps.

pettys avatar Apr 26 '18 16:04 pettys

The Flush() should be synchronous. A user can easily turn a synchronous call into an asynchronous one but not the other way around. Right now there is no way to guarantee that all messages are delivered.

The mention of Thread.Sleep(), even in Microsoft code samples, looks really unprofessional.
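The asymmetry alucao describes can be shown in code. Assuming a hypothetical truly blocking Flush(), the async wrapper is one line; the reverse direction has no equivalent:

```csharp
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;

static class TelemetryClientExtensions
{
    // If Flush() were truly blocking, an async version is trivial:
    public static Task FlushAsync(this TelemetryClient client) =>
        Task.Run(() => client.Flush());

    // The reverse is impossible: with today's fire-and-forget Flush(),
    // there is no signal to wait on for actual delivery.
}
```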

alucao avatar May 25 '18 07:05 alucao

I up-vote this issue too. It is not intuitive for the async Flush to be named Flush instead of FlushAsync, let alone that we lack a sync implementation too :(.

MirandaDotNet avatar May 25 '18 09:05 MirandaDotNet

This is what happens when you give the intern too much responsibility 🤣

nickmkk avatar Nov 27 '18 17:11 nickmkk

This should be the API:

Task FlushAsync(CancellationToken);

You can await it, Wait it or cancel it.
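Usage of that proposed shape might look like the following sketch (FlushAsync(CancellationToken) does not exist in the SDK at the time of this thread; this is the API being proposed, with a token used to bound shutdown time):

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationInsights;

// Inside an async shutdown path, with an existing telemetryClient:
var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
try
{
    await telemetryClient.FlushAsync(cts.Token); // proposed API, not yet in the SDK
}
catch (OperationCanceledException)
{
    // Timed out: exit anyway rather than block shutdown indefinitely.
}
```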

paulomorgado avatar Mar 07 '19 16:03 paulomorgado

@paulomorgado that comment should go on this other issue: https://github.com/Microsoft/ApplicationInsights-dotnet/issues/482

RobJellinghaus avatar Mar 07 '19 16:03 RobJellinghaus

Hi all, has this Flush() problem been fixed in the latest AI SDK release? We ran into the same problem with version 2.10.

Metrowerks avatar Oct 18 '19 20:10 Metrowerks

Nothing has changed in Flush in this release. I expect some update by end of year.

cijothomas avatar Oct 18 '19 20:10 cijothomas

I cloned the master branch from here and looked into the source code of TelemetryClient.Flush(), and it looks like all the internal implementations are synchronous. Was there any reason the code was not released yet, or have I missed something inside Flush() that still makes it async?

Metrowerks avatar Oct 18 '19 20:10 Metrowerks

> I cloned the master branch from here and looked into the source code of TelemetryClient.Flush(), and it looks like all the internal implementations are synchronous. Was there any reason the code was not released yet, or have I missed something inside Flush() that still makes it async?

InMemoryChannel is sync. It's the ServerTelemetryChannel which is not. It pushes items from an in-memory queue to a Transmitter. The Transmitter has its own queues of in-flight transmissions, in-memory transmissions, and on-disk transmissions; these are not flushed synchronously.

cijothomas avatar Nov 15 '19 02:11 cijothomas

@cijothomas Could you look at this, please? I looked inside the ServerTelemetryChannel, and it seems it actually does wait for the asynchronous task to finish:

https://github.com/microsoft/ApplicationInsights-dotnet/blob/7633ae849edc826a8547745b6bf9f3174715d4bd/BASE/src/ServerTelemetryChannel/ServerTelemetryChannel.cs#L289-L298

What am I missing?

tompazourek avatar Jan 17 '20 09:01 tompazourek

That still doesn't guarantee that the events are sent to the ingestion endpoint. The ServerTelemetryChannel has a Transmitter with its own buffering. TelemetryBuffer.FlushAsync() merely ensures that events are handed off to the Transmitter.

pharring avatar Jan 17 '20 16:01 pharring

Any ETA on when this will be fixed?

rongduan-zhu avatar Jul 02 '20 08:07 rongduan-zhu

Indeed, when will this be made possible? We are still using version 1.2.3 because it's the last version supporting a synchronous flush.

Laibalion avatar Jul 02 '20 10:07 Laibalion

> Indeed, when will this be made possible? We are still using version 1.2.3 because it's the last version supporting a synchronous flush.

I don't know much about the 1.2.3 version, but it's quite old and not supported anymore. To get a sync Flush, InMemoryChannel can be used.

We started working on fixing this issue, but hit several challenges. It's still an open PR. We don't have an ETA on when we'll be able to merge it.

cijothomas avatar Jul 02 '20 15:07 cijothomas

> Indeed, when will this be made possible? We are still using version 1.2.3 because it's the last version supporting a synchronous flush.
>
> I don't know much about the 1.2.3 version, but it's quite old and not supported anymore. To get a sync Flush, InMemoryChannel can be used.
>
> We started working on fixing this issue, but hit several challenges. It's still an open PR. We don't have an ETA on when we'll be able to merge it.

What does 'not supported' mean? All of our applications that use this version are happily sending telemetry data to our endpoint on Azure.

Laibalion avatar Jul 02 '20 16:07 Laibalion