ApplicationInsights-Java

Rate limit or filter logging and exceptions

Open stevendick-work opened this issue 3 years ago • 5 comments

Desired outcome

A way to limit misbehaving applications that are spamming the same logging or exceptions via the App Insights agent.

Context

We had an application generate 4 million exceptions in a 6 hour period overnight in our test environment, which caused us to hit our data cap. This application's implementation is naive as it was stuck in a fail loop and the exceptions were all the same, but it's guaranteed that we're going to have misbehaving applications that don't do the right thing.

Not solutions

  • The Log Analytics workspace data cap is too crude a way of controlling this, as one misbehaving application will consume the entire data cap and we're then blind to what's happening with the rest of the platform
  • Separate App Insights instances per team or app is too much effort for us given the number of teams and applications we manage

stevendick-work avatar Jun 23 '21 06:06 stevendick-work

Hey @stevendick-swissre, were the exception stack traces the primary problem, since they tend to be large?

What do you think about a simple rate limit on the capturing of exception stack traces, and when the rate limit is hit, we just stop collecting stack traces on exception telemetry?

(The advantage of continuing to collect the basic exception telemetry after the rate limit is hit is that the rate limit would not affect any backend metrics or Portal UX, so you could probably be more aggressive with the stack trace rate limit.)
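
For illustration, here is a minimal sketch of what such a throttle could look like, written as a plain token bucket; StackTraceRateLimiter and its parameters are made-up names for this example, not the agent's actual API:

```java
/**
 * Sketch only: a token-bucket limit on stack trace capture.
 * When no token is available, the exception event itself would still be sent,
 * just without its stack trace.
 */
class StackTraceRateLimiter {

    private final double tracesPerSecond; // sustained rate
    private final double burst;           // maximum bucket size

    private double tokens;
    private long lastRefillNanos = System.nanoTime();

    StackTraceRateLimiter(double tracesPerSecond, double burst) {
        this.tracesPerSecond = tracesPerSecond;
        this.burst = burst;
        this.tokens = burst; // start with a full bucket
    }

    /** Returns true if the stack trace for this exception should still be captured. */
    synchronized boolean shouldCaptureStackTrace() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        lastRefillNanos = now;
        tokens = Math.min(burst, tokens + elapsedSeconds * tracesPerSecond);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

A limiter constructed as new StackTraceRateLimiter(10, 50) would allow a burst of up to 50 stack traces and a sustained 10 per second after that, with all other exception telemetry untouched.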

trask avatar Jun 27 '21 03:06 trask

That's a definite improvement on the current behaviour, but I don't see value in capturing exceptions without the stack trace.

These 4 million exceptions in my original example are all identical, except for the timestamp when they happened. That works out to roughly 185 exceptions a second.

I'd rather have the option to rate limit an exception to once every 1/5/60 seconds and completely drop the duplicate exceptions.
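
As a sketch of that deduplication idea, assuming exceptions are keyed by type and message (the agent may well key them differently), identical exceptions inside the interval would be dropped entirely:

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch only: drop exceptions whose type + message was already reported
 * within the last minIntervalMillis. A real implementation would also need
 * to bound the size of the map.
 */
class DuplicateExceptionFilter {

    private final long minIntervalMillis;
    private final ConcurrentHashMap<String, Long> lastReported = new ConcurrentHashMap<>();

    DuplicateExceptionFilter(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Returns true if this exception should be reported, false if it is a recent duplicate. */
    boolean shouldReport(Throwable t) {
        String key = t.getClass().getName() + "|" + t.getMessage();
        long now = System.currentTimeMillis();
        Long previous = lastReported.get(key);
        if (previous != null && now - previous < minIntervalMillis) {
            return false; // identical exception seen too recently: drop it
        }
        lastReported.put(key, now);
        return true;
    }
}
```

With minIntervalMillis set to 60_000, the 4 million identical exceptions above would have been reduced to roughly one report per minute.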

stevendick-work avatar Jun 29 '21 06:06 stevendick-work

@stevendick-swissre can you send me an email ([email protected])? It looks like you had another issue yesterday, and I want to make sure we get you sorted quickly.

trask avatar Jun 30 '21 16:06 trask

I believe there are some good hooks now in 3.4.0-BETA to address this issue:

  • rate limited sampling (to limit an unexpected spike in requests and associated telemetry)
  • exceptions are no longer captured on dependencies, since if the exceptions are important they will either get logged or bubble up all the way to the request where they will still be caught
  • sampling overrides can now be applied to log data

If more than this is needed, I'm thinking we could add an option to cut off all telemetry (within a single JVM process) once a certain threshold over a certain time period is exceeded, purely as a safety valve.
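
A rough illustration of that safety valve, written as a fixed-window counter per JVM process; the class name, threshold, and window below are invented for the example:

```java
/**
 * Sketch only: once more than maxItems telemetry items are seen within a
 * fixed window of windowMillis, drop everything for the rest of that window
 * (per JVM process).
 */
class TelemetrySafetyValve {

    private final int maxItems;
    private final long windowMillis;

    private long windowStart = System.currentTimeMillis();
    private int seen;

    TelemetrySafetyValve(int maxItems, long windowMillis) {
        this.maxItems = maxItems;
        this.windowMillis = windowMillis;
    }

    /** Returns true if this telemetry item should still be emitted. */
    synchronized boolean allow() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            windowStart = now; // start a new window
            seen = 0;
        }
        return ++seen <= maxItems;
    }
}
```

For example, new TelemetrySafetyValve(10_000, 60_000) would stop emitting anything for the remainder of a minute in which a process had already produced more than 10,000 items.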

trask avatar Sep 02 '22 22:09 trask

This looks interesting.

We finally managed to understand where to find the available attributes for the sampling overrides in the OpenTelemetry documentation.

While I knew that OpenTelemetry was an implementation detail of the App Insights agent, it wasn't clear that the OpenTelemetry specification is what defines these attributes. The App Insights agent docs should make this clearer.

stevendick-work avatar Sep 05 '22 07:09 stevendick-work

Thanks for the feedback! We've updated the docs to explain where to find the available attributes for sampling overrides:

https://learn.microsoft.com/en-us/azure/azure-monitor/app/java-standalone-sampling-overrides#span-attributes-available-for-sampling

trask avatar Nov 06 '22 16:11 trask

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 7 days. It will be closed if no further activity occurs within 7 days of this comment.

ghost avatar Nov 13 '22 20:11 ghost