sentry-cocoa icon indicating copy to clipboard operation
sentry-cocoa copied to clipboard

Improve App Hangs detection

Open brustolin opened this issue 2 years ago • 4 comments

Description

With the current App Hangs timeout implementation, it is possible for an App Hangs to last "timeout * 1.9" without being reported.

See https://github.com/getsentry/sentry-cocoa/pull/1861#discussion_r890099824

Unity implementation

brustolin avatar Jun 07 '22 07:06 brustolin

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

github-actions[bot] avatar Jun 30 '22 00:06 github-actions[bot]

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

github-actions[bot] avatar Aug 12 '22 00:08 github-actions[bot]

@brustolin I really don't understand this one. Philip tried to explain it to me but maybe you can try to explain it?

kevinrenskers avatar Aug 12 '22 07:08 kevinrenskers

@brustolin I really don't understand this one. Philip tried to explain it to me but maybe you can try to explain it?

For sure. We talk about this in a meeting next week ;)

brustolin avatar Aug 12 '22 07:08 brustolin

Because the direct link to the comment doesn't work anymore, I'm just going to repeat what @vaind said in that PR:

Note: while this hasn't changed in this PR, I wanted to point out the issue with this approach (the same one is present in the Java watchdog AFAIK:

Sleeping for a given duration before checking a flag means the timeout is not really a "timeout", but merely a baselin, with the actual UI non-responsive time necessary for the ANR to be reported ranging between timeoutInterval & timeoutInterval * 2, averaging at about timeoutInterval * 1.5. That's because the following happens:

Assume timeoutInterval = 2 seconds, i.e. the current default. Also let's call blockExecutedOnMainThread just flag for brewity:

  1. flag= NO is executed and a block to reset is is scheduled
  2. say 0.1 second later, the flag = YES is executed on the main thread because UI still works
  3. say another 0.1 second later, UI hangs. Current time is start+0.2 seconds
  4. at start+2 seconds the watchdog thread wakes up, sees everything working just fine and looping. UI is still stuck at this point but flag == YES...
  5. in another iteration, set flag = NO
  6. now the UI starts being responsive after a total of 3.7 second unresponsiveness, with the current time being at start + 3.9 second, i.e. just before the next check. Sets flag = YES, since UI works now again.
  7. next check happens after the two-second sleep, all nice and shiny again because flag == YES - no ANR detected.

I'm not saying this is a huge deal, just pointing out the timeout isn't really a timeout... That's why I've changed this in the Unity implementation (after realizing this issue because my tests were spuriously failing), to executing the background iteration more often (5 times more often, currently) and counting the number of iterations (ticks) since the UI last responded. Then if the count reaches the predetermined count (5), the report is sent. Still not exact, but reduces the average ANR time before report from timetout*1.5 to timeout * 1.1, but more importantly, it also reduces the maximum time when an ANR isn't detected to about timeout*1.2 instead of timeout*1.9999...

kevinrenskers avatar Aug 16 '22 08:08 kevinrenskers