sentry-cocoa
sentry-cocoa copied to clipboard
Improve App Hangs detection
Description
With the current App Hangs timeout implementation, it is possible for an App Hangs to last "timeout * 1.9" without being reported.
See https://github.com/getsentry/sentry-cocoa/pull/1861#discussion_r890099824
This issue has gone three weeks without activity. In another week, I will close it.
But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog
or Status: In Progress
, I will leave it alone ... forever!
"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀
This issue has gone three weeks without activity. In another week, I will close it.
But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog
or Status: In Progress
, I will leave it alone ... forever!
"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀
@brustolin I really don't understand this one. Philip tried to explain it to me but maybe you can try to explain it?
@brustolin I really don't understand this one. Philip tried to explain it to me but maybe you can try to explain it?
For sure. We talk about this in a meeting next week ;)
Because the direct link to the comment doesn't work anymore, I'm just going to repeat what @vaind said in that PR:
Note: while this hasn't changed in this PR, I wanted to point out the issue with this approach (the same one is present in the Java watchdog AFAIK:
Sleeping for a given duration before checking a flag means the timeout is not really a "timeout", but merely a baselin, with the actual UI non-responsive time necessary for the ANR to be reported ranging between timeoutInterval
& timeoutInterval * 2
, averaging at about timeoutInterval * 1.5
. That's because the following happens:
Assume timeoutInterval = 2 seconds, i.e. the current default. Also let's call blockExecutedOnMainThread
just flag
for brewity:
-
flag= NO
is executed and a block to reset is is scheduled - say 0.1 second later, the
flag = YES
is executed on the main thread because UI still works - say another 0.1 second later, UI hangs. Current time is
start+0.2 seconds
- at
start+2 seconds
the watchdog thread wakes up, sees everything working just fine and looping. UI is still stuck at this point butflag == YES
... - in another iteration, set
flag = NO
- now the UI starts being responsive after a total of 3.7 second unresponsiveness, with the current time being at
start + 3.9 second
, i.e. just before the next check. Setsflag = YES
, since UI works now again. - next check happens after the two-second sleep, all nice and shiny again because
flag == YES
- no ANR detected.
I'm not saying this is a huge deal, just pointing out the timeout isn't really a timeout... That's why I've changed this in the Unity implementation (after realizing this issue because my tests were spuriously failing), to executing the background iteration more often (5 times more often, currently) and counting the number of iterations (ticks
) since the UI last responded. Then if the count reaches the predetermined count (5), the report is sent. Still not exact, but reduces the average ANR time before report from timetout*1.5
to timeout * 1.1
, but more importantly, it also reduces the maximum time when an ANR isn't detected to about timeout*1.2
instead of timeout*1.9999...