sentry-javascript
sentry-javascript copied to clipboard
Node ANR tracking not grouping well with GraphQL + Apollo
Is there an existing issue for this?
- [X] I have checked for existing issues https://github.com/getsentry/sentry-javascript/issues
- [X] I have reviewed the documentation https://docs.sentry.io/
- [X] I am using the latest SDK release https://github.com/getsentry/sentry-javascript/releases
How do you use Sentry?
Sentry Saas (sentry.io)
Which SDK are you using?
@sentry/node
SDK Version
7.74.0
Framework Version
No response
Link to Sentry event
No response
SDK Setup
No response
Steps to Reproduce
- Set up graphql + apollo app
- Enable Sentry SDK with ANR tracking
- Throw ANR error from graphql resolver
Expected Result
Everything is grouped well
Actual Result
Something that didn't work so well was that we got a new issue every time the problem with big GraphQL response occurred. It is probably related to the fact that the stack trace just contains Apollo server related code. I had to create a custom fingerprint rule to group the issues.
From @ollipa, raised as part of https://github.com/getsentry/sentry-javascript/issues/9307
ApplicationNotResponding: Application Not Responding for at least 1000 ms
File "app:///../node_modules/graphql/execution/execute.js", line 412, col 5, in executeFields
File "app:///../node_modules/graphql/execution/execute.js", line 925, col 10, in completeObjectValue
File "app:///../node_modules/graphql/execution/execute.js", line 646, col 12, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 707, col 25
File "app:///../node_modules/graphql/execution/execute.js", line 687, col 34, in completeListValue
File "app:///../node_modules/graphql/execution/execute.js", line 618, col 12, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 500, col 19, in executeField
File "app:///../node_modules/graphql/execution/execute.js", line 414, col 22, in executeFields
File "app:///../node_modules/graphql/execution/execute.js", line 925, col 10, in completeObjectValue
File "app:///../node_modules/graphql/execution/execute.js", line 646, col 12, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 707, col 25
File "app:///../node_modules/graphql/execution/execute.js", line 687, col 34, in completeListValue
File "app:///../node_modules/graphql/execution/execute.js", line 618, col 12, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 500, col 19, in executeField
File "app:///../node_modules/graphql/execution/execute.js", line 414, col 22, in executeFields
File "app:///../node_modules/graphql/execution/execute.js", line 925, col 10, in completeObjectValue
File "app:///../node_modules/graphql/execution/execute.js", line 806, col 10, in completeAbstractValue
File "app:///../node_modules/graphql/execution/execute.js", line 635, col 12, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 707, col 25
File "app:///../node_modules/graphql/execution/execute.js", line 687, col 34, in completeListValue
File "app:///../node_modules/graphql/execution/execute.js", line 618, col 12, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 500, col 19, in executeField
File "app:///../node_modules/graphql/execution/execute.js", line 414, col 22, in executeFields
File "app:///../node_modules/graphql/execution/execute.js", line 925, col 10, in completeObjectValue
File "app:///../node_modules/graphql/execution/execute.js", line 646, col 12, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
File "app:///../node_modules/graphql/execution/execute.js", line 497, col 9
File "node:internal/process/task_queues", line 95, col 5, in processTicksAndRejections
The recursion might be causing issues here (varying amount of frames even though root cause is the same).
Is there something we can add/improve in the server-side fingerprinting to help with this?
Or maybe this stack trace rule is causing this? https://github.com/getsentry/sentry/blob/62a16a2dcb1a5e31314036bae969988ce69dcd94/src/sentry/grouping/enhancer/enhancement-configs/newstyle%402023-01-11.txt#L70-L76
# Android ANR: Exception stack is a snapshot of the UI/main thread. The
# *outermost* in-app frame is most indicative of which user action has led to ANR,
# and that's what we want to group by. (innermost=crashing frame)
#
# Note: Newer Android SDKs send the snapshot flag with ANRs, so this rule is
# not strictly necessary.
error.mechanism:ANR invert-stacktrace=1
Is there something we can add/improve in the server-side fingerprinting to help with this
Not sure, I'm not too familiar with our grouping code here. Let me ping and ask.
Are there any known error types that should not use the stack trace for grouping? Is this more a node issue than graphql?
Is this custom fingerprint sufficient to fix the problem?
error.type:RangeError runtime.name:node -> some_fingerprint
@ollipa @timfish does this work for you?
As a fix on Sentry, we could add an internal fingerprint (we do not have support for this) rather than expecting customers to do so.
This is a node issue but it might affect other platforms since it's entirely related to how App Not Responding errors are captured. When we detect that the main event loop is blocked, we pause via the debugger interface and capture a stack trace. The exact stack trace will depend on where exactly the debugger pauses but there will usually be a common part of the stack trace.
For example, take this code:
function longRunningProcess1 {
// blocking that takes 50ms to complete
}
function longRunningProcess12{
// blocking that takes 25ms to complete
}
function longWork() {
for (let i = 0; i < 20; i++) {
longRunningProcess1();
longRunningProcess2();
}
}
longWork();
longWork will block for 1500ms and our ANR detection might capture a stack trace somewhere between 1000-1050ms. Depending on the timing we will pause in longRunningProcess1 maybe 66% of the time, longRunningProcess2 33% of the time and very occasionally we'll pause in longWork. This causes the events to be in 3 separate groups but these should probably all be grouped under longWork as this is the common "ancestor".
The GraphQL + Apollo grouping issue above appears to be caused by the fact that the long running blocking process we stop and capturing a stack trace from is recursive. The stack traces do have a common ancestor but it's distance to the top of the stack varies.
Going to close this for now given we haven't made much progress.