sentry-javascript
LocalVariables integration - race condition processing of cached frames
Is there an existing issue for this?
- [x] I have checked for existing issues https://github.com/getsentry/sentry-javascript/issues
- [x] I have reviewed the documentation https://docs.sentry.io/
- [x] I am using the latest SDK release https://github.com/getsentry/sentry-javascript/releases
How do you use Sentry?
Sentry Saas (sentry.io)
Which SDK are you using?
@sentry/node
SDK Version
8.26.0
Framework Version
Node v20.12.2
Link to Sentry event
First error's local variables (endpoint fees/33/fee) are blank: https://clearbanc.sentry.io/issues/5724968061/events/13ac4e69d73e4cb6b79160830ef972f7/
First error's local variables are on the SECOND error's call (endpoint fees/44/fee) https://clearbanc.sentry.io/issues/5724968061/events/f697d4fe012e408d8fea4241d6d3c786/
Reproduction Example/SDK Setup
SDK setup:
```js
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  includeLocalVariables: true,
});
```
Unfortunately, because it's a race condition, it may be hard to reproduce without very specific code, error behaviour, and hardware. Hopefully you can trust that it does happen as an edge case, and we can find some way to gate frame checks to wait for the debugger, or something along those lines?
Steps to Reproduce
- Start up the Node.js app fresh
- Init Sentry with `includeLocalVariables: true`
- Trigger/capture an exception with local var state A
- Trigger/capture an exception with local var state B
Expected Result
Sentry UI should show exception #1 with state A and exception #2 with state B
Actual Result
Sentry UI shows exception #1 with NO state, and exception #2 with state A.
I dug into this and it appears there is a race condition with `cachedFrames`: we check the LRU cache for frames before the worker sends the frame info message that populates the cache. So the second time an error is captured, it receives the previous error's frame information, which was never consumed the first time. The worker then delivers the second error's frames late, repeating ad nauseam.
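Here's a minimal sketch of the interleaving I believe is happening (the names `cachedFrames`, `frameHash`, etc. are illustrative stand-ins, not the SDK's exact internals):

```ts
// Stand-in for the SDK's LRU of frame variables, keyed by a stack hash.
const cachedFrames = new Map<string, string[]>();

// Worker side: posts collected local variables after the debugger pause.
function workerPostsFrames(frameHash: string, vars: string[]): void {
  cachedFrames.set(frameHash, vars);
}

// SDK side: the event processor consumes cached frames while building the event.
function consumeFrames(frameHash: string): string[] | undefined {
  const vars = cachedFrames.get(frameHash);
  cachedFrames.delete(frameHash);
  return vars;
}

// The race, step by step (both errors share a hash because their frames match):
consumeFrames('hash-A');                  // exception #1 captured: cache miss, no vars
workerPostsFrames('hash-A', ['state A']); // worker's message for #1 arrives too late
consumeFrames('hash-A');                  // exception #2 captured: pops #1's stale vars
workerPostsFrames('hash-A', ['state B']); // #2's vars land late, and so on forever
```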
I'm hoping that, with the digging I've done to root-cause this, a high-effort repro example won't be required -- but if one is needed I can create it (it'll just take me some time that I don't immediately have).
For some clarity/context:
This happens inconsistently but often in my application - I notice that on first startup, the first time an error is captured it will sometimes NOT have local variables attached to it.
Some extra logging I added shows that this is because the worker sends the callFrames after the exception is processed - a race condition.
To make this easier to test/reproduce, you can add a temporary sleep to the worker before it sends the variables over.
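For example (hypothetical worker code, assuming the worker posts frames via `parentPort`; the message shape is made up):

```ts
import { parentPort } from 'node:worker_threads';

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical: delay the worker's post so the SDK's cache check always wins,
// making the race deterministic instead of timing-dependent.
async function postFramesWithDelay(frameHash: string, vars: unknown): Promise<void> {
  await sleep(500); // temporary sleep, for reproduction only
  parentPort?.postMessage({ frameHash, vars });
}
```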
You can see the result in our Sentry instance here:
- First error's local variables (endpoint `fees/33/fee`) are blank: https://clearbanc.sentry.io/issues/5724968061/events/13ac4e69d73e4cb6b79160830ef972f7/
- First error's local variables are on the SECOND error's call (endpoint `fees/44/fee`): https://clearbanc.sentry.io/issues/5724968061/events/f697d4fe012e408d8fea4241d6d3c786/
Personally, I'm thinking there should be some mechanism where the exception-capturing half waits for the worker to report on the exception. There are some problems with that though:
- What if an error is `new Error()`'d but never thrown? Then we would wait for the worker, which only triggers on thrown exceptions, so it would never fire.
- How can we differentiate between the debugger/variable info from the last error that happened and the latest instance? Both have the same frame hash.
- How does error sampling play into this? Will a Sentry exception that is not sampled ever go through the local variable flow to pop from the LRU? If not, stale entries might stick around in the LRU and be picked up by the next error that IS sampled.
- How do we handle local variable rate limiting, where we stop reporting for all exceptions?
So maybe an even more ideal mechanism emerges if we have a way of somehow linking the Sentry exceptions with the exact debugger frame related to them, instead of just hoping they match. Then we don't have this weird additional behaviour with frames colliding.
Thoughts?
Hey @Bruno-DaSilva thanks for writing in!
We'll look into your issue next week as this week is Hackweek at Sentry (see #13421).
I think you're right. The simplest solution is to delay before checking for local variables, and we can include a timeout.
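Something along these lines (just a sketch; `cachedFrames` stands in for the LRU and the timings are arbitrary):

```ts
const cachedFrames = new Map<string, string[]>(); // stand-in for the LRU

// Poll the cache for a bounded window before giving up on local variables.
async function waitForFrames(frameHash: string, timeoutMs = 1000): Promise<string[] | undefined> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const vars = cachedFrames.get(frameHash);
    if (vars !== undefined) {
      cachedFrames.delete(frameHash); // consume so it can't leak onto a later event
      return vars;
    }
    await new Promise((resolve) => setTimeout(resolve, 10)); // short poll interval
  }
  return undefined; // timed out: send the event without local variables
}
```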
@timfish one wrench in that I'm thinking about is that it may still have a race condition, albeit a smaller one: what if two similar requests/exceptions come in at once and both hit the debugger? Both exceptions will save to the LRU under the same key, so only one will persist; then the delay will finish and it will be attached to just one of the two `Sentry.captureException` calls, right?
What do you think? It's all rooted in the fact that we can't match the thrown exceptions to their exact parent request from `Sentry.captureException`.
> Both exceptions will save to the LRU under the same key, so only one will persist; then the delay will finish and it will be attached to just one of the two
Yeah, it's tricky. We could put the exception variables in a stack and only read from one end, so we'd be more sure we're getting them in the correct order. This would likely be reliable if we only had to consider uncaught exceptions. However, for caught exceptions there's nothing stopping users from calling `Sentry.captureException` after another exception has occurred/been caught, so we can't really rely on ordering.
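One way to read that idea (a sketch only; it assumes capture order matches throw order, which as noted it may not):

```ts
interface FramePayload {
  frameHash: string;
  vars: string[];
}

// Pending payloads consumed oldest-first, so each capture takes the frames that
// arrived earliest rather than whatever happens to share its hash.
const pending: FramePayload[] = [];

function onWorkerFrames(payload: FramePayload): void {
  pending.push(payload);
}

function consumeOldest(frameHash: string): string[] | undefined {
  const next = pending[0];
  if (next !== undefined && next.frameHash === frameHash) {
    pending.shift();
    return next.vars;
  }
  // Order mismatch: a caught exception captured out of order breaks the scheme.
  return undefined;
}
```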
I wonder if we could generate+inject a unique uuid on exception creation or exception throw that could then be used to match? I can't immediately think of a way of doing that for every possible exception without patching node internals though...
EDIT: could we use the memory address (or perhaps some other very unique, unchanging id that might exist in an exception) for this?
> I wonder if we could generate+inject a unique uuid on exception creation or exception throw that could then be used to match?
From the debugger we can modify almost anything in the app state, so I wonder if we could write a unique ID property to the error object which we could then read off later at runtime as it passes through the Sentry SDK 🤔
> I wonder if we could write a unique ID property to the error object
So I don't think we can do this at the point when the debugger is initially paused. We'd need to step further on in execution until the error object is part of the scope. This would again open us up to race conditions with multiple exceptions occurring close together.
One (terrible) option is to monkeypatch `global.Error`. But... that perhaps is too scary and dangerous.
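To make the risks concrete, the patch would look something like this (purely illustrative; `__sentryId` is a made-up property, and this already breaks `Error` subclassing edge cases and adds a frame to every stack):

```ts
import { randomUUID } from 'node:crypto';

const OriginalError = globalThis.Error;

// Wrap Error so every construction stamps a unique id the SDK could match on.
function TaggedError(this: unknown, message?: string): Error {
  const err = new OriginalError(message);
  // __sentryId is a hypothetical marker, not a real SDK field.
  Object.defineProperty(err, '__sentryId', { value: randomUUID() });
  return err;
}
TaggedError.prototype = OriginalError.prototype;   // keep `instanceof Error` working
Object.setPrototypeOf(TaggedError, OriginalError); // keep statics like captureStackTrace
globalThis.Error = TaggedError as unknown as ErrorConstructor;
```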
> From the debugger we can modify almost anything in the app state, so I wonder if we could write a unique ID property to the error object which we could then read off later at runtime as it passes through the Sentry SDK 🤔
@timfish I think I found a way to do what you were looking for. Check this out:
1. Worker receives a pause event with `data.objectId` defined. Turns out, this objectId is for the Error object 🎉
2. Worker calls a function on the object to set a new property:

   ```js
   await session.post('Runtime.callFunctionOn', {
     functionDeclaration: 'function() { this.brunosId = "asfg"; }',
     objectId: data.objectId,
   });
   ```

3. The exception object grabbed by `Sentry.captureException` contains `this.brunosId`, the new property written in step 2, so long as it waits long enough for step 2 to finish.
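Putting the steps above together, a worker-side sketch might look like this (only `Runtime.callFunctionOn` and the pause payload come from the inspector protocol; the property name and the rest of the wiring are made up):

```ts
import { Session } from 'node:inspector/promises';
import { randomUUID } from 'node:crypto';

const session = new Session();
session.connectToMainThread(); // assuming this runs in the SDK's worker thread

session.on('Debugger.paused', async ({ params }) => {
  // For reason 'exception', params.data is the thrown Error's RemoteObject.
  const objectId = (params.data as { objectId?: string } | undefined)?.objectId;
  if (params.reason !== 'exception' || objectId === undefined) return;

  // Stamp the Error object in the paused program's heap; the SDK can later
  // read the same property off the exception in Sentry.captureException.
  const id = randomUUID();
  await session.post('Runtime.callFunctionOn', {
    functionDeclaration: `function() { this.__sentryDebugId = ${JSON.stringify(id)}; }`,
    objectId,
  });
  // ...collect local variables here and key them by `id` instead of a frame hash...
});

await session.post('Debugger.enable');
await session.post('Debugger.setPauseOnExceptions', { state: 'uncaught' });
```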
What do you think?