
LocalVariables integration - race condition processing of cached frames

Open Bruno-DaSilva opened this issue 1 year ago • 11 comments

Is there an existing issue for this?

  • [x] I have checked for existing issues https://github.com/getsentry/sentry-javascript/issues
  • [x] I have reviewed the documentation https://docs.sentry.io/
  • [x] I am using the latest SDK release https://github.com/getsentry/sentry-javascript/releases

How do you use Sentry?

Sentry Saas (sentry.io)

Which SDK are you using?

@sentry/node

SDK Version

8.26.0

Framework Version

Node v20.12.2

Link to Sentry event

First error's local variables (endpoint fees/33/fee) are blank: https://clearbanc.sentry.io/issues/5724968061/events/13ac4e69d73e4cb6b79160830ef972f7/

First error's local variables are on the SECOND error's call (endpoint fees/44/fee) https://clearbanc.sentry.io/issues/5724968061/events/f697d4fe012e408d8fea4241d6d3c786/

Reproduction Example/SDK Setup

SDK setup:

Sentry.init({
    dsn: process.env.SENTRY_DSN,
    includeLocalVariables: true,
});

Unfortunately, because it's a race condition, it may be hard to reproduce without very specific code, error behaviour, and hardware. Hopefully you can trust that it does happen as an edge case, and perhaps we can find some way to gate the frame checks on waiting for the debugger, or something along those lines?

Steps to Reproduce

  1. Start up the Node.js app fresh
  2. Init Sentry with includeLocalVariables: true
  3. Trigger/capture an exception with local var state A
  4. Trigger/capture an exception with local var state B

Expected Result

Sentry UI should show exception #1 with state A and exception #2 with state B

Actual Result

Sentry UI shows exception #1 with NO state, and exception #2 with state A.

I dug into this and it appears there is a race condition with cachedFrames: we check the LRU cache for frames before the worker sends the frame-info message that populates the cache. So the second time the error is captured, it receives the previous error's frames, which were never consumed the first time, while the worker delivers the second error's frames too late, and so on ad nauseam.
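
To illustrate the ordering problem, here is a simplified stand-in (not the SDK's actual source; cachedFrames and the function names are placeholders):

// Simplified stand-in for the integration's flow, not the real SDK code.
const cachedFrames = new Map(); // LRU keyed by a hash of the stack frames

// Worker thread: posts the frame variables some time AFTER the debugger pause.
function onWorkerMessage({ frameHash, frames }) {
  cachedFrames.set(frameHash, frames);
}

// Main thread: the event processor runs as soon as the exception is captured.
function addLocalVariablesToEvent(event, frameHash) {
  const frames = cachedFrames.get(frameHash);
  if (frames) {
    cachedFrames.delete(frameHash);
    event.localVariables = frames; // may belong to the PREVIOUS error
  }
  // If the worker's message hasn't arrived yet, the event ships with no
  // variables, and the late entry is left behind for the next error that
  // produces the same frame hash.
  return event;
}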

Bruno-DaSilva avatar Aug 17 '24 23:08 Bruno-DaSilva

I'm hoping that, with the digging I've done to root-cause this, a high-effort repro example won't be required -- but if it's needed I can create one (it'll just take me some time that I don't immediately have).

Bruno-DaSilva avatar Aug 18 '24 00:08 Bruno-DaSilva

For some clarity/context:

This happens inconsistently but often in my application: on first startup, the first time an error is captured it will sometimes NOT have local variables attached to it.

Some extra logging I added shows that this is because the worker sends the callFrames after the exception is processed - a race condition.

To make this easier to test/reproduce, you can add a temporary sleep to the worker before it sends the variables over.
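
Something roughly like this inside the worker, right before it posts the frames back (illustrative only; the exact spot in the worker source will differ, and collectLocalVariables is a made-up placeholder):

const { parentPort } = require('node:worker_threads');

async function onPaused(callFrames, frameHash) {
  const vars = await collectLocalVariables(callFrames); // placeholder helper
  // Artificial delay to force the race: the main thread finishes processing
  // the exception before this message arrives.
  await new Promise((resolve) => setTimeout(resolve, 2000));
  parentPort.postMessage({ frameHash, vars });
}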

You can see the result in our sentry instance here:

  • First error's local variables (endpoint fees/33/fee) are blank: https://clearbanc.sentry.io/issues/5724968061/events/13ac4e69d73e4cb6b79160830ef972f7/

  • First error's local variables are on the SECOND error's call (endpoint fees/44/fee) https://clearbanc.sentry.io/issues/5724968061/events/f697d4fe012e408d8fea4241d6d3c786/


Personally, I'm thinking there should be some mechanism where the exception-capturing half waits for the worker to report on the exception. There are some problems with that, though:

  1. What if an error is new Error()'d but never thrown? Then we would wait on the worker, which only triggers for thrown exceptions, so it would never report.
  2. How can we differentiate between the debugger/variable info from the last error that happened and the latest instance? Both have the same frame hash.
  3. How does sampling errors play into this? Will a Sentry exception that is not sampled ever go through the local-variables flow to pop from the LRU? If not, stale entries might stick around in the LRU and be picked up by the next error that IS sampled.
  4. How to handle local variable rate limiting, where we stop reporting for all exceptions?

So maybe an even better mechanism would be a way of linking each Sentry exception to the exact debugger frames related to it, instead of just hoping they match. Then we wouldn't have this weird additional behaviour with frames colliding.

Thoughts?

Bruno-DaSilva avatar Aug 18 '24 14:08 Bruno-DaSilva

Hey @Bruno-DaSilva thanks for writing in!

We'll look into your issue next week as this week is Hackweek at Sentry (see #13421).

Lms24 avatar Aug 19 '24 09:08 Lms24

I think you're right. The simplest solution is to delay before checking for local variables, and we can include a timeout.
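
Roughly this shape, perhaps (a sketch with hypothetical helper names, not the SDK's actual API; the event processor would await it before building the event):

// Wait briefly for the worker's frame message, but never hold up event
// processing for longer than timeoutMs.
function waitForCachedFrames(cachedFrames, frameHash, timeoutMs = 250) {
  return new Promise((resolve) => {
    const started = Date.now();
    const poll = () => {
      const frames = cachedFrames.get(frameHash);
      if (frames !== undefined) {
        cachedFrames.delete(frameHash);
        resolve(frames);
      } else if (Date.now() - started >= timeoutMs) {
        resolve(undefined); // give up and send the event without local variables
      } else {
        setTimeout(poll, 10);
      }
    };
    poll();
  });
}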

timfish avatar Aug 27 '24 10:08 timfish

@timfish one wrench I'm thinking about: that may still have a race condition, albeit a smaller one. What if two similar requests/exceptions come in at once and both hit the debugger? Both exceptions will save to the same key in the LRU, so only one will persist, and when the delay finishes the surviving entry will be attached to just one of the two Sentry.captureException calls, right?

What do you think? It's all rooted in the fact that we can't match a thrown exception to the exact Sentry.captureException call it belongs to.

Bruno-DaSilva avatar Aug 27 '24 15:08 Bruno-DaSilva

Both exceptions will save to the same key in the LRU, so only one will persist, and when the delay finishes the surviving entry will be attached to just one of the two

Yeah, it's tricky. We could put the exception variables in a stack and only read from one end, and we'd be more sure that we're getting them in the correct order. This would likely be reliable if we only had to consider uncaught exceptions. However, for caught exceptions there's nothing stopping users from calling Sentry.captureException after another exception has occurred/been caught, so we can't really rely on ordering.
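
If "read from one end" means consuming in first-in-first-out order, a minimal sketch would be something like this (illustrative only):

// Queue of variable payloads in the order the debugger paused.
const pendingFrames = [];

function onWorkerMessage(frames) {
  pendingFrames.push(frames);
}

function takeOldestFrames() {
  // Consume the oldest entry; this only matches the right exception if
  // exceptions are captured in exactly the order they were thrown.
  return pendingFrames.shift();
}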

timfish avatar Aug 27 '24 16:08 timfish

I wonder if we could generate+inject a unique uuid on exception creation or exception throw that could then be used to match? I can't immediately think of a way of doing that for every possible exception without patching node internals though...

EDIT: could we use the memory address of the exception for this (or perhaps some other unique, unchanging ID that might already exist on it)?

Bruno-DaSilva avatar Aug 27 '24 16:08 Bruno-DaSilva

I wonder if we could generate+inject a unique uuid on exception creation or exception throw that could then be used to match?

From the debugger we can modify almost anything in the app state, so I wonder if we could write a unique ID property to the error object which we could then read off later at runtime as it passes through the Sentry SDK 🤔

timfish avatar Aug 27 '24 16:08 timfish

I wonder if we could write a unique ID property to the error object

I don't think we can do this at the point when the debugger is initially paused. We'd need to step further on in execution until the error object is part of the scope. This would again open us up to race conditions with multiple exceptions occurring closely together.

timfish avatar Aug 27 '24 17:08 timfish

One (terrible) option is to monkeypatch global.Error. But... that perhaps is too scary and dangerous.
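
For the record, it would look roughly like this (purely illustrative, not something anyone is proposing to ship):

// Purely illustrative: globally replace Error so every instance gets an ID.
const OriginalError = global.Error;

function PatchedError(...args) {
  const err = new OriginalError(...args);
  Object.defineProperty(err, '__sentry_local_variables_id__', {
    value: `${Date.now()}-${Math.random().toString(36).slice(2)}`,
    enumerable: false, // keep it out of logs/serialization
  });
  return err;
}
PatchedError.prototype = OriginalError.prototype;
global.Error = PatchedError;
// ...among other problems, this breaks `class Foo extends Error` subclasses,
// since super() now returns a plain Error instead of a Foo instance.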

Bruno-DaSilva avatar Aug 27 '24 22:08 Bruno-DaSilva

From the debugger we can modify almost anything in the app state, so I wonder if we could write a unique ID property to the error object which we could then read off later at runtime as it passes through the Sentry SDK 🤔

@timfish I think I found a way to do what you were looking for. Check this out:

  1. Worker receives a pause event with data.objectId defined. Turns out, this objectId is for the Error object 🎉

  2. Worker calls a function on the object to set a new property:

await session.post('Runtime.callFunctionOn', { functionDeclaration: 'function() { this.brunosId = "asfg"; }', objectId: data.objectId });
  3. The exception object grabbed by Sentry.captureException contains brunosId, the new property written in step 2, so long as it waits long enough for step 2 to finish (see the sketch below).
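
Putting those steps together, the worker side could look roughly like this (a sketch against the node:inspector API; the property name and ID scheme are placeholders, and the real integration's plumbing will differ):

const inspector = require('node:inspector/promises');
const session = new inspector.Session();
session.connectToMainThread(); // this runs inside the local-variables worker

session.on('Debugger.paused', async ({ params }) => {
  const { reason, data, callFrames } = params;
  if (reason !== 'exception' && reason !== 'promiseRejection') return;

  // data.objectId refers to the thrown Error object itself, so we can tag it
  // directly from the debugger before execution resumes.
  const tagId = `${Date.now()}-${Math.random().toString(36).slice(2)}`;
  await session.post('Runtime.callFunctionOn', {
    functionDeclaration: `function() { this.__sentry_debug_id__ = ${JSON.stringify(tagId)}; }`,
    objectId: data.objectId,
  });

  // ...collect local variables from callFrames and cache them under tagId, so
  // the SDK can later match the captured exception via err.__sentry_debug_id__.
});

session
  .post('Debugger.enable')
  .then(() => session.post('Debugger.setPauseOnExceptions', { state: 'uncaught' }));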

What do you think?

Bruno-DaSilva avatar Aug 27 '24 23:08 Bruno-DaSilva