fix(node): Remove ambiguity and race conditions when matching local variables to exceptions

timfish opened this pull request 1 year ago

Closes #13415

@Bruno-DaSilva has been helping me explore different ways to fix a few fundamental issues with the local variables integration.

Bruno found a way to write to the error object from the debugger which removes any ambiguity over which variables go with which exception.

This allows us to remove the stack parsing and hashing which we were using previously to match up exceptions.

Rather than write the objectId to the error, I have used this to write the entire local variables array directly to the error object. This removes the need to post the local variables from the worker thread, which eliminates any possibility of race conditions. We then later pull the local variables directly from hint.originalException.__SENTRY_ERROR_LOCAL_VARIABLES__.
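
For illustration, here's a minimal sketch of what that write looks like over the inspector protocol, assuming the promises-based `node:inspector/promises` API (Node ≥ 19). The `objectId` and `frames` parameters are assumptions standing in for the id of the paused exception object and the collected variables:

```ts
import { Session } from 'node:inspector/promises';

// Sketch only: `objectId` identifies the exception object reported by the
// `Debugger.paused` event, and `frames` is the local variables array
// collected from the paused call frames.
async function attachLocalVariables(
  session: Session,
  objectId: string,
  frames: unknown[],
): Promise<void> {
  // Write the variables directly onto the error object itself, so no stack
  // parsing, hashing, or cross-thread lookup is needed to match them later.
  await session.post('Runtime.callFunctionOn', {
    objectId,
    functionDeclaration:
      'function (frames) { this.__SENTRY_ERROR_LOCAL_VARIABLES__ = frames; }',
    arguments: [{ value: frames }],
  });
}

// Later, in the event processor, the variables come straight off the
// exception: hint.originalException.__SENTRY_ERROR_LOCAL_VARIABLES__
```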

timfish avatar Aug 28 '24 12:08 timfish

size-limit report 📦

⚠️ Warning: Base artifact is not the latest one, because the latest workflow run is not done yet. This may lead to incorrect results. Try to re-run all tests to get up to date results.

| Path | Size | % Change | Change |
| --- | --- | --- | --- |
| @sentry/browser | 22.52 KB | +0.02% | +4 B 🔺 |
| @sentry/browser (incl. Tracing) | 34.78 KB | +0.02% | +4 B 🔺 |
| @sentry/browser (incl. Tracing, Replay) | 71.22 KB | +0.01% | +5 B 🔺 |
| @sentry/browser (incl. Tracing, Replay) - with treeshaking flags | 61.66 KB | -4.39% | -2.83 KB 🔽 |
| @sentry/browser (incl. Tracing, Replay with Canvas) | 75.57 KB | +0.01% | +4 B 🔺 |
| @sentry/browser (incl. Tracing, Replay, Feedback) | 88.29 KB | +0.01% | +5 B 🔺 |
| @sentry/browser (incl. Tracing, Replay, Feedback, metrics) | 90.13 KB | +0.01% | +6 B 🔺 |
| @sentry/browser (incl. metrics) | 26.83 KB | +0.02% | +4 B 🔺 |
| @sentry/browser (incl. Feedback) | 39.6 KB | +0.02% | +5 B 🔺 |
| @sentry/browser (incl. sendFeedback) | 27.19 KB | +0.02% | +5 B 🔺 |
| @sentry/browser (incl. FeedbackAsync) | 31.9 KB | +0.02% | +4 B 🔺 |
| @sentry/react | 25.28 KB | +0.02% | +5 B 🔺 |
| @sentry/react (incl. Tracing) | 37.74 KB | +0.02% | +6 B 🔺 |
| @sentry/vue | 26.67 KB | +0.02% | +5 B 🔺 |
| @sentry/vue (incl. Tracing) | 36.61 KB | +0.02% | +5 B 🔺 |
| @sentry/svelte | 22.65 KB | +0.03% | +5 B 🔺 |
| CDN Bundle | 23.77 KB | +0.03% | +7 B 🔺 |
| CDN Bundle (incl. Tracing) | 36.49 KB | +0.02% | +6 B 🔺 |
| CDN Bundle (incl. Tracing, Replay) | 70.9 KB | +0.01% | +7 B 🔺 |
| CDN Bundle (incl. Tracing, Replay, Feedback) | 76.21 KB | +0.01% | +7 B 🔺 |
| CDN Bundle - uncompressed | 69.63 KB | +0.02% | +12 B 🔺 |
| CDN Bundle (incl. Tracing) - uncompressed | 108.21 KB | +0.02% | +12 B 🔺 |
| CDN Bundle (incl. Tracing, Replay) - uncompressed | 219.86 KB | +0.01% | +12 B 🔺 |
| CDN Bundle (incl. Tracing, Replay, Feedback) - uncompressed | 233.05 KB | +0.01% | +12 B 🔺 |
| @sentry/nextjs (client) | 37.51 KB | +0.01% | +3 B 🔺 |
| @sentry/sveltekit (client) | 35.35 KB | +0.02% | +6 B 🔺 |
| @sentry/node | 114.63 KB | -1.54% | -1.79 KB 🔽 |
| @sentry/node - without tracing | 88.24 KB | -1.98% | -1.78 KB 🔽 |
| @sentry/aws-serverless | 97.64 KB | -1.82% | -1.8 KB 🔽 |
| @sentry/browser - with treeshaking flags | 21.3 KB | added | added |

View base workflow run

github-actions[bot] avatar Aug 28 '24 12:08 github-actions[bot]

Ok so with the async debugger via worker, this results in a memory leak. I guess this needs some cleanup.

timfish avatar Aug 28 '24 12:08 timfish

Do we have to worry about object size with a large number of errors? Or a bunch of long-lived error objects? Just trying to think if there are any edge cases with storing the local vars on the error objects themselves.

If we foresee any problem with that, then, similar to the original approach, it might be worth still using an extra data structure to store the local variable state, with a configurable maximum size enforced (like with the LRU), and then have some sort of pointer to it on the error object instead (like a unique id).
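
As a rough illustration of that alternative (all names here are hypothetical, not existing SDK API), a size-capped store keyed by a unique id written onto the error object might look like:

```ts
// Hypothetical sketch: a bounded store for local variables, with only a
// unique id stored on the error object as a pointer into it.
class BoundedVariableStore {
  private readonly cache = new Map<string, unknown[]>();

  constructor(private readonly maxSize: number = 50) {}

  set(id: string, frames: unknown[]): void {
    if (this.cache.size >= this.maxSize) {
      // Map iteration order is insertion order, so the first key is the oldest.
      const oldest = this.cache.keys().next().value;
      if (oldest !== undefined) {
        this.cache.delete(oldest);
      }
    }
    this.cache.set(id, frames);
  }

  // Delete on read so entries don't outlive the single event that needs them.
  take(id: string): unknown[] | undefined {
    const frames = this.cache.get(id);
    this.cache.delete(id);
    return frames;
  }
}
```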


re: memory leaks, I wonder if there are references to the error objects lingering around somewhere. I doubt the inspector api would write objects that are unable to be garbage collected, but I suppose it's possible if for some reason the inspector is holding references to the objects it made? 🤷

Bruno-DaSilva avatar Aug 28 '24 17:08 Bruno-DaSilva

Do we have to worry about object size with a large number of errors?

It might be worth adding a limit regardless of how we're passing the variables.

re: memory leaks, I wonder if there are references

I've managed to fix the memory leak for the async version (Node > v19) by calling Runtime.releaseObject after the debugger has resumed. Still looking into the sync one, which also leaks, just less than the async one did.
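
Something along these lines, as a hedged sketch (the event-driven shape is an assumption; the Node inspector `Session` re-emits protocol notifications such as `Debugger.resumed` as events):

```ts
import { Session } from 'node:inspector';

// Sketch: release the inspector's reference to the exception object only
// once the debugger has resumed. Releasing earlier appeared to revert the
// write made via Runtime.callFunctionOn.
function releaseAfterResume(session: Session, objectId: string): void {
  session.once('Debugger.resumed', () => {
    session.post('Runtime.releaseObject', { objectId }, err => {
      // Best effort: a failed release only means the mirror object lives a
      // little longer before the inspector lets go of it.
      if (err) console.debug('releaseObject failed', err);
    });
  });
}
```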

timfish avatar Aug 28 '24 17:08 timfish

I've managed to fix the memory leak for the async version (Node > v19) by calling Runtime.releaseObject after the debugger has resumed.

Niiiiiiiice, this makes sense!

Bruno-DaSilva avatar Aug 28 '24 17:08 Bruno-DaSilva

Oh it stopped the memory leak but also stopped it from working.

Back to the drawing board 😂

timfish avatar Aug 28 '24 19:08 timfish

I fixed the memory leak while keeping the feature actually functioning by delaying the cleanup.

Interestingly, although the Node v22 memory leak test fails, the memory usage does drop off and stabilise after about 30 seconds:

```
{ i: 0, memUsage: 98.875 }
{ i: 1, memUsage: 106.546875 }
{ i: 2, memUsage: 110.484375 }
{ i: 3, memUsage: 113.21875 }
{ i: 4, memUsage: 116.078125 }
{ i: 5, memUsage: 121.390625 }
{ i: 6, memUsage: 137.21875 }
{ i: 7, memUsage: 149.046875 }
{ i: 8, memUsage: 163.6875 }
{ i: 9, memUsage: 165.71875 }
{ i: 10, memUsage: 136.046875 }
{ i: 11, memUsage: 136.25 }
{ i: 12, memUsage: 136.265625 }
{ i: 13, memUsage: 136.28125 }
{ i: 14, memUsage: 136.296875 }
{ i: 15, memUsage: 136.296875 }
{ i: 16, memUsage: 136.296875 }
{ i: 17, memUsage: 136.296875 }
{ i: 18, memUsage: 136.296875 }
{ i: 19, memUsage: 136.34375 }
{ i: 20, memUsage: 136.34375 }
{ i: 21, memUsage: 136.34375 }
{ i: 22, memUsage: 136.34375 }
{ i: 23, memUsage: 136.34375 }
{ i: 24, memUsage: 136.34375 }
{ i: 25, memUsage: 136.34375 }
{ i: 26, memUsage: 136.34375 }
{ i: 27, memUsage: 136.34375 }
{ i: 28, memUsage: 136.34375 }
{ i: 29, memUsage: 136.34375 }
{ i: 30, memUsage: 136.34375 }
{ i: 31, memUsage: 136.34375 }
{ i: 32, memUsage: 94.421875 }
{ i: 33, memUsage: 90.140625 }
{ i: 34, memUsage: 90.890625 }
{ i: 35, memUsage: 91.09375 }
{ i: 36, memUsage: 91.265625 }
{ i: 37, memUsage: 91.453125 }
{ i: 38, memUsage: 91.453125 }
```

While delaying the cleanup by a fixed 10 seconds does work, this has the potential to increase memory pressure and I'm not completely happy with it.
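
As a sketch, the stopgap amounts to something like this (the 10-second constant mirrors the fixed delay mentioned above; everything else is an assumed name):

```ts
import { Session } from 'node:inspector';

// Sketch of the stopgap: hold off releasing the inspector's reference for a
// fixed window so the write to the error object has definitely landed.
const RELEASE_DELAY_MS = 10_000;

function scheduleDelayedRelease(session: Session, objectId: string): void {
  const timer = setTimeout(() => {
    session.post('Runtime.releaseObject', { objectId }, () => {
      // Ignore errors: the worst case is the object lives until shutdown.
    });
  }, RELEASE_DELAY_MS);
  // Don't let pending releases keep the process alive.
  timer.unref();
}
```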

timfish avatar Aug 29 '24 17:08 timfish

I fixed the memory leak while keeping the feature actually functioning by delaying the cleanup.

I'm glad that we've found a way that works. Something doesn't smell right to me about this though... The v8 inspector docs are quite light on what exactly these functions do (maybe worth perusing the v8 internal source)... but I would assume Runtime.releaseObject simply releases the pointer the inspector is holding to the underlying object, to allow it to be garbage collected. Why would releasing that pointer cause the data we tried to save on the object not to be saved? Surely awaiting Runtime.callFunctionOn should wait for the state to fully clear and the write to fully finish? I know that's not what we saw, but something smells fishy here.

What do you think is going on? You probably have a better idea than I do.

Bruno-DaSilva avatar Aug 29 '24 18:08 Bruno-DaSilva

The v8 inspector docs are quite light on what exactly these functions do

Not only what the functions do, but how all the commands should be used together! You can get a reasonable idea by looking at what others have done via GitHub search.

What do you think is going on? You probably have a better idea than I do.

Unfortunately not. Most of what I've learnt has been through a lot of trial and error. The Chrome debugger has a protocol inspector, so you can see the commands it's sending and learn a bit from that.

From what I've observed, Runtime.callFunctionOn modifies the object and then retains a reference to it which stops it from being GC'd. Runtime.releaseObject does release the reference but also appears to revert the changes. Without digging into the v8 debugger code, I don't really know why it does this or why you'd want it to work like this.

timfish avatar Aug 29 '24 18:08 timfish

Hey @timfish, I just tried this branch locally with and without delaying releaseObject and I get variables for both.

Can you double check on your end?

Bruno-DaSilva avatar Aug 30 '24 05:08 Bruno-DaSilva

I'll be back looking at this on Monday evening on my way back from vacation.

The e2e test failures in CI suggested that the delay was required but I'll double check it.

timfish avatar Sep 01 '24 06:09 timfish

I just tried this branch locally with and without delaying releaseObject and I get variables for both

While we get local variables without the delay in releasing, we still get a memory leak. It looks like releaseObject needs to be called after the debugger has resumed to avoid the memory leak.

timfish avatar Sep 04 '24 08:09 timfish

I just tried this branch locally with and without delaying releaseObject and I get variables for both

While we get local variables without the delay in releasing, we still get a memory leak. It looks like releaseObject needs to be called after the debugger has resumed to avoid the memory leak.

Nice find. That sounds incredibly annoying to have found. This is awesome though, it makes the whole local variables integration much simpler and much more robust 💪

Bruno-DaSilva avatar Sep 04 '24 08:09 Bruno-DaSilva

Plot twist...

While setImmediate is enough for Node v20, for v22 it releases too early and local variables are missing 😬
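
For reference, the v20-era variant was essentially just a one-macrotask deferral (sketch, with assumed names):

```ts
import { Session } from 'node:inspector';

// Sketch: defer the release by a single macrotask. Sufficient on Node v20,
// but on v22 this fires before the variables have landed on the error object.
function releaseNextTurn(session: Session, objectId: string): void {
  setImmediate(() => {
    session.post('Runtime.releaseObject', { objectId }, () => {
      // best effort; a failed release just delays cleanup
    });
  });
}
```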

timfish avatar Sep 04 '24 08:09 timfish

Plot twist...

While setImmediate is enough for Node v20, for v22 it releases too early and local variables are missing 😬

Honestly, that is equal parts interesting and terrifying. Feels like we're relying on undefined behaviour somehow 😭

Bruno-DaSilva avatar Sep 04 '24 09:09 Bruno-DaSilva

You know, bisecting to the exact v8 engine version that introduces that behaviour change would probably reveal a lot about the internals of Node.js/the inspector/object releasing.

I suspect the best way to do that is to check out nodejs@v22.x and bisect through deps/v8. Node v20.x is on v8 11.3.244.4 and Node v22.x is on 12.4.254.14.

But uhh, that's pretty time-consuming for a probably-useless-except-for-knowledge outcome lol. Maaaybe it'd be a bug in v8, which would be cool I guess 🤷. Or maybe this is just undefined behaviour with no guarantees because their API docs don't specify anything >.<

More just saying this out loud to myself. Would be neat but won't pay the bills :'(

Bruno-DaSilva avatar Sep 04 '24 10:09 Bruno-DaSilva

Honestly, that is equal parts interesting and terrifying. Feels like we're relying on undefined behaviour somehow

Yeah, feels like this entire debug API is prone to race conditions.

Node v20.x is on v8 11.3.244.4 and Node v22.x is on 12.4.254.14

Could try more node versions to find an exact version where it changes. If it ends up on a v8 upgrade, the changeset will be huge 😭

Maaaybe it'd be a bug in v8, which would be cool I guess 🤷

Even if we find a bug in v8, it's unlikely a fix would make it into an LTS Node version anyway. We'd need to cherry-pick the fix and apply it to older v8 versions, which might be too much work.

It could also be a bug in the inspector module although feels less likely.

timfish avatar Sep 04 '24 10:09 timfish

Gonna go ahead and merge this in.

AbhiPrasad avatar Sep 10 '24 09:09 AbhiPrasad