sentry-javascript
fix(node): Remove ambiguity and race conditions when matching local variables to exceptions
Closes #13415
@Bruno-DaSilva has been helping me explore different ways to fix a few fundamental issues with the local variables integration.
Bruno found a way to write to the error object from the debugger which removes any ambiguity over which variables go with which exception.
This allows us to remove the stack parsing and hashing which we were using previously to match up exceptions.
Rather than write the `objectId` to the error, I have used this to write the entire local variables array directly to the error object. This completely removes the need to post the local variables from the worker thread, which eliminates any possibility of race conditions. We then later pull the local variables directly from `hint.originalException.__SENTRY_ERROR_LOCAL_VARIABLES__`.
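A minimal sketch of the mechanism described above (function names here are illustrative, not the actual integration code): the debugger-side `Runtime.callFunctionOn` effectively runs something like `attachLocalVariables` against the in-flight error, and the event processor later reads the property straight off `hint.originalException`, with no stack hashing or cross-thread matching required.

```javascript
// What the injected function conceptually does: stash the captured frames
// on the error itself, non-enumerably so it doesn't pollute serialization.
// (Names are hypothetical; only the property key matches the PR.)
function attachLocalVariables(error, frames) {
  Object.defineProperty(error, '__SENTRY_ERROR_LOCAL_VARIABLES__', {
    value: frames,
    enumerable: false,
    configurable: true,
  });
}

// What the event processor conceptually does: pull the frames back off the
// original exception held in the event hint.
function readLocalVariables(hint) {
  return hint.originalException && hint.originalException.__SENTRY_ERROR_LOCAL_VARIABLES__;
}

const err = new Error('boom');
attachLocalVariables(err, [{ function: 'doWork', vars: { userId: 42 } }]);
const frames = readLocalVariables({ originalException: err });
console.log(frames[0].vars.userId); // 42
console.log(Object.keys(err)); // [] — property is non-enumerable
```

Because the variables travel on the error object itself, there is no window in which the exception can reach the event processor before its variables do.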
size-limit report 📦
⚠️ Warning: Base artifact is not the latest one, because the latest workflow run is not done yet. This may lead to incorrect results. Try to re-run all tests to get up to date results.
| Path | Size | % Change | Change |
|---|---|---|---|
| @sentry/browser | 22.52 KB | +0.02% | +4 B 🔺 |
| @sentry/browser (incl. Tracing) | 34.78 KB | +0.02% | +4 B 🔺 |
| @sentry/browser (incl. Tracing, Replay) | 71.22 KB | +0.01% | +5 B 🔺 |
| @sentry/browser (incl. Tracing, Replay) - with treeshaking flags | 61.66 KB | -4.39% | -2.83 KB 🔽 |
| @sentry/browser (incl. Tracing, Replay with Canvas) | 75.57 KB | +0.01% | +4 B 🔺 |
| @sentry/browser (incl. Tracing, Replay, Feedback) | 88.29 KB | +0.01% | +5 B 🔺 |
| @sentry/browser (incl. Tracing, Replay, Feedback, metrics) | 90.13 KB | +0.01% | +6 B 🔺 |
| @sentry/browser (incl. metrics) | 26.83 KB | +0.02% | +4 B 🔺 |
| @sentry/browser (incl. Feedback) | 39.6 KB | +0.02% | +5 B 🔺 |
| @sentry/browser (incl. sendFeedback) | 27.19 KB | +0.02% | +5 B 🔺 |
| @sentry/browser (incl. FeedbackAsync) | 31.9 KB | +0.02% | +4 B 🔺 |
| @sentry/react | 25.28 KB | +0.02% | +5 B 🔺 |
| @sentry/react (incl. Tracing) | 37.74 KB | +0.02% | +6 B 🔺 |
| @sentry/vue | 26.67 KB | +0.02% | +5 B 🔺 |
| @sentry/vue (incl. Tracing) | 36.61 KB | +0.02% | +5 B 🔺 |
| @sentry/svelte | 22.65 KB | +0.03% | +5 B 🔺 |
| CDN Bundle | 23.77 KB | +0.03% | +7 B 🔺 |
| CDN Bundle (incl. Tracing) | 36.49 KB | +0.02% | +6 B 🔺 |
| CDN Bundle (incl. Tracing, Replay) | 70.9 KB | +0.01% | +7 B 🔺 |
| CDN Bundle (incl. Tracing, Replay, Feedback) | 76.21 KB | +0.01% | +7 B 🔺 |
| CDN Bundle - uncompressed | 69.63 KB | +0.02% | +12 B 🔺 |
| CDN Bundle (incl. Tracing) - uncompressed | 108.21 KB | +0.02% | +12 B 🔺 |
| CDN Bundle (incl. Tracing, Replay) - uncompressed | 219.86 KB | +0.01% | +12 B 🔺 |
| CDN Bundle (incl. Tracing, Replay, Feedback) - uncompressed | 233.05 KB | +0.01% | +12 B 🔺 |
| @sentry/nextjs (client) | 37.51 KB | +0.01% | +3 B 🔺 |
| @sentry/sveltekit (client) | 35.35 KB | +0.02% | +6 B 🔺 |
| @sentry/node | 114.63 KB | -1.54% | -1.79 KB 🔽 |
| @sentry/node - without tracing | 88.24 KB | -1.98% | -1.78 KB 🔽 |
| @sentry/aws-serverless | 97.64 KB | -1.82% | -1.8 KB 🔽 |
| @sentry/browser - with treeshaking flags | 21.3 KB | added | added |
Ok so with the async debugger via worker, this results in a memory leak. I guess this needs some cleanup.
Do we have to worry about object size with a large number of errors? Or a bunch of long lived error objects? Just trying to think if there's any edge cases with storing the local vars on the error objects themselves.
If we foresee any problems with that, then, similar to the original approach, it might be worth still using an extra data structure to actually store the local variable state -- and enforce a configurable maximum size (like with the LRU). And then have some sort of pointer to it on the error object instead (like a unique id).
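The bounded-store alternative suggested above could look roughly like this (all names hypothetical): variables live in an LRU map capped at a configurable size, and only a small id goes on the error object. A `Map` preserves insertion order, so evicting the oldest entry is just deleting the first key.

```javascript
// Hypothetical sketch of an LRU-bounded store for captured variables,
// keyed by a unique id that would be written to the error object.
class LocalVariablesCache {
  constructor(maxSize) {
    this._maxSize = maxSize;
    this._map = new Map();
  }

  set(id, frames) {
    // Re-inserting moves the entry to the "most recently used" end.
    this._map.delete(id);
    this._map.set(id, frames);
    if (this._map.size > this._maxSize) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this._map.keys().next().value;
      this._map.delete(oldest);
    }
  }

  get(id) {
    return this._map.get(id);
  }
}

const cache = new LocalVariablesCache(2);
cache.set('a', [{ vars: { x: 1 } }]);
cache.set('b', [{ vars: { y: 2 } }]);
cache.set('c', [{ vars: { z: 3 } }]); // evicts 'a'
console.log(cache.get('a')); // undefined
console.log(cache.get('c')[0].vars.z); // 3
```

The trade-off versus writing variables onto the error directly is that the id lookup reintroduces a (small) matching step, but memory is strictly bounded regardless of how many long-lived errors exist.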
re: memory leaks, I wonder if there are references to the error objects lingering around somewhere. I doubt the inspector API would write objects that are unable to be garbage collected, but I suppose it's possible if for some reason the inspector is holding references to the objects it made? 🤷
> Do we have to worry about object size with a large number of errors?
It might be worth adding a limit regardless of how we're passing the variables.
> re: memory leaks, I wonder if there's references
I've managed to fix the memory leak for the async version (Node > v19) by calling `Runtime.releaseObject` after the debugger has resumed. Still looking into the sync one which does leak, just less than the async one did.
> I've managed to fix the memory leak for the async version (Node > v19) by calling `Runtime.releaseObject` after the debugger has resumed.
Niiiiiiiice, this makes sense!
Oh it stopped the memory leak but also stopped it from working.
Back to the drawing board 😂
I fixed the memory leak while keeping the feature actually functioning by delaying the cleanup.
Interestingly, although the Node v22 memory leak test fails, the memory usage does drop off and stabilises after about 30 seconds:
```
{ i: 0, memUsage: 98.875 }
{ i: 1, memUsage: 106.546875 }
{ i: 2, memUsage: 110.484375 }
{ i: 3, memUsage: 113.21875 }
{ i: 4, memUsage: 116.078125 }
{ i: 5, memUsage: 121.390625 }
{ i: 6, memUsage: 137.21875 }
{ i: 7, memUsage: 149.046875 }
{ i: 8, memUsage: 163.6875 }
{ i: 9, memUsage: 165.71875 }
{ i: 10, memUsage: 136.046875 }
{ i: 11, memUsage: 136.25 }
{ i: 12, memUsage: 136.265625 }
{ i: 13, memUsage: 136.28125 }
{ i: 14, memUsage: 136.296875 }
{ i: 15, memUsage: 136.296875 }
{ i: 16, memUsage: 136.296875 }
{ i: 17, memUsage: 136.296875 }
{ i: 18, memUsage: 136.296875 }
{ i: 19, memUsage: 136.34375 }
{ i: 20, memUsage: 136.34375 }
{ i: 21, memUsage: 136.34375 }
{ i: 22, memUsage: 136.34375 }
{ i: 23, memUsage: 136.34375 }
{ i: 24, memUsage: 136.34375 }
{ i: 25, memUsage: 136.34375 }
{ i: 26, memUsage: 136.34375 }
{ i: 27, memUsage: 136.34375 }
{ i: 28, memUsage: 136.34375 }
{ i: 29, memUsage: 136.34375 }
{ i: 30, memUsage: 136.34375 }
{ i: 31, memUsage: 136.34375 }
{ i: 32, memUsage: 94.421875 }
{ i: 33, memUsage: 90.140625 }
{ i: 34, memUsage: 90.890625 }
{ i: 35, memUsage: 91.09375 }
{ i: 36, memUsage: 91.453125 }
{ i: 37, memUsage: 91.453125 }
{ i: 38, memUsage: 91.453125 }
```
While delaying the cleanup by a fixed 10 seconds does work, this has the potential to increase memory pressure and I'm not completely happy with it.
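For context, samples like the ones above could be produced by a loop along these lines (this is a hypothetical reproduction of the measurement, not the actual test code; the figures look like resident set size in MiB):

```javascript
// Hypothetical sketch: sample resident set size in MiB once per second
// while the leak scenario is being driven elsewhere in the process.
function memUsageMiB() {
  return process.memoryUsage().rss / 1024 / 1024;
}

let i = 0;
const timer = setInterval(() => {
  console.log({ i: i++, memUsage: memUsageMiB() });
  if (i >= 3) clearInterval(timer); // shortened for the sketch
}, 1000);
```

The drop-off around `i: 32` above would then correspond to the delayed `Runtime.releaseObject` calls finally letting V8 collect the retained objects.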
> I fixed the memory leak while keeping the feature actually functioning by delaying the cleanup.
I'm glad that we've found a way that works. Something doesn't smell right to me about this though...
The v8 inspector docs are quite light on what exactly these functions do (maybe worth perusing the v8 internal source)... but I would assume `Runtime.releaseObject` simply releases the pointer the inspector is holding to the underlying object, to allow it to be garbage collected. Why would releasing that pointer cause the data we tried to save on the object not to be saved?
Surely awaiting `Runtime.callFunctionOn` should wait for the write to fully finish? I know that's not what we saw, but something smells fishy here.
What do you think is going on? You probably have a better idea than I do.
> The v8 inspector docs are quite light on what exactly these functions do
Not only what the functions do, but how all the commands should be used together! You can get a reasonable idea by looking at what others have done via GitHub search.
> What do you think is going on? You probably have a better idea than I do.
Unfortunately not. Most of what I've learnt has been through a lot of trial and error. The Chrome debugger has a protocol inspector, so you can see the commands it's sending and learn a bit from that.
From what I've observed, `Runtime.callFunctionOn` modifies the object and then retains a reference to it which stops it from being GC'd. `Runtime.releaseObject` does release the reference but also appears to revert the changes. Without digging into the v8 debugger code, I don't really know why it does this or why you'd want it to work like this.
Hey @timfish I just tried this branch in my local with and without delaying releaseObject and I get variables for both.
Can you double check on your end?
I'll be back looking at this on Monday evening on my way back from vacation.
The e2e test failures in CI suggested that the delay was required but I'll double check it.
> I just tried this branch in my local with and without delaying `releaseObject` and I get variables for both
While we get local variables without the delay in releasing, we still get a memory leak. It looks like `releaseObject` needs to be called after the debugger has resumed to avoid the memory leak.
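The sequencing that emerged from this can be sketched as follows (all names here are hypothetical; the real code drives the inspector protocol directly): object ids collected while the debugger is paused are only released after the resume has gone through, never inline in the paused callback.

```javascript
// Hypothetical sketch of deferring Runtime.releaseObject until after resume.
class DeferredReleaser {
  constructor(releaseFn) {
    // releaseFn would wrap something like:
    //   id => session.post('Runtime.releaseObject', { objectId: id })
    this._releaseFn = releaseFn;
    this._pending = [];
  }

  // Called while the debugger is paused: just remember the id.
  defer(objectId) {
    this._pending.push(objectId);
  }

  // Called once Debugger.resume has completed: release everything queued.
  flush() {
    const ids = this._pending;
    this._pending = [];
    for (const id of ids) this._releaseFn(id);
    return ids;
  }
}

const released = [];
const releaser = new DeferredReleaser(id => released.push(id));
releaser.defer('obj-1');
releaser.defer('obj-2');
console.log(released.length); // 0 — nothing released while "paused"
releaser.flush(); // after resume
console.log(released); // [ 'obj-1', 'obj-2' ]
```

Releasing too early reverts the `callFunctionOn` write; never releasing leaks the retained objects, so the queue has to drain strictly after the resume.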
> I just tried this branch in my local with and without delaying `releaseObject` and I get variables for both
>
> While we get local variables without the delay in releasing, we still get a memory leak. It looks like `releaseObject` needs to be called after the debugger has resumed to avoid the memory leak.
Nice find. That sounds incredibly annoying to have found. This is awesome though, it makes the whole local variables integration much simpler and much more robust 💪
Plot twist...
While `setImmediate` is enough for Node v20, for v22 it releases too early and local variables are missing 😬
> Plot twist...
>
> While `setImmediate` is enough for Node v20, for v22 it releases too early and local variables are missing 😬
Honestly, that is equal parts interesting and terrifying. Feels like we're relying on undefined behaviour somehow 😭
You know, if it were easy, bisecting to the exact v8 engine version that introduces that behaviour change would probably reveal a lot about the internals of Node.js, the inspector, and object releasing.
I suspect the best way to do that is to check out nodejs@v22.x and bisect through deps/v8. Node v20.x is on v8 11.3.244.4 and node v22.x is on 12.4.254.14.
But uhh, that's pretty time-consuming for a probably-useless-except-for-knowledge outcome lol. Maaaybe it'd be a bug in v8, which would be cool I guess 🤷. Or maybe this is just undefined behaviour with no guarantees because their API docs don't specify anything >.<
More just saying this out loud to myself. Would be neat but won't pay the bills :'(
> Honestly, that is equal parts interesting and terrifying. Feels like we're relying on undefined behaviour somehow
Yeah, feels like this entire debug API is prone to race conditions.
> The node v20.x is on 11.3.244.4 and node v22.x is on 12.4.254.14
Could try more node versions to find an exact version where it changes. If it ends up on a v8 upgrade, the changeset will be huge 😭
> Maaaybe it'd be a bug in v8, which would be cool i guess 🤷
Even if we find a bug in v8, it's unlikely a fix would make it into an LTS Node version anyway. We'd need to cherry pick that fix and apply it to older v8 versions which might be too much work.
It could also be a bug in the inspector module although feels less likely.
Gonna go ahead and merge this in.