trycatch
trycatch copied to clipboard
Potential Memory Leak
We've been trying to track down a memory leak which has been killing our prod boxes. Eventually, with the help of the heapdump module, we've been led to believe that it has something to do with either trycatch, or the way we use it. See attached screenshot of profile.
The doCatch function is the closure we pass as the catchFn to trycatch, and onError and onMessageHandled are functions referenced from doCatch.
According to the profile, it seems that these functions can't get cleaned up due to persistent references from inside trycatch, in particular via the _trycatchOnError function. This function in turn creates a reference from the child domain to the parent domain. It does this via https://github.com/CrabDude/trycatch/blob/0fc57b39d3248b0c9e90dbcf449509e30621c03f/lib/trycatch.js#L58 and https://github.com/CrabDude/trycatch/blob/0fc57b39d3248b0c9e90dbcf449509e30621c03f/lib/trycatch.js#L98. Seemingly this reference from child to parent gets cascaded and prevents the whole chain from being cleaned up.
To test this theory, I hacked trycatch locally to pass a callback to the tryFn (incidentally the catchFn doesn't need to get invoked to create to the leak). This callback would then remove the listener added at https://github.com/CrabDude/trycatch/blob/0fc57b39d3248b0c9e90dbcf449509e30621c03f/lib/trycatch.js#L58 which breaks the chain and gets rid of the leak.
I'm not sure whether this is a feasible route to go down though, as handling the error case, where the catchFn actually needs to be called, which itself could fail, would be challenging.
What is also confusing is that we do use trycatch elsewhere and I haven't seemed to notice any leaks there, though that could be due to us using it at much larger scale in this use case, which makes the leaks visible.
Oh fun. First, thanks for all the work you've done here. trycatch
is complex, and memory leaks even more so.
So, this is basically an asynchronous stack overflow, and so isn't a memory-leak in the traditional sense, so much as increased memory retention/overhead from retained parent domains while child domains remain. Normally, this is not an issue, since mostly I use trycatch
in a request oriented context (e.g., http 500s), and so inevitably, the child domains are released and the parent domains along with it.
However, it would cause a memory-leak equivalent in circumstances where you had an indefinitely growing asynchronous call stack (A calls B calls C calls D ... ad infinitum), which I suspect is what you're seeing. This is analogous to V8 historically limiting a call-stack to 200 frames for memory reasons.
So..... Since catchFn
is run in the parent domain for nesting support, your "fix" (which I suspect does address the memory leak), will break catch support when errors are asynchronously thrown from catchFn
while nesting trycatch
calls.
This is all stupid in the sense that domain
is deprecated and should really just be torn out, despite it theoretically providing better resource cleanup (probably no longer true since domains have been deprecated for a while) and some EventEmitter conveniences. Domains can be replaced with an object that stores the state of the current synchronous stack and the associated catchFn.
Not sure when I'll have a chance to get to this since it may be a decent amount of work. Then again, since we'd just be removing domains with some shallow EE instance, it may potentially be very easy.
TLDR; Domains should be removed. Your hack solves your issue, but breaks catching of async catchFn
errors when nesting.
I'm honestly trying to think of a short-term solution for you, but there really isn't one. Your fix is the your best option since the async catch nesting isn't necessarily a big concern for you.
Also, domains could be kept in a WeakMap, on the catchFn / EE state object, so that the async error nesting would work, but allow domains to be GCd. This is maybe a more attractive solution, though I'm not sure it's any easier and so doesn't help you at the moment.
Looks like WeakMap
is supported without a flag in v0.12, which is promising, because it precludes the need for the tradeoff (removing domains before an alternative exists).
Thanks a lot for the feedback!
Your mention of WeakMap
lead me to try this: https://github.com/irond13/trycatch/commit/14d364db2ee310a1d1905bb3fe24fe05ad2248b8. This didn't have the desired effect - it just shifted the memory usage elsewhere, the net usage remaining the same. It turns out that the semantics of WeakMap
are that the keys are weakly referenced and not the values, contrary to my initial assumption. In light of this, the observed behaviour makes complete sense, and I suspect makes a solution using WeakMap
impossible. If we had something like Java's WeakReference
, then I could foresee a solution, but it seems that there is no such thing in the pure JS world at the moment. https://github.com/TooTallNate/node-weak could potentially work (I haven't tried it), but it feels too heavy for my liking.
I then turned my attention to what was arguably the actual source of this issue - the potentially infinite nesting of trycatch
calls. You were correct in your assumption that this is what was happening (the Retainers pane in the heapdump I posted in the OP actually shows this pretty clearly - I just couldn't grok it at the time :)). In a nutshell what we had was the following (this is a queue listener of sorts):
function onError(err){/* handle error */}
var initiateMessageProcessing = function(){
trycatch(function doTry(){
fetchMessage(function messageReceived(err, msg){
initiateMessageProcessing(); //Start processing/fetching of next message
if (err) return onError(err);
//do actual (async) message processing here
});
},
function doCatch(err){
onError();
}
);
};
initiateMessageProcessing();
Looking at this, the infinite trycatch
nesting seems obvious. Interestingly, this doesn't actually result in an infinite domain stack (I debugged domain.enter
and the domain stack size never grew greater than around 5, mirroring the max concurrent - not cumulative - instances of doTry
running at any one time).
In this case, the nesting of trycatch
calls is just a consequence of the 'recursive' initiateMessageProcessing
call - it isn't the intent (actually the opposite - ideally each initiateMessageProcessing
run should be treated as an independant 'request'). With that in mind, I first tried to remove the nesting of domains by explicitly exiting and then re-entering the active domain around the inner initiateMessageProcessing
call, e.g.
//Inside messageReceived
var activeDomain = domain.active;
activeDomain && activeDomain.exit();
initiateMessageProcessing()
activeDomain && activeDomain.enter();
That didn't work, because as it turned out, each time that code was hit, the active domain also happened to be the next domain on the stack, so calling activeDomain.exit()
left domain.active
unchanged. This happened consistently. I tracked this down to internal node code (inside the processing of the next tick) which called domain.enter()
on a domain without checking if it was already active (trycatch
correctly does this check, but it had already entered the domain by this stage).
My final port of call was to create my own root domain and bind initiateMessageProcessing
to that, as follows
var rootDomain = domain.active || domain.create();
var initiateMessageProcessing = function(){/* as before */};
initiateMessageProcessing = rootDomain.bind(initiateMessageProcessing);
initiateMessageProcessing();
Fortunately, this worked! I'm relatively satisfied with this solution so am I'm going to go with it for now.
So, in closing, since it turned out that it was actually my code which caused the nesting of trycatch
I'm going to close this issue. I'm not entirely satisfied with my solution to this particular use case though, as it relies on implementation details of trycatch
(it was quite nice not having to know that trycatch
uses domains under the hood). If for some reason an alternative to domains does materialise some day, I'll be more than happy to update my workaround - after this expedition I'm in agreement with those who say that domains should make an exit from Node sooner rather than later (excuse the pun!).
I appreciate you working with me to reach a solution. You're a model user / contributor. ;)
I think there's definitely a solution here using WeakMap (I know you concluded the opposite), that would allow the parent domains to close, while maintaining a reference to the catch callback. This would completely eliminate all memory growth except the catch callback function, which would be relatively negligible. With long-stack-traces on, it would also keep a ~100B reference to a call stack string. Both would add up for long running processes.
There still needs to be a way to avoid these slow leaks, so I'm going to reopen this. Your ~~hack~~ solution inspires me to add a nesting opt-out, which I had never considered:
trycatch.noNest(tryFn, catchFn)
// or...
trycatch.root(tryFn, catchFn)
This will be relatively easy to add, as it would just skip the check for a currently active domain. Thanks!