manatee
postgresMgr.js null dereference
Observed on a private branch based on 99fd1bff1858: "Uncaught TypeError: Cannot call method 'removeAllListeners' of null" on a line analogous to line 1634 in 99fd1bff1858.
Admittedly we are on a private branch with other changes and it is based pretty far back in history but I believe the problem is still relevant.
Having plastered the code with debug logging and reproduced, the basic order of events is that the query error handler fires first and nulls out _pgClient, and then the client error handler fires and crashes.
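For what it's worth, the ordering can be boiled down to a tiny stand-alone sketch. The names (_pgClient, the two handlers) mirror postgresMgr.js, but this is a hypothetical reduction, not the actual manatee code:

```javascript
// Stand-in object holding the single active pg client.
var mgr = {
    _pgClient: { removeAllListeners: function () {} }
};

var crashed = null;

// Query-level 'error' path: logs the error and nulls out the client.
function onQueryError(err) {
    console.log('got err: ' + err.message);
    mgr._pgClient = null; // "nuking _pgClient due to error"
}

// Client-level 'error' path: assumes _pgClient is still non-null.
function onClientError(err) {
    console.log('got pg client error: ' + err.message);
    mgr._pgClient.removeAllListeners(); // TypeError once already nulled
}

try {
    onQueryError(new Error('terminating connection due to administrator command'));
    onClientError(new Error('read ECONNRESET'));
} catch (e) {
    crashed = e; // a TypeError, matching the observed crash
}
console.log(crashed.name); // TypeError
```

When the two handlers fire in the order shown, the second one dereferences the null left behind by the first, which is exactly the observed "Cannot call method 'removeAllListeners' of null".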
How reproducible is this problem for you? Are you running (or could you run) with --abort-on-uncaught-exception and share the resulting core file?
Uh, pretty reproducible (as in I can make it happen again with stress testing, not that it happens often). Given the vagaries of how we're running it, I'm slightly reluctant to go down this path, especially since I think the nature of the crash is relatively clear from the debug logging I added (both error handlers are called, and in an order that makes the second one crash).
I think I understand what you're saying, but without understanding this better, I'm not sure what the most appropriate fix is. (I don't want to blindly add a defensive check that papers over other potential problems.) Do you happen to still have the debug output, or anything about which errors were emitted? Was it the same underlying error that was reported on both the query and the client?
Thanks again for reporting this.
I super hate to be that guy, but I'm not sure what I can or can't share so I'm in a weird spot. At the least the error the query handler logged was:
{"name":"manatee-sitter","hostname":"XXX","pid":126104,"component":"PostgresMgr","level":40,"err":{"message":"terminating connection due to administrator command","name":"error","stack":"error: terminating connection due to administrator command\n at Connection.parseE (PATH/manatee/node_modules/pg/lib/connection.js:521:11)\n at Connection.parseMessage (PATH/manatee/node_modules/pg/lib/connection.js:351:17)\n at Socket.<anonymous> (PATH/manatee/node_modules/pg/lib/connection.js:101:22)\n at Socket.emit (events.js:95:17)\n at Socket.<anonymous> (_stream_readable.js:765:14)\n at Socket.emit (events.js:92:17)\n at emitReadable_ (_stream_readable.js:427:10)\n at emitReadable (_stream_readable.js:423:5)\n at readableAddChunk (_stream_readable.js:166:9)\n at Socket.Readable.push (_stream_readable.js:128:10)","code":"57P01"},"msg":"got err","time":"2015-09-17T22:49:38.024Z","src":{"file":"PATH/manatee/lib/postgresMgr.js","line":1675,"func":"PostgresMgr._pgQueryFini"},"v":0}
The error the client error handler logged (right before dying) was:
{"name":"manatee-sitter","hostname":"XXX","pid":126104,"component":"PostgresMgr","level":50,"err":{"message":"read ECONNRESET","name":"Error","stack":"Error: read ECONNRESET\n at errnoException (net.js:905:11)\n at TCP.onread (net.js:559:19)","code":"ECONNRESET"},"msg":"got pg client error","time":"2015-09-17T22:49:38.026Z","src":{"file":"PATH/manatee/lib/postgresMgr.js","line":1615},"v":0}
This was immediately after it decided to stop postgres in order to assume the role of sync (the last normal log line before the first error was the "trying SIGINT" message).
Thanks. Interesting that they were two different errors.
Sorry to be so sparing with the actual data, but the powers that be move slowly. Here is an extremely minimal slice of that debugging log (just time and msg, and maybe a minor comment of mine in square brackets):
2015-09-17T22:49:36.019Z _pgQueryKick dispatching request
2015-09-17T22:49:36.019Z start query 'end' callback
2015-09-17T22:49:36.019Z start _pgQueryFini
2015-09-17T22:49:36.020Z finish _pgQueryFini
2015-09-17T22:49:36.020Z end query 'end' callback
2015-09-17T22:49:37.021Z _pgQueryKick dispatching request
2015-09-17T22:49:37.021Z start query 'end' callback
2015-09-17T22:49:37.022Z start _pgQueryFini
2015-09-17T22:49:37.022Z finish _pgQueryFini
2015-09-17T22:49:37.022Z end query 'end' callback
2015-09-17T22:49:38.012Z incoming event [event was activeChange]
2015-09-17T22:49:38.019Z incoming event [event was clusterStateChange]
2015-09-17T22:49:38.020Z assuming role [role was sync]
2015-09-17T22:49:38.020Z PostgresMgr.reconfigure: entering
2015-09-17T22:49:38.021Z PostgresMgr._standby: entering
2015-09-17T22:49:38.021Z PostgresMgr.initDb: stop postgres
2015-09-17T22:49:38.021Z PostgresMgr.stop: entering
2015-09-17T22:49:38.021Z PostgresMgr.stop: trying SIGINT
2015-09-17T22:49:38.022Z _pgQueryKick dispatching request
2015-09-17T22:49:38.023Z start query 'error' callback
2015-09-17T22:49:38.023Z start _pgQueryFini
2015-09-17T22:49:38.024Z got err [this was the first error I pasted above]
2015-09-17T22:49:38.025Z _pgQueryFini: nuking _pgClient due to error
2015-09-17T22:49:38.026Z finish _pgQueryFini
2015-09-17T22:49:38.026Z finish query 'error' callback
2015-09-17T22:49:38.026Z start client error callback
2015-09-17T22:49:38.026Z got pg client error [this was the second error I pasted above]
I put up a sketch of a rewrite of some of the state management which fixes the problem, for values of "fixes" equal to "based on my limited understanding of javascript" and "could no longer repro crash".
I got the go-ahead to post a core dump so if you'd still rather see that I guess I can remove the proposed fix locally, reproduce again, etc. I guess let me know either way.
Thanks for taking a swing at fixing this. I'm not sure I understand the new invariants, though. Before the change, the invariants were that there was only ever one active client, and it was represented by self._pgClient. There was only ever one outstanding request, and it was represented by self._pgRequestOutstanding. Your change obviously assumes that it's possible to get a client's error callback fired even after it's no longer the only client. Is that what happened in your test case? The change also assumes that when a query ends, it may not be the currently outstanding request. How can that happen?
At 1645 in the new file, it seems like you leak the 'error' listener with each new query. If we issue a request that succeeds, but there's a subsequent client error (say, 10 requests later), won't we fire 'error' callbacks for all of the successful requests that happened before it?
Thanks again for taking a look. I know this code is rather hairy, and it's also tricky to get right. (It could also be made a lot simpler with a fix for node-postgres#718.)
> Thanks for taking a swing at fixing this. I'm not sure I understand the new invariants, though. Before the change, the invariants were that there was only ever one active client, and it was represented by self._pgClient. There was only ever one outstanding request, and it was represented by self._pgRequestOutstanding. Your change obviously assumes that it's possible to get a client's error callback fired even after it's no longer the only client. Is that what happened in your test case?
Uh, I am not strictly sure whether or not I observed that. I chose "self._pgClient !== selfClient" over "self._pgClient !== null" because I think it's going to be safer and more resilient (either to mistakes or to future mucking in this code).
> The change also assumes that when a query ends, it may not be the currently outstanding request. How can that happen?
Again, I don't have a concrete execution path that leads to that. Maybe the client errors (killing the request) but the query then completes anyway? I don't know whether that's possible, due to my extreme lack of knowledge of what the postgres library guarantees or doesn't. As I implied earlier, most choices in what I wrote are informed by fear and a desire for resilience in the face of uncertainty (about what current libraries do, about what people might change the surrounding code to do, etc.).
At 1645 in the new file, it seems like you leak the 'error' listener with each new query. If we issue a request that succeeds, but there's a subsequent client error (say, 10 requests later), won't we fire 'error' callbacks for all of the successful requests that happened before it?
Uh, it sure does seem that way. At the least, _pgQueryFini won't do anything stupid, since it will check the outstanding request and log the error as spurious (presumably several times). Maybe the other two handlers (query error and query completion) should remove that one listener specifically? Let me go actually learn enough JS to figure out how to do that...
Updated the PR (now that the new cut has been through a few nights of not breaking).