node-pg-pubsub

Events stop firing on database error

Open cristianocca opened this issue 8 years ago • 19 comments

If you send a payload string longer than 8000 bytes, which makes Postgres reject it, an error is thrown somewhere that is impossible to catch. If you do manage to catch it, for example with a global exception handler in node, your app won't crash completely, but the publish/subscribe code stops working: events are sent but never received by the callback registered with addChannel. I'm wondering whether this can happen for other kinds of database errors as well.

cristianocca avatar May 22 '16 02:05 cristianocca

Looks like it happens with pretty much any query that fails to run. The issue is that query errors are not being handled, so if a reconnect is required it is never performed. I think it would be good to handle errors on all db.query calls and call retry.reset() in case a new connection is needed.
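A minimal sketch of what I mean (the safeQuery name is made up; it assumes the internal retry object from promised-retry and the db client are at hand):

function safeQuery (db, retry, text, values) {
  return new Promise(function (resolve, reject) {
    db.query(text, values, function (err, result) {
      if (err) {
        // Forget the broken connection so the next attempt reconnects
        retry.reset();
        return reject(err);
      }
      resolve(result);
    });
  });
}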

cristianocca avatar May 22 '16 07:05 cristianocca

I think this should be fixed now in 0.3.0, which I released as an additional version after the SQL injection fix. It makes publish() return a promise, which is an API change of that method and therefore warrants a minor version bump.
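For reference, roughly how the promise-returning publish() can be used (the connection string and channel names below are just placeholders):

var PGPubsub = require('pg-pubsub');

var pubsub = new PGPubsub('postgres://localhost/mydb');

pubsub.addChannel('mychannel', function (payload) {
  console.log('received:', payload);
});

// publish() now returns a promise, so an error such as a too-large
// payload can be caught instead of escaping as an uncaught exception
pubsub.publish('mychannel', { hello: 'world' })
  .catch(function (err) {
    console.error('publish failed:', err);
  });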

I've also added tests for the specific error of a too-large payload to ensure that the error is handled.

Please try the new version and report back if it doesn't solve the problem fully.

voxpelli avatar May 22 '16 12:05 voxpelli

Hello, thanks for the quick response!

The promise change is a great addition, as catching errors was impossible before (remember to update the docs!). However, for connection errors it still seems the retry mechanism fails. This is what I have tested:

  • Publish with an OK connection --> the promise works fine
  • Kill the connection (I do this by killing my own network)
  • The first publish after the connection dies works fine: the error promise is fired
  • A second publish fails to fire any promise code (neither success nor error is fired)
  • Restore the connection: the above still happens; a server restart is required

I don't think this is a timeout issue; maybe it's an issue with postgres not raising errors a second time when the connection is dead? Would a reset of the retry promise work in cases where the error is a connection error?


cristianocca avatar May 22 '16 14:05 cristianocca

Hi, any ideas on this? Can you reopen it so it doesn't get lost?

cristianocca avatar May 24 '16 21:05 cristianocca

No ideas currently – I thought I had tested this specific case, but will have to look into it again.

voxpelli avatar May 25 '16 04:05 voxpelli

I will take a look over the weekend and, if it is not fixed by then, see if I can make a code change that fixes it. Thanks for reopening!

cristianocca avatar May 25 '16 22:05 cristianocca

Testing a bit more...

Once the connection dies, the error callback of db.query fires only once; after that, all queries get queued up and never run, so neither error nor success callbacks fire, leaving that connection completely unusable. The db.on('error') callback is also never fired! Would this be a node-pg issue?

The only workaround is to reconnect to the database once a connection error is detected.

I have tested that adding the following to the publish method can help restore the connection:

// Surface connection-reset errors so the existing reconnect handling kicks in
if (err && err.code === 'ECONNRESET') {
  db.emit('error');
}

However, if you are not publishing anything but only listening, there doesn't seem to be a way to fire this error event, because no error event is emitted on connection loss until you actually run a query.

Considering how unlikely an error due to a lost connection is, this might not be that important.

cristianocca avatar May 29 '16 17:05 cristianocca

Interesting. Thanks for looking into this. Recovering from a connection failure is certainly something that should work, so I will try to find time to look into it more.

voxpelli avatar May 30 '16 17:05 voxpelli

I wrote some thoughts on this issue here: http://stackoverflow.com/questions/36756685/is-using-an-audit-table-in-postgres-to-create-triggers-for-notify-listen-a-good

But basically, one has to monitor the channel to see if it is still alive or not. Once it appears to be dead, close the Client connection, create a new Client, connect and set up all the listeners afresh.

That is the only way.

The most difficult part - monitoring the channel to detect when it dies :) I haven't done it yet, but it should be doable.
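A rough sketch of that recovery loop with plain node-pg (just an illustration, not this module's code; the connection string, retry delay and channel handling are placeholders):

var pg = require('pg');

function connectAndListen (conString, channels, onNotification) {
  var client = new pg.Client(conString);

  client.connect(function (err) {
    if (err) {
      // Could not connect, try again a bit later
      return setTimeout(function () {
        connectAndListen(conString, channels, onNotification);
      }, 5000);
    }

    channels.forEach(function (channel) {
      client.query('LISTEN "' + channel + '"');
    });

    client.on('notification', onNotification);

    // When the connection dies, throw this client away and start over
    client.on('error', function () {
      client.end();
      connectAndListen(conString, channels, onNotification);
    });
  });
}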

I have also added examples within pg-promise: LISTEN / NOTIFY, for a simple use-case.

vitaly-t avatar Jun 04 '16 13:06 vitaly-t

I see. So is the only way to run a trivial query from time to time through the same connection, and if a connection reset error happens, just reconnect or call db.emit('error') so the existing code handles reconnecting? I don't think a simple SELECT 1 every minute or so would cause any harm; maybe make the period a setting.
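Something along these lines is what I have in mind (just a sketch; the interval and error handling are examples, and db is the module's existing connection):

// Run a trivial query on the same connection every minute and force the
// existing reconnect handling to kick in if it fails with a connection error
setInterval(function () {
  db.query('SELECT 1', function (err) {
    if (err && err.code === 'ECONNRESET') {
      db.emit('error');
    }
  });
}, 60 * 1000);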

The

// Surface connection-reset errors so the existing reconnect handling kicks in
if (err && err.code === 'ECONNRESET') {
  db.emit('error');
}

line should also still be added to all requests (such as listen or notify calls), because if the connection error is first detected there, any following query fails silently with no error for some reason, so the monitoring part would fail to detect it.

cristianocca avatar Jun 04 '16 14:06 cristianocca

@cristianocca that should work ;)

line should also still be added to all requests (such as listen or notify calls) because if the connection error is first detected there

No, you don't need that. LISTEN is always executed against a fresh connection, so there is no point checking, while NOTIFY is executed against its own (or any other) connection, so it is not relevant.

Although you may want to check the listener just before sending the NOTIFY, assuming the two are both within the app, and one on the server.

vitaly-t avatar Jun 04 '16 14:06 vitaly-t

@vitaly-t Sorry, just to be sure...

I'm talking about PGPubsub.prototype.addChannel, which runs db.query('LISTEN ...'). The connection used for it could be either a new one, because it was the first time it was called, or an existing one, which could potentially fail if the connection is lost, or am I missing something? The this._getDB call looks like it will always return the existing db connection if it was previously called.

The same would happen with PGPubsub.prototype.removeChannel and PGPubsub.prototype.publish. Don't they always use an existing connection if one was already opened earlier? So if any of those hits a connection reset error (which right now does not restart the connection in any way), a future monitor checking the connection status through a trivial query won't get any error, because the connection reset error was already raised and for some reason the pg library won't raise it again on later queries.

cristianocca avatar Jun 04 '16 22:06 cristianocca

Yes, @cristianocca is right: there's only a single connection used for all LISTENs, so one can do as many LISTEN calls as one wants without requiring a million connections.

The mechanism in place builds on my promised-retry library that in this case has been set up so that it should be able to reconnect whenever the connection is lost.

Apparently the promised-retry config in this case isn't fully complete and therefore doesn't always reconnect, so I will have to look into the solution proposed by @cristianocca to solve that case.

Should perhaps look into an automated pinging mechanism as well, e.g. by generating a unique channel name on startup and periodically sending a NOTIFY to that channel; if either an error is discovered when doing that or no ping is received when expected, then fail the connection.
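A rough sketch of that idea (the channel name, interval and timeout are made up; pubsub stands for an instance of this module):

var crypto = require('crypto');
var PGPubsub = require('pg-pubsub');

var pubsub = new PGPubsub('postgres://localhost/mydb'); // placeholder connection
var pingChannel = 'pubsub_ping_' + crypto.randomBytes(8).toString('hex');
var pingIntervalMs = 30000;
var lastPingReceived = Date.now();

// Listen on our own private channel and note whenever a ping makes it back
pubsub.addChannel(pingChannel, function () {
  lastPingReceived = Date.now();
});

setInterval(function () {
  pubsub.publish(pingChannel, { sentAt: Date.now() }).catch(function () {
    // A failed publish is already a strong signal that the connection is gone
  });

  if (Date.now() - lastPingReceived > 2 * pingIntervalMs) {
    // No ping came back in time: the connection should be failed and re-established
    // (how best to trigger that is the open question in this thread)
    console.warn('pubsub connection looks dead');
  }
}, pingIntervalMs);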

voxpelli avatar Jun 06 '16 17:06 voxpelli

@voxpelli the ping frequency should be configurable, so one can set it according to the average frequency of expected notifications and not miss too many notifications while in the disconnected state.

One complication here: the strategy for when the connection is gone and not yet available again, that's a tricky one ;) You need to stretch out your retries to avoid flooding the IO ;)

vitaly-t avatar Jun 06 '16 17:06 vitaly-t

@vitaly-t I agree that a pinging mechanism should be configurable.

Regarding the strategy for when the connection is gone and not yet available: that will mostly be handled by promised-retry, which will return a promise that won't be fulfilled until a connection has been acquired (or optionally rejected after a certain number of retries, but that's not something this module currently uses).
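A generic sketch of that kind of behaviour (not the actual promised-retry code): a promise that keeps retrying with stretched delays and only resolves once an attempt succeeds:

// Retry attemptFn with exponential backoff, capped at maxDelayMs,
// and resolve only once an attempt finally succeeds
function retryUntilConnected (attemptFn, maxDelayMs) {
  var attempt = 0;

  return new Promise(function (resolve) {
    function tryOnce () {
      attemptFn().then(resolve, function () {
        attempt += 1;
        var delay = Math.min(Math.pow(2, attempt) * 1000, maxDelayMs || 60000);
        setTimeout(tryOnce, delay);
      });
    }
    tryOnce();
  });
}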

voxpelli avatar Jun 06 '16 17:06 voxpelli

@voxpelli not really related, and I may have posted it in the past - in case you decide to change it over to pg-promise, I will help you with that ;) Current approach to LISTEN / NOTIFY.

vitaly-t avatar Jun 06 '16 17:06 vitaly-t

I agree with the self-created channel used to ping the connection periodically. It won't hurt anyone as it will use the same connection that's already in use, and a very simple ping shouldn't add much overhead.

One question though: should this really be handled by this library, or is it actually an issue in the postgres library, which fails to raise the error event when there are problems with the connection? That, IMO, is a bit lame.

cristianocca avatar Jun 06 '16 22:06 cristianocca

@cristianocca @voxpelli

It won't hurt anyone as it will use the same connection...

A proper ping algorithm would check when the last normal notification occurred, to see if pinging is needed at all. It would execute a ping only if no notification has arrived recently, as per the configured ping frequency.

In this case it would be of absolutely no "harm" to anything.
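In code, that check could look roughly like this (a sketch; the interval, channel names and connection string are made up):

var PGPubsub = require('pg-pubsub');

var pubsub = new PGPubsub('postgres://localhost/mydb'); // placeholder connection
var pingIntervalMs = 30000;                             // example ping frequency
var lastActivity = Date.now();

// Every real notification already proves the connection is alive,
// so remember when the last one arrived
pubsub.addChannel('mychannel', function (payload) {
  lastActivity = Date.now();
  console.log('payload:', payload);
});

// Only send a ping when the channel has been quiet for a full interval
setInterval(function () {
  if (Date.now() - lastActivity < pingIntervalMs) {
    return; // recent traffic, no ping needed
  }
  pubsub.publish('mychannel_ping', { ping: true }).catch(function (err) {
    console.error('ping failed, connection probably dead:', err);
  });
}, pingIntervalMs);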

vitaly-t avatar Jun 06 '16 23:06 vitaly-t

My events also seem to stop firing when there's a network error.

super-cache-money avatar Nov 22 '16 21:11 super-cache-money