txpostgres
txpostgres copied to clipboard
`ProgrammingError` after reconnecting.
I'm using ConnectionPool
with DeadConnectionDetector
. Here's my connection function, if it helps.
This morning, my primary database failed, which happened to be during a period of high-frequency querying. So of course, psycopg2.OperationalError
appears several times, before it was replaced with ConnectionDead
. Shortly thereafter, my standby server was promoted to master. txpostgres was then able to reconnect, which [I assume] it did. But then, immediately, the following error starts: psycopg2.ProgrammingError: execute cannot be used while an asynchronous query is underway
. As far as I can tell, no queries were successfully executed until I restarted the app.
I've been digging around the txpostgres code to try to find how this could happen, but I haven't had any enlightenment so far. I was wondering if you might have any ideas.
Here are some tracebacks, but—as is usual for Twisted—they aren't very helpful.
Hi,
First of all, thanks for the report!
Which version of Twisted are you using? The culprit seems to be this exception:
ValueError: file descriptor cannot be a negative integer (-1)
which is something that has been plaguing me for a long time - there's at least one Twisted ticket that has been opened in connection with this problem:
http://twistedmatrix.com/trac/ticket/4539
As you can see, the ticket is now solved, but I've seen this error happen on our production system once, even though we're running recent Twisted. OTOH, we've had a master/slave fallback recently and everything kept working.
It would be very good if you could come up with a way to reproduce the problem. At this point, I'm not even sure if it's Twisted's or psycopg2's fault... The -1 file descriptor error is most probably Twisted, but the asynchronous query underway one - that's coming from psycopg2. txpostgres calls close()
on cursor detected to be in an error state, but perhaps under some circumstances that doesn't really clear the error...
Without a reproducible test case it's hard to debug; I might try to do some tests with hammering an instance with queries while restarting the database, but in the meantime I'm flying blind :(
Which version of Twisted are you using?
I'm using Twisted 13.2.0. Not the most recent, but within the past year.
OTOH, we've had a master/slave fallback recently and everything kept working.
I should note that I've also had failovers that have worked fine, this is the first time something went wrong.
Without a reproducible test case it's hard to debug; I might try to do some tests with hammering an instance with queries while restarting the database, but in the meantime I'm flying blind :(
I'll play around when I get a chance and see if I can come up with anything.
After doing some mild investigation: It looks like Ticket #4539 is broken again. The test_foo.py
you wrote fails on Ubuntu. (Works fine on my mac. Different reactor?)
It appears that the tests created for the previous fix (removedFromReactor, negativeOneFileDescriptor) are passing.
Thanks for the investigation! I just tried with Twisted trunk and test_foo.py
failed on my Debian laptop.
I tried with various reactors and both epoll
(the default on Linux) and poll
failed, while select
worked - and AFAICS select
is what's used on OSX.
I'll try digging deeper into that, although I'm also a bit concerned about this problem being a red herring. I'm not 100% positive that it's what has been causing the reconnection problem you got; it would be great to have a reproducible test case for that so we can be sure that fixing the -1 FD problem fixes the txpostgres bug.
OK, that one was easy: for epoll, fileno()
is called from inside addReader
, so it's obvious why my test_foo.py
was failing. If I move self.insideReactor = True
to just above self.reactor.addReader(self)
the test passes.
I'll try to write a program to bombard a Postgres instance with queries and kill the database - let's see if I manage to get a similar error to yours.
I did a few more tests with txpostgres and Twisted trunk and have been able to reproduce the -1 error, but it still managed to recover and continue querying. Here's the code:
https://gist.github.com/fa25fbae2855e8e5d2d6
I'm running it in one terminal and in another one doing:
sudo pg_ctlcluster -m fast 9.1 main restart
This gives me
Unhandled Error
Traceback (most recent call last):
File "/home/wulczer/src/twisted.git/twisted/python/log.py", line 88, in callWithLogger
return callWithContext({"system": lp}, func, *args, **kw)
File "/home/wulczer/src/twisted.git/twisted/python/log.py", line 73, in callWithContext
return context.call({ILogContext: newCtx}, func, *args, **kw)
File "/home/wulczer/src/twisted.git/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/home/wulczer/src/twisted.git/twisted/python/context.py", line 81, in callWithContext
return func(*args,**kw)
--- <exception caught here> ---
File "/home/wulczer/src/twisted.git/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
why = selectable.doRead()
File "/home/wulczer/src/txpostgres/txpostgres/txpostgres.py", line 683, in doRead
self.reactor.addReader(self)
File "/home/wulczer/src/twisted.git/twisted/internet/epollreactor.py", line 271, in addReader
EPOLLIN, EPOLLOUT)
File "/home/wulczer/src/twisted.git/twisted/internet/epollreactor.py", line 257, in _add
self._poller.register(fd, flags)
exceptions.ValueError: file descriptor cannot be a negative integer (-1)
but after that I get a reconnection and the queries keep running.
I start to suspect that the -1 FD is indeed a red herring. We should focus on reproducing the actual problem with the asynchronous query underway
error.