haystack-core-integrations
Change from basic psycopg to psycopg_pool
Is your feature request related to a problem? Please describe.
Yes, I'm currently using a Haystack pipeline and pgvector with FastAPI to serve embedding search, but the current implementation of psycopg doesn't seem to handle reconnection after a long idle time, e.g. overnight without any calls to the API.
The exception thrown is `psycopg.OperationalError: the connection is lost`.
Describe the solution you'd like
I'd like to implement psycopg_pool to better handle lost connections. For anyone serving this via an API, a connection pool is also more scalable.
Describe alternatives you've considered
Increasing the DB connection max lifetime, but that doesn't seem like good practice. Adding a retry around the PG call also doesn't work with the current implementation, and re-initiating the connection is impossible because the connection variable is never None once it has been initialized: https://github.com/deepset-ai/haystack-core-integrations/blob/60c666624cfb20a3600a28be074697fdf4cf2553/integrations/pgvector/src/haystack_integrations/document_stores/pgvector/document_store.py#L169-L173
Additional context
I'd be happy to make a PR for this myself; I just want to make sure I'm not missing anything before working on it. Open for discussion, and correct me if I'm wrong 🙇
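For context, the caching pattern the linked lines describe can be paraphrased like this (a sketch with hypothetical names, and a fake connection standing in for psycopg; this is not the actual document store code). It shows why a retry can't help: once the attribute is set, the stale object keeps being returned.

```python
class FakeConnection:
    """Stands in for a psycopg connection (hypothetical, for illustration)."""

    def __init__(self):
        self.closed = False


class StoreSketch:
    """Rough paraphrase of the caching pattern: once the connection
    attribute is set, the same object is returned forever."""

    def __init__(self, connect):
        self._connect = connect  # factory, standing in for psycopg.connect
        self._connection = None

    @property
    def connection(self):
        if self._connection is None:
            self._connection = self._connect()
        # Even if the server dropped the socket overnight, the cached
        # object is still truthy, so no reconnect ever happens.
        return self._connection
```

With this shape, marking the connection as dead has no effect: the property keeps handing back the same stale object and the factory is never called again.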
Hey, @johantandy!
I understand the problem and I think we can do something to handle it... (although, I am not a DB expert)
My only hesitation is about using a different package: psycopg_pool.
Any other possible ideas? Or a draft of how you imagine the change?
(@silvanocerza)
Facing the same issue atm, but even with a 5-minute timeout.
Could we potentially use https://github.com/supabase/vecs ?
cc: @anakin87
Hey @anakin87 @silvanocerza, just checking if this will be worked on by the team, or are you open to a PR?
Sorry, I missed the discussion.
To be fair, I don't see the benefit of a connection pool in this case, nor how it would solve the problem.
As of now the integration has at most one open connection, so why would we need a pool of multiple connections?
The problem we have is that we open a single connection and keep it open for as long as the Document Store exists. What we should do, in my opinion, is handle the connection context correctly, something that right now we completely ignore.
If anyone wants to take care of that feel free to open a PR.
@octalpixel @johantandy any of you willing to take care of this using the connection context properly?
I would suggest taking a look at this first 👇 https://www.psycopg.org/psycopg3/docs/basic/usage.html#connection-context
This is up for grabs by anyone who wants to fix how the connection is handled; see my comment above. ☝
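For reference, psycopg 3's connection context commits on a clean exit and rolls back on an exception before closing. A small pure-Python mimic of that contract (no database required; the class name is hypothetical) looks like:

```python
class ContextConnectionMimic:
    """Mimics the contract of `with psycopg.connect(...) as conn:`:
    commit on clean exit, rollback on exception, then close."""

    def __init__(self):
        self.committed = False
        self.rolled_back = False
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.committed = True    # clean exit: commit
        else:
            self.rolled_back = True  # exception: rollback
        self.closed = True
        return False  # never swallow the exception
```

The point for the document store is that every statement executed inside such a block ends with a well-defined commit/rollback and a closed connection, instead of an indefinitely open one.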
@silvanocerza @julian-risch I would like to take up this issue.
Hi Silvano, are you suggesting opening a new connection for each statement? That would definitely make things cleaner and safer, but I'm a bit unsure about the impact of the repeated connection time on each PgvectorDocumentStore statement. (An alternative might be to use PgvectorDocumentStore as a context manager... but that would break backward compatibility!)
How about just supplying a new connection to the document store like:
```python
with psycopg.connect(...) as conn:
    db.document_store.connection = conn
    # query stuff
# connection closed
```
(At the moment this will fail, because we cannot assign conn: there is no setter method.)
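One way around the missing setter would be to expose one on the store. A minimal sketch (hypothetical class, not the actual integration code):

```python
class DocumentStoreSketch:
    """Sketch of a store that accepts an externally managed connection."""

    def __init__(self, connection=None):
        self._connection = connection

    @property
    def connection(self):
        return self._connection

    @connection.setter
    def connection(self, conn):
        # Let callers supply (or replace) a connection they manage
        # themselves, e.g. one opened with `with psycopg.connect(...)`.
        self._connection = conn
```

A caller could then attach a fresh connection at the top of each request and detach it (set it back to None) when the `with` block closes it.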
Currently, two main actions occur during connection setup:
- Some preliminary setup operations are performed (e.g., drop, create, create index, etc.).
- Two cursors are created.
I like @phiweger’s suggestion to use an external connection. This would streamline PgvectorDocumentStore’s role to handling SQL statements only, in line with the Single Responsibility Principle.
In this approach, the first argument of PgvectorDocumentStore.__init__ would be a connection, rather than a connection string (I’d prefer to avoid setting up a connection after the document store is instantiated).
PgvectorDocumentStore.__init__ would then take care of:
- Performing the setup operations.
- Creating the two cursors.
The challenge with this approach is that it would introduce backward incompatibility in the PgvectorDocumentStore constructor's arguments.
Thinking about it... @phiweger's suggestion doesn't solve the OP's problem. I'm coming to think that the best approach is to have the Document Store connect when needed, execute the appropriate SQL, and then disconnect (without maintaining an indefinite connection). I think this is the most stateless and clean approach... though, as I said, I have some small doubts about the impact of connection times.
I built a test case https://github.com/depaolim/haystack-core-integrations/blob/issue_847/integrations/pgvector/tests/test_document_store.py#L48
And I've approached a solution (which I'm far from satisfied with at the moment!)
The key is to wrap each method automatically and build a connection when needed https://github.com/depaolim/haystack-core-integrations/blob/0c3b96001d6c0a0cec77ec882afba466dfa0c179/integrations/pgvector/src/haystack_integrations/document_stores/pgvector/document_store.py#L178
(There is a sort of reference counting, and a commit/rollback/close when the connection is no longer needed.)
Open problems are the following:
- there is too much (unnecessary) refactoring (my approach was to start "moving things around", just to understand the roles of the various parts; obviously, I can prepare a "minimal delta")
- some asserts, prints, and TODOs were left behind to signal some corner cases
- EmbeddingRetriever and KeywordRetriever do not work properly (the paradox is that, at the moment, I modified the sources just to make the test cases pass, but... the modification is not correct)
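The wrapper-plus-reference-counting idea can be sketched like this (hypothetical names; a connection factory stands in for psycopg.connect, and the connection is closed only when the outermost wrapped call returns, so nested store methods share one connection):

```python
import functools


def with_connection(method):
    """Open a connection before the outermost wrapped call and close it
    afterwards; nested wrapped calls reuse the same connection."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if self._depth == 0:
            self._connection = self._connect()
        self._depth += 1
        try:
            return method(self, *args, **kwargs)
        finally:
            self._depth -= 1
            if self._depth == 0:
                # Real code would commit on success and roll back on
                # error; the sketch just closes.
                self._connection.close()
                self._connection = None
    return wrapper


class ConnectedStoreSketch:
    def __init__(self, connect):
        self._connect = connect  # factory, e.g. lambda: psycopg.connect(...)
        self._connection = None
        self._depth = 0

    @with_connection
    def write_documents(self, docs):
        # Nested call below must reuse the already-open connection.
        return self.count_documents() + len(docs)

    @with_connection
    def count_documents(self):
        return 0  # placeholder for a real SELECT COUNT(*)
```

Each top-level call gets a fresh connection and releases it on return, which is the stateless behavior discussed above, while the depth counter keeps internal method-to-method calls from opening a second connection.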
I believe there's a simpler and more deliberate approach: split the functionality into two classes, PgvectorDocumentStore and PgvectorDocumentStoreConnected.
- PgvectorDocumentStore: This class would have the same interface as the current one, with identical methods and arguments. Its role would be to manage database connections (connect, commit/rollback, disconnect) and delegate actual operations to the methods of PgvectorDocumentStoreConnected.
- PgvectorDocumentStoreConnected (or PgvectorDocumentStoreInternal): This class would function like the current PgvectorDocumentStore, but without handling database connections. It would simply have a connection member, likely passed in through its __init__ method.
Advantages of this approach:
PgvectorDocumentStoreConnected can be used independently, for example, when a client already has a database connection or wants to manage transactions in a more complex or longer-lived manner than a single method call.
Comments and observations much appreciated!
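The split could be sketched roughly like this (hypothetical method body and table name, just to show the delegation):

```python
class PgvectorDocumentStoreConnectedSketch:
    """Does the SQL work; assumes an already-open connection."""

    def __init__(self, connection):
        self.connection = connection

    def count_documents(self):
        with self.connection.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM haystack_documents")
            return cur.fetchone()[0]


class PgvectorDocumentStoreSketch:
    """Keeps the current public interface; owns the connection
    lifecycle and delegates each call to the connected class."""

    def __init__(self, connect):
        self._connect = connect  # factory, e.g. lambda: psycopg.connect(...)

    def count_documents(self):
        # The connection context handles commit/rollback/close.
        with self._connect() as conn:
            return PgvectorDocumentStoreConnectedSketch(conn).count_documents()
```

Callers who want to manage their own transaction would instantiate the connected class directly with their own connection; everyone else keeps using the outer class unchanged.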
@depaolim I like the fact that you leave the interface intact and delegate "under the hood" to PgvectorDocumentStoreConnected. This approach also has the practical advantage that you could roll out the changes you proposed above quickly, while leaving edge cases uncovered but warned about. I very much like your design!
Hey guys...
Thanks for all the inputs, discussions and proposals.
I'll have a look later this week, so we can find a way to fix this.
I think I have fixed this in #1202. You can find this change in the latest release: https://pypi.org/project/pgvector-haystack/1.1.0/
If you experiment with it, let me know if it solves your problem.