haystack-core-integrations icon indicating copy to clipboard operation
haystack-core-integrations copied to clipboard

Change from basic psycopg to psycopg_pool

Open johantandy opened this issue 1 year ago • 6 comments

Is your feature request related to a problem? Please describe. Yes, i'm currently using haystack pipeline & pgvector with FastAPI to serve embedding search but seems the current implementation of psycopg doesn't handle the reconnection after a long idle time Eg. overnight without any calls to API the exception that thrown is psycopg.OperationalError: the connection is lost

Describe the solution you'd like i'd like to implement psycopg_pool to make it better in handling lost connection also for anyone who serve it via API this is more scaleable to use a connection pool.

Describe alternatives you've considered Increasing DB Connection max lifetime but seems not a good practice, make a retry call to PG also doesn't work with current implementation, re-initiate the connection are impossible because of the connection var never be None once it initialized https://github.com/deepset-ai/haystack-core-integrations/blob/60c666624cfb20a3600a28be074697fdf4cf2553/integrations/pgvector/src/haystack_integrations/document_stores/pgvector/document_store.py#L169-L173

Additional context I'd be happy to make a PR for this myself, just want to make sure that I'm not missing anything before working on this. open for discussion & also correct me if i'm wrong 🙇

johantandy avatar Jun 24 '24 16:06 johantandy

Hey, @johantandy!

I understand the problem and I think we can do something to handle it... (although, I am not a DB expert)

The only perplexity is about using a different package: psycopg_pool.

Any other possible ideas? Or a draft of how you imagine the change?

(@silvanocerza)

anakin87 avatar Jun 25 '24 11:06 anakin87

Facing the same issue atm, but even for 5mins timeout.

Could be potentially use ? https://github.com/supabase/vecs

cc: @anakin87

octalpixel avatar Jul 04 '24 22:07 octalpixel

Hey @anakin87 @silvanocerza just checking if this would worked on by the team or are you guys open to a PR

octalpixel avatar Jul 09 '24 11:07 octalpixel

Sorry, I missed the discussion.

To be fair I don't see the benefit of a connection pool in this case, nor how it would solve the problem.

As of now the integration has at most one open connection and no more so why we need a pool of multiple connections?

The problem we have is that we open a single connection and keep it open as long as the Document Store exists. What we should do in my opinion is handle the connection context correctly, something that right we completely ignore.

If anyone wants to take care of that feel free to open a PR.

silvanocerza avatar Jul 11 '24 08:07 silvanocerza

@octalpixel @johantandy any of you willing to take care of this using the connection context properly?

I would suggest taking a look at this first 👇 https://www.psycopg.org/psycopg3/docs/basic/usage.html#connection-context

silvanocerza avatar Jul 22 '24 08:07 silvanocerza

This is up for grabs by anyone that want to fix how the connection is handled, see my above comment. ☝

silvanocerza avatar Sep 02 '24 08:09 silvanocerza

@silvanocerza @julian-risch I would like to take up this issue.

srini047 avatar Oct 11 '24 08:10 srini047

@octalpixel @johantandy any of you willing to take care of this using the connection context properly?

I would suggest taking a look at this first 👇 https://www.psycopg.org/psycopg3/docs/basic/usage.html#connection-context

Hi Silvano, are you suggesting opening a new connection for each statement? That would definitely make things cleaner and safer, but I'm a bit unsure about the impact of the repeated connection time on each PgvectorDocumentStore statement. (An alternative might be to use PgvectorDocumentStore as a context... but that would break backward compatibility!)

depaolim avatar Nov 10 '24 18:11 depaolim

How about just supplying a new connection to the document store like:


with psycopg.connect(...) as conn:
    db.document_store.connection = conn
    # query stuff
 
# connection closed

(At the moment this will fail because we cannot assign conn because there is no setter method.)

phiweger avatar Nov 11 '24 19:11 phiweger

Currently, two main actions occur during connection setup:

  • Some preliminary setup operations are performed (e.g., drop, create, create index, etc.).
  • Two cursors are created.

I like @phiweger’s suggestion to use an external connection. This would streamline PgvectorDocumentStore’s role to handling SQL statements only, in line with the Single Responsibility Principle.

In this approach, the first argument of PgvectorDocumentStore.__init__ would be a connection, rather than a connection string (I’d prefer to avoid setting up a connection after the document store is instantiated).

PgvectorDocumentStore.__init__ would then take care of:

  • Performing the setup operations.
  • Creating the two cursors.

The challenge with this approach is that it would introduce backward incompatibility in the PgvectorDocumentStore constructor’s arguments

depaolim avatar Nov 11 '24 21:11 depaolim

Thinking about it... @phiweger’s suggestion doesn't solve OP problem I'm considering that the best thing is to have DocumentStore connect when needed, execute the appropriate sql and then disconnect (without maintaining an indefinite connection). I think this is the most stateless and clean approach... as I said, I have some small doubts about the impact of connection times

depaolim avatar Nov 12 '24 05:11 depaolim

I built a test case https://github.com/depaolim/haystack-core-integrations/blob/issue_847/integrations/pgvector/tests/test_document_store.py#L48

And I've approached a solution (which I'm far from satisfied with at the moment!)

The key is to wrap automatically each method and build a connection when needed https://github.com/depaolim/haystack-core-integrations/blob/0c3b96001d6c0a0cec77ec882afba466dfa0c179/integrations/pgvector/src/haystack_integrations/document_stores/pgvector/document_store.py#L178

(There is a sort of reference counting and a commit/rollback/close when the connection is no more needed)

Open problems are the following:

  • there is too much (unnecessary) refactoring (my approach is to start "moving things around", just to understand the roles of the various parts. Obviously, I can prepare a "minimal delta")
  • some asserts and print and TODOs... left back to signal some corner cases
  • EmbeddingRetriever and KeywordRetriever do not work properly (the paradox is that, at the moment, I modified the sources just to make the testcases pass, but ... the modification is not correct)

depaolim avatar Nov 17 '24 15:11 depaolim

I believe there’s a simpler and more deliberate approach: split the functionality into two classes—PgvectorDocumentStore and PgvectorDocumentStoreConnected.

  1. PgvectorDocumentStore: This class would have the same interface as the current one, with identical methods and arguments. Its role would be to manage database connections (connect, commit/rollback, disconnect) and delegate actual operations to the methods of PgvectorDocumentStoreConnected.

  2. PgvectorDocumentStoreConnected (or PgvectorDocumentStoreInternal): This class would function like the current PgvectorDocumentStore, but without handling database connections. It would simply have a connection member, likely passed in through its __init__ method.

Advantages of this approach:

  • PgvectorDocumentStoreConnected can be used independently, for example, when a client already has a database connection or wants to manage transactions in a more complex or long-lived manner than a single method call.

Comments and observations much appreciated!

depaolim avatar Nov 18 '24 05:11 depaolim

@depaolim I like the fact that you leave the interface intact and delegate "under the hood" to PgvectorDocumentStoreConnected . Also, this approach has the practical advantage that you could roll out the changes you proposed above quickly, including leaving edge cases uncovered but warned about. I very much like your design!

phiweger avatar Nov 18 '24 08:11 phiweger

Hey guys...

Thanks for all the inputs, discussions and proposals.

I'll have a look later this week, so we can find a way to fix this.

anakin87 avatar Nov 18 '24 12:11 anakin87

I think I have fixed this in #1202. You can find this change in the latest release: https://pypi.org/project/pgvector-haystack/1.1.0/

If you experiment with it, let me know if it solves your problem.

anakin87 avatar Nov 21 '24 09:11 anakin87