
Langchain connection to remote DB takes a lot of time

Open maby200 opened this issue 1 year ago • 13 comments

I am using the LangChain package to connect to a remote DB. The problem is that instantiating the SQLDatabase class takes a lot of time (sometimes more than 3 minutes). To avoid that long wait I am specifying that only one table should be loaded, but it still takes up to a minute. Here is the code:

from langchain import OpenAI
from langchain.sql_database import SQLDatabase
from sqlalchemy import create_engine

# already loaded environment vars 

llm = OpenAI(temperature=0)

engine = create_engine("postgresql+psycopg2://{user}:{passwd}@{host}:{port}/chatdatabase")

include_tables=['table_1']

db = SQLDatabase(engine, include_tables=include_tables)

...

As described in the documentation, LangChain uses SQLAlchemy in the background for making connections and loading tables. That is why I tried to make a connection with pure SQLAlchemy, without LangChain:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://{user}:{passwd}@{host}:{port}/chatdatabase")

with engine.connect() as con:
    rs = con.execute(text("select * from table_1 limit 10"))
    for row in rs:
        print(row)

And surprisingly, it takes just a few seconds.
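To pin down where the time goes, a simple timing harness around the query helps. This is a minimal sketch using an in-memory SQLite database so it runs anywhere; the URL and table contents are stand-ins for the real remote PostgreSQL setup above:

```python
import time

from sqlalchemy import create_engine, text

# Stand-in for the remote DB: an in-memory SQLite database.
# Swap the URL for the real postgresql+psycopg2 connection string.
engine = create_engine("sqlite:///:memory:")

with engine.connect() as con:
    con.execute(text("create table table_1 (id integer, name text)"))
    con.execute(text("insert into table_1 values (1, 'a'), (2, 'b')"))

    start = time.perf_counter()
    rows = con.execute(text("select * from table_1 limit 10")).fetchall()
    elapsed = time.perf_counter() - start

print(f"fetched {len(rows)} rows in {elapsed:.4f}s")
```

Timing the `SQLDatabase(...)` construction the same way would show whether the delay sits in the connection itself or in the metadata loading.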

Is there any way to make this process faster, or documentation to read about it? I've searched but had no luck.

maby200 avatar Apr 27 '23 11:04 maby200

How big is your table? How many rows, and what size?

It looks like the main issue is either that table_1 is taking a while to import, or that there is a performance issue with the SQL Database Agent itself.

Just a note on your code: the second snippet will naturally be faster, as your SELECT query only returns 10 rows; the first snippet, on the other hand, is importing a table. Two different operations are being performed.

It might be worth logging some other details here so we can try to replicate.

Veeeetzzzz avatar Apr 27 '23 22:04 Veeeetzzzz

Thank you for answering.

table_1 has 8 rows and 2 columns. It is used for tests, as a playground, so the LIMIT won't make a difference. One question I have about SQLDatabase: when I specify include_tables, will it only load that table? The database actually has 466 tables.

maby200 avatar Apr 27 '23 22:04 maby200

Thanks for the additional detail and it's good to rule out a large dataset/unoptimized DB being the issue.

You are correct: by specifying table_1 in include_tables you're telling it to import only that table and ignore the other 465.

I think it's worth us trying to replicate the issue; I will run some further tests and report back. The documentation doesn't seem to go into much detail about MySQL other than what format the connection string needs to be: https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html. I wonder if it's to do with how LangChain is using SQLAlchemy; that seems more likely than there being an issue with the create_engine function.

Veeeetzzzz avatar Apr 28 '23 09:04 Veeeetzzzz

Yes, it seems more related to LangChain, given that in both cases we connect to the same database (pure SQLAlchemy vs. LangChain) and the LangChain version takes far longer than the first.

Thank you @Veeeetzzzz :)

maby200 avatar Apr 28 '23 11:04 maby200

I did some further testing with your code to see if I could replicate the timing:

image: https://user-images.githubusercontent.com/40268197/235323941-07ac7d85-7802-46ca-88c5-d9956449ad87.png

I'm using localhost so I'm now wondering if your remote DB might be a factor? Perhaps some latency issues are at play here?

It's weird that it would run faster when using just SQLAlchemy but I'm not able to replicate the performance issues

I will see if the same happens with a remote DB

Veeeetzzzz avatar Apr 29 '23 20:04 Veeeetzzzz

Thank you for your time trying to replicate this. It would be great if you could connect to a big database (large either in number of tables or in columns per table) and try the same.


maby200 avatar Apr 29 '23 21:04 maby200

I have the same problem, the line db = SQLDatabase.from_uri(XXXX) takes more than 5 minutes to connect to a database with more than 200 tables.

tdnueve avatar Apr 30 '23 11:04 tdnueve

That is a lot of time. What could be causing this?

maby200 avatar May 02 '23 21:05 maby200

I initially thought network latency might be at play here, but I can't seem to replicate it. I tested this with a remote PostgreSQL DB on Vercel, 100 rows with data:

(screenshot)

Took longer but nothing out of the ordinary:

(screenshot)

Can either of you connect to the DB using pgAdmin? It may be worth looking at the SQL DB performance to see if there's any delay on that side.

(screenshot)

Veeeetzzzz avatar May 03 '23 10:05 Veeeetzzzz

This is not exactly related to this conversation subject but I can't find a better place to ask this question.

I have my tables in a different schema, not in the public schema. The SQLDatabase agent from LangChain is able to fetch the tables from the public schema but not from the other schema I have created. Does anyone know a solution to this?

Thanks in advance!

caffreyN avatar May 05 '23 12:05 caffreyN

I can confirm this issue. I tested the LangChain from_uri function vs. pure SQLAlchemy: the LangChain function takes 10 seconds longer to connect to the same database. Could someone take a look?

dongzeli95 avatar May 06 '23 22:05 dongzeli95

@caffreyN @dongzeli95 It might be worth opening a new issue: post your code and replication steps as a bug report. This might need a PR.

Veeeetzzzz avatar May 09 '23 12:05 Veeeetzzzz

It's taking a much longer time to connect via the LangChain API as well, compared to the SQLAlchemy and SQLite APIs.

nirav0999 avatar May 17 '23 07:05 nirav0999

I am using the db = SQLDatabase.from_uri(conn_str) function to connect to the database. It's taking around 13 minutes to connect. There are 140 tables in the database, and each table has many columns. Maybe the size of the database affects the connection time? Has anyone found a solution to this?

naam-nrj avatar Jun 23 '23 13:06 naam-nrj

I think there are other factors at play. I connect to a local/remote DB from both local and hosted environments, and my DB has 100+ rows. All operations completed within 5 seconds, but other users said they could only see the delay when LangChain was invoked.

Veeeetzzzz avatar Jun 28 '23 12:06 Veeeetzzzz

I had the same issue. The LangChain SQLDatabase class has these lines at the end, which inspect every table in the database irrespective of the include_tables list. These lines don't assign their result to anything. Commenting them out made the connection take < 0.5 s.

    self._max_string_length = max_string_length
    self._metadata = metadata or MetaData()
    # including view support if view_support = true
    # self._metadata.reflect(
    #     views=view_support,
    #     bind=self._engine,
    #     only=list(self._usable_tables),
    #     schema=self._schema,
    # )
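For reference, here is what that reflection step does in isolation, as a pure-SQLAlchemy sketch (in-memory SQLite, illustrative table names): without `only=`, `reflect()` inspects every table in the schema; with `only=`, it touches just the listed ones.

```python
from sqlalchemy import MetaData, create_engine, text

engine = create_engine("sqlite:///:memory:")
with engine.begin() as con:
    for name in ("table_1", "table_2", "table_3"):
        con.execute(text(f"create table {name} (id integer)"))

# Full reflection inspects every table in the schema.
full = MetaData()
full.reflect(bind=engine)
print(sorted(full.tables))  # ['table_1', 'table_2', 'table_3']

# Restricting with only= reflects just the named tables.
partial = MetaData()
partial.reflect(bind=engine, only=["table_1"])
print(sorted(partial.tables))  # ['table_1']
```

On a remote database with hundreds of tables, the difference between the two calls is hundreds of round trips of schema inspection, which would account for the multi-minute startup times reported above.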

samsingh12 avatar Jul 14 '23 14:07 samsingh12

Thanks to @samsingh12 for the hint, this must be fixed.

Trying the following, which makes the reflect() call a no-op during initialization:

class NoReflectMetadata(MetaData):
    def reflect(
        self,
        bind=None,
        schema=None,
        views=False,
        only=None,
        extend_existing=False,
        autoload_replace=True,
        resolve_fks=True,
        **dialect_kwargs,
    ):
        pass

usage:

    SQLDatabase.from_uri(db_url, metadata=NoReflectMetadata())

josebarrueta avatar Jul 22 '23 02:07 josebarrueta

Inside the SQLDatabase class, self._metadata.reflect() loads information about the specified tables and views and stores it in the MetaData object associated with the current SQLAlchemy session. If you comment out this call as @samsingh12 mentioned, you won't be able to use class methods such as db.get_table_info(). So if you want to use these methods, pass self._include_tables as the only argument in self._metadata.reflect(), as follows:

self._metadata.reflect(
    views=view_support,
    bind=self._engine,
    only=self._include_tables,
    schema=self._schema,
)

This way you can use the class methods on the included tables without the delayed response.
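To illustrate why the reflected metadata matters: the schema text that methods like get_table_info() build is rendered from the reflected Table objects. A rough pure-SQLAlchemy sketch (in-memory SQLite; the CreateTable rendering stands in for what LangChain produces):

```python
from sqlalchemy import MetaData, create_engine, text
from sqlalchemy.schema import CreateTable

engine = create_engine("sqlite:///:memory:")
with engine.begin() as con:
    con.execute(text("create table table_1 (id integer, name text)"))

meta = MetaData()
meta.reflect(bind=engine, only=["table_1"])  # reflect only the included table

# Render CREATE TABLE DDL from the reflected Table object; this is the
# kind of schema description the class methods need reflection for.
ddl = str(CreateTable(meta.tables["table_1"]).compile(engine))
print(ddl)
```

If reflection is skipped entirely, meta.tables is empty and there is nothing to render, which is why commenting out the reflect() call breaks those methods.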

Ali-Raza12 avatar Aug 02 '23 15:08 Ali-Raza12

@Veeeetzzzz I have been having the same issue and followed the advice here:

https://stackoverflow.com/questions/76525263/langchains-sqldatabasesequentialchain-to-query-database

However, it seems I cannot query any views

tmbluth avatar Aug 02 '23 16:08 tmbluth

Sharing the way we worked around this issue, in case it helps anyone. As @Ali-Raza12 mentioned, the suggested approach was not complete. In our case we don't know ahead of time which tables we are going to fetch, so instead of disallowing metadata fetching entirely, we deferred initialization so tables are loaded on demand.

import logging
from typing import List, Optional

from sqlalchemy import MetaData

from langchain.sql_database import SQLDatabase as BaseSQLDatabase

class SQLDatabase(BaseSQLDatabase):
    """
    This class takes the Langchain SQLDatabase class and overrides the
    get_table_info method to allow lazy loading of metadata tables.
    """

    def __init__(self, *args, **kwargs):
        super(SQLDatabase, self).__init__(*args, **kwargs)

    def get_table_info(self, table_names: Optional[List[str]] = None) -> str:
        if table_names is None:
            return ""

        loaded_tables = self._metadata.tables

        tables_to_load = []

        for table_name in table_names:
            if table_name not in loaded_tables:
                tables_to_load.append(table_name)

        if len(tables_to_load) > 0:
            self._metadata.reflect(bind=self._engine, only=tables_to_load)

        return super().get_table_info(table_names)


class LazyReflectMetadata(MetaData):
    def __init__(self):
        super().__init__()
        self._initial_reflect = True

    def reflect(
        self,
        bind=None,
        schema=None,
        views=False,
        only=None,
        extend_existing=False,
        autoload_replace=True,
        resolve_fks=True,
        **dialect_kwargs,
    ):
        if self._initial_reflect:
            logging.debug("Ignoring _initial_reflect due to initialization.")
            self._initial_reflect = False
        else:
            logging.debug("Calling reflect with tables=%s", only)
            return super().reflect(
                bind,
                schema,
                views,
                only,
                extend_existing,
                autoload_replace,
                resolve_fks,
                **dialect_kwargs,
            )

usage:

    ...
    db = SQLDatabase(engine, metadata=LazyReflectMetadata(), **kwargs)

It's a little hacky how we initialize LazyReflectMetadata, but it worked for us: tables are fetched on demand, and it sped up the db initialization.
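The on-demand idea can be sketched in isolation with pure SQLAlchemy (in-memory SQLite; the load_on_demand helper is a hypothetical name for illustration): start with an empty MetaData and reflect a table only the first time it is requested.

```python
from sqlalchemy import MetaData, Table, create_engine, text

engine = create_engine("sqlite:///:memory:")
with engine.begin() as con:
    con.execute(text("create table table_1 (id integer)"))
    con.execute(text("create table table_2 (id integer)"))

meta = MetaData()  # nothing is reflected at startup

def load_on_demand(name: str) -> Table:
    """Reflect a single table only when it is first requested."""
    if name not in meta.tables:
        meta.reflect(bind=engine, only=[name])
    return meta.tables[name]

table = load_on_demand("table_1")
print(sorted(meta.tables))  # only table_1 has been reflected so far
```

The subclass above achieves the same effect inside LangChain by swallowing the initial reflect() call and letting get_table_info trigger per-table reflection later.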

josebarrueta avatar Aug 02 '23 16:08 josebarrueta

Hi, @maby200! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue you raised is about the connection time to a remote database using the Langchain package. You mentioned that it takes more than 3 minutes, but when you tried using pure SQLAlchemy, it only took a few seconds. In the comments, some users suggested that the issue might be related to the size of the database or latency issues with the remote DB. There have also been suggestions for potential workarounds and solutions to improve the connection time.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Nov 01 '23 16:11 dosubot[bot]

Hi, it is OK now. The problem did not appear again. Thank you :)

maby200 avatar Nov 02 '23 15:11 maby200

Thank you, @maby200, for closing the issue on the LangChain repository! We appreciate your contribution.

dosubot[bot] avatar Nov 02 '23 15:11 dosubot[bot]

@Veeeetzzzz I have been having the same issue and followed the advice here:

https://stackoverflow.com/questions/76525263/langchains-sqldatabasesequentialchain-to-query-database

However, it seems I cannot query any views

Were you able to query the view?

NisaarAgharia avatar Jan 06 '24 04:01 NisaarAgharia

@Veeeetzzzz I have been having the same issue and followed the advice here: https://stackoverflow.com/questions/76525263/langchains-sqldatabasesequentialchain-to-query-database However, it seems I cannot query any views

Were you able to query the view?

Hello @NisaarAgharia

I've never had any issues with the initial import taking longer than 5 seconds, and I've never had to modify the init function, so I'm not entirely sure what the issue was here. I'm guessing it depends on the data in your DB plus other factors like remote DB latency.

Given that the OP's issue resolved itself, I can only assume it was down to the remote DB connection.

I would recommend opening a new issue and posting your code for review.

Edit: it seems like some people are having luck with setting lazy_table_reflection to True. The database schema will then only be loaded when it's needed, rather than immediately. This can be more efficient, especially for large databases. However, I would err on the side of caution, as I could not replicate this delay on a very small DB. I would expect this to work for larger DBs, but it's worth trying this plus a combination of other mitigations before considering it a bug.

Veeeetzzzz avatar Jan 06 '24 19:01 Veeeetzzzz

Hi, it is OK now. The problem did not appear again. Thank you :)

@maby200 Can you please tell us how you fixed this? I am also facing a similar issue while connecting a Redshift DB cluster to LangChain.

Adyansh-RCRM avatar Jul 15 '24 10:07 Adyansh-RCRM

Hi, it is OK now. The problem did not appear again. Thank you :)

@maby200 Can you please tell us how you fixed this? I am also facing a similar issue while connecting a Redshift DB cluster to LangChain.

Like the Stack Overflow answer at https://stackoverflow.com/questions/76525263/langchains-sqldatabasesequentialchain-to-query-database mentions, but we don't need to comment out the SQLDatabase source code; passing lazy_table_reflection=True achieves the same thing:

from sqlalchemy import create_engine
from langchain_community.utilities import SQLDatabase

engine = create_engine("mysql://{user}:{pass}@{url}/database")
db = SQLDatabase(engine=engine, lazy_table_reflection=True)

TinyCalf avatar Jul 30 '24 08:07 TinyCalf