Langchain connection to remote DB takes a lot of time
I am using the LangChain package to connect to a remote DB. The problem is that it takes a long time (sometimes more than 3 minutes) to instantiate the SQLDatabase class. To avoid that long wait I am telling it to load just one table, but it still takes up to a minute. Here is the code:
from langchain import OpenAI
from langchain.sql_database import SQLDatabase
from sqlalchemy import create_engine
# already loaded environment vars
llm = OpenAI(temperature=0)
engine = create_engine("postgresql+psycopg2://{user}:{passwd}@{host}:{port}/chatdatabase")
include_tables=['table_1']
db = SQLDatabase(engine, include_tables=include_tables)
...
As stated in the documentation, LangChain uses SQLAlchemy in the background to make connections and load tables. That is why I tried making a connection with pure SQLAlchemy, without using LangChain:
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://{user}:{passwd}@{host}:{port}/chatdatabase")
with engine.connect() as con:
    rs = con.execute('select * from table_1 limit 10')
    for row in rs:
        print(row)
And surprisingly it takes just a few seconds.
Is there any way to make this faster, or documentation to read? (I've searched, but with no luck.)
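For anyone trying to narrow this down, a minimal timing sketch (using the same placeholder connection string as above) separates the raw connection time from the SQLDatabase construction time:

import time

from sqlalchemy import create_engine
from langchain.sql_database import SQLDatabase

engine = create_engine("postgresql+psycopg2://{user}:{passwd}@{host}:{port}/chatdatabase")

# Time opening a raw connection on its own.
start = time.perf_counter()
with engine.connect() as con:
    pass
print(f"raw connect: {time.perf_counter() - start:.2f}s")

# Time constructing SQLDatabase, which also reflects table metadata.
start = time.perf_counter()
db = SQLDatabase(engine, include_tables=['table_1'])
print(f"SQLDatabase init: {time.perf_counter() - start:.2f}s")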
How big is your table, how many rows and what size?
It looks like the main issue is that table_1 is taking a while to import, or there is a performance issue with the SQL Database Agent itself.
Just a note on your code - the second snippet will naturally be faster, as your SELECT query only returns 10 rows, while the first snippet imports a table. Two different operations are being performed.
It might be worth logging some other details here so we can try to replicate.
Thank you for answering.
Table_1 has 8 rows and 2 columns. It is used for tests, as a playground, so the LIMIT won't make a difference either. One question I have about SQLDatabase: when I specify include_tables, will it load only that table? The database actually has 466 tables.
Thanks for the additional detail and it's good to rule out a large dataset/unoptimized DB being the issue.
You are correct - by specifying table_1 in include_tables you're telling it to import only that table and ignore the other 465.
I think it's worth us trying to replicate the issue - I will run some further tests and report back. The documentation doesn't seem to go into much detail beyond what format the connection string needs to be in - https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html - so I wonder if it's to do with how LangChain is using SQLAlchemy; that seems more likely than an issue with the create_engine function.
Yes, it seems more related to LangChain, given that both cases connect to the same database (pure SQLAlchemy vs. LangChain) and the latter takes far more time than the former.
Thank you @Veeeetzzzz :)
I did some further testing with your code to see if I could replicate the timing.
I'm using localhost so I'm now wondering if your remote DB might be a factor? Perhaps some latency issues are at play here?
It's weird that it would run faster when using just SQLAlchemy but I'm not able to replicate the performance issues
I will see if the same happens with a remote DB
Thank you for your time trying to replicate this. It would be great if you could connect to a big database (either in number of tables or in columns per table) and try the same.
I have the same problem: the line db = SQLDatabase.from_uri(XXXX) takes more than 5 minutes to connect to a database with more than 200 tables. That is a lot of time. What could be making this happen?
I initially thought network latency might be at play here, but I can't seem to replicate it - I tested this with a remote PostgreSQL DB on Vercel, 100 rows with data. It took longer, but nothing out of the ordinary.
Can either of you connect to the DB using pgAdmin? It may be worth looking at the SQL DB performance to see if there's any delay on that side.
This is not exactly related to the subject of this conversation, but I can't find a better place to ask. I have my tables in a different schema, not in the public schema. The SQLDatabase agent from LangChain is able to fetch the tables from the public schema but not from the other schema I have created. Does any of you know a solution to this?
Thanks in advance!
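If it helps, a minimal sketch: SQLDatabase accepts a schema argument that scopes reflection to a non-public schema (my_schema below is a placeholder for your schema name):

from langchain.sql_database import SQLDatabase

# "my_schema" is a placeholder for the non-public schema name.
db = SQLDatabase.from_uri(
    "postgresql+psycopg2://{user}:{passwd}@{host}:{port}/chatdatabase",
    schema="my_schema",
    include_tables=['table_1'],
)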
I can confirm this issue. I tested LangChain's from_uri function vs. pure SQLAlchemy; the LangChain function takes about 10 seconds longer to connect to the same database. Could someone take a look?
@caffreyN @dongzeli95 It might be worth opening a new issue - post your code and replication steps as a bug; it might need a PR here.
It's taking a much longer time to connect via the LangChain API as well, compared to the SQLAlchemy and SQLite APIs.
I am using db = SQLDatabase.from_uri(conn_str) to connect to the database. It's taking around 13 minutes to connect. There are 140 tables in the database and each table has many columns. Maybe the size of the database affects the connection time? Did anyone find a solution to this?
I think there are other factors at play - I connect to a local/remote DB from both a local and a hosted environment and my DB has 100+ rows. All operations completed within 5 seconds, but other users said they could only see the delay when LangChain was invoked.
I had the same issue. The LangChain SQLDatabase class has these lines at the end, which inspect every table in the database irrespective of the include_tables list (the reflect call is run only for its side effects; nothing is assigned from it). Commenting these out made the connection take < 0.5 sec.
self._max_string_length = max_string_length
self._metadata = metadata or MetaData()
# including view support if view_support = true
# self._metadata.reflect(
# views=view_support,
# bind=self._engine,
# only=list(self._usable_tables),
# schema=self._schema,
# )
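For context, what that reflect call costs can be reproduced with pure SQLAlchemy - a rough sketch, assuming the placeholder connection string from earlier: reflecting every table issues introspection queries per table over the network, while reflecting a named subset stays fast.

import time

from sqlalchemy import MetaData, create_engine

engine = create_engine("postgresql+psycopg2://{user}:{passwd}@{host}:{port}/chatdatabase")

# Full reflection: introspects all tables in the database.
start = time.perf_counter()
MetaData().reflect(bind=engine)
print(f"full reflect: {time.perf_counter() - start:.2f}s")

# Scoped reflection: introspects only the named table.
start = time.perf_counter()
MetaData().reflect(bind=engine, only=['table_1'])
print(f"scoped reflect: {time.perf_counter() - start:.2f}s")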
Thanks to @samsingh12 for the hint; this must be fixed. I tried the following, which overrides reflect() to be a no-op, and it seems to have no adverse effect on execution:
from sqlalchemy import MetaData

class NoReflectMetadata(MetaData):
    # Overriding reflect() with a no-op skips the table reflection
    # that SQLDatabase performs during initialization.
    def reflect(
        self,
        bind=None,
        schema=None,
        views=False,
        only=None,
        extend_existing=False,
        autoload_replace=True,
        resolve_fks=True,
        **dialect_kwargs,
    ):
        pass
usage:
SQLDatabase.from_uri(db_url, metadata=NoReflectMetadata())
Inside the SQLDatabase class, self._metadata.reflect() loads information about the specified tables and views and stores it in the MetaData object associated with the current SQLAlchemy session. If you comment out this call as mentioned by @samsingh12, you won't be able to use class methods such as db.get_table_info(). So if you want to use these methods, assign self._include_tables to the only parameter in self._metadata.reflect(), as follows:
self._metadata.reflect(
views=view_support,
bind=self._engine,
only=self._include_tables,
schema=self._schema,
)
This way you can use the class methods for the included tables without the delayed response.
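One caveat: if no include_tables were passed, self._include_tables can be empty, so a guard that falls back to the original behaviour is safer - a sketch of the same call with that guard:

# Fall back to full reflection when no include_tables were given;
# otherwise reflect only the included tables.
self._metadata.reflect(
    views=view_support,
    bind=self._engine,
    only=list(self._include_tables) if self._include_tables else None,
    schema=self._schema,
)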
@Veeeetzzzz I have been having the same issue and followed the advice here:
https://stackoverflow.com/questions/76525263/langchains-sqldatabasesequentialchain-to-query-database
However, it seems I cannot query any views
Sharing the way we worked around this issue, in case it helps anyone. As @Ali-Raza12 mentioned, the suggested approach was not complete. In our case we don't know ahead of time which tables we are going to fetch, so instead of disallowing metadata fetches entirely, we deferred their initialization so they are loaded on demand.
from typing import List, Optional

from langchain.sql_database import SQLDatabase as BaseSQLDatabase

class SQLDatabase(BaseSQLDatabase):
    """
    This class takes the LangChain SQLDatabase class and overrides the
    get_table_info method to allow lazy loading of metadata tables.
    """

    def __init__(self, *args, **kwargs):
        super(SQLDatabase, self).__init__(*args, **kwargs)

    def get_table_info(self, table_names: Optional[List[str]] = None) -> str:
        if table_names is None:
            return ""
        # Reflect only the requested tables that haven't been loaded yet.
        loaded_tables = self._metadata.tables
        tables_to_load = []
        for table_name in table_names:
            if table_name not in loaded_tables:
                tables_to_load.append(table_name)
        if len(tables_to_load) > 0:
            self._metadata.reflect(bind=self._engine, only=tables_to_load)
        return super().get_table_info(table_names)
import logging

from sqlalchemy import MetaData

class LazyReflectMetadata(MetaData):
    def __init__(self):
        super().__init__()
        self._initial_reflect = True

    def reflect(
        self,
        bind=None,
        schema=None,
        views=False,
        only=None,
        extend_existing=False,
        autoload_replace=True,
        resolve_fks=True,
        **dialect_kwargs,
    ):
        # Skip only the reflection triggered during SQLDatabase.__init__;
        # every later call reflects normally.
        if self._initial_reflect:
            logging.debug("Ignoring _initial_reflect due to initialization.")
            self._initial_reflect = False
        else:
            logging.debug("Calling reflect with tables=%s", only)
            return super().reflect(
                bind,
                schema,
                views,
                only,
                extend_existing,
                autoload_replace,
                resolve_fks,
                **dialect_kwargs,
            )
usage:
....
db = SQLDatabase(engine, metadata=LazyReflectMetadata(), **kwargs)
The way we initialize LazyReflectMetadata is a little hacky, but it worked for us to fetch the tables on demand, and it sped up the DB initialization.
Hi, @maby200! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
Based on my understanding, the issue you raised is about the connection time to a remote database using the Langchain package. You mentioned that it takes more than 3 minutes, but when you tried using pure SQLAlchemy, it only took a few seconds. In the comments, some users suggested that the issue might be related to the size of the database or latency issues with the remote DB. There have also been suggestions for potential workarounds and solutions to improve the connection time.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!
Hi, it is OK now. The problem did not appear again. Thank you :)
Thank you, @maby200, for closing the issue on the LangChain repository! We appreciate your contribution.
@Veeeetzzzz I have been having the same issue and followed the advice here:
https://stackoverflow.com/questions/76525263/langchains-sqldatabasesequentialchain-to-query-database
However, it seems I cannot query any views
Were you able to query the views?
Hello @NisaarAgharia
I've never had any issues with the initial import taking longer than 5s, and I've never had to modify the init function, so I'm not entirely sure what the issue was here. I'm guessing it will depend on the data in your DB plus other factors like remote DB latency, etc.
Given that the OP's issue resolved itself, I can only assume it was down to the remote DB connection.
Would recommend opening a new issue + posting your code for a review
Edit: it seems some people are having luck with setting lazy_table_reflection to True. The database schema is then only loaded when it's needed, rather than immediately. This can be more efficient, especially for large databases. However, I would err on the side of caution, as I could not replicate this delay on a very small DB. I would expect this to work for larger DBs, but it's worth trying this plus a combination of the other mitigations before considering it a bug.
Hi, it is OK now. The problem did not appear again. Thank you :)
@maby200 Can you please tell how did you fix this? I am also facing similar issue while connecting redshift db cluster to langchain.
As https://stackoverflow.com/questions/76525263/langchains-sqldatabasesequentialchain-to-query-database mentions - but we don't need to comment out the SQLDatabase source code; passing lazy_table_reflection=True achieves the same thing:
from sqlalchemy import create_engine
from langchain_community.utilities import SQLDatabase

engine = create_engine("mysql://{user}:{pass}@{url}/database")
db = SQLDatabase(engine=engine, lazy_table_reflection=True)
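A short usage note, assuming (per the langchain_community behaviour) that reflection is deferred until table info is first requested; table_1 is a placeholder:

# Construction returns quickly: no tables have been reflected yet.
print(db.get_usable_table_names())

# The first call that needs metadata triggers reflection, scoped
# to the requested tables only.
print(db.get_table_info(['table_1']))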