langgraph-checkpoint-postgres (psycopg.OperationalError: sending query and params failed: SSL error: bad length) encountered across multiple versions

Open shubhamnegikellton opened this issue 9 months ago • 10 comments

Checked other resources

  • [x] This is a bug, not a usage question. For questions, please use GitHub Discussions.
  • [x] I added a clear and detailed title that summarizes the issue.
  • [x] I read what a minimal reproducible example is (https://stackoverflow.com/help/minimal-reproducible-example).
  • [x] I included a self-contained, minimal example that demonstrates the issue INCLUDING all the relevant imports. The code runs AS IS to reproduce the issue.

Example Code

from psycopg_pool import AsyncConnectionPool
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from langgraph.prebuilt import create_react_agent

# conninfo, llm, build_tools, _modify_messages, search_params and
# conversation_id are defined elsewhere in the application.
connection_kwargs = {"autocommit": True, "prepare_threshold": 0}


async def chat():
    async with AsyncConnectionPool(conninfo=conninfo, max_size=20, kwargs=connection_kwargs) as pool:
        graph = create_react_agent(
            llm,
            build_tools,
            messages_modifier=_modify_messages,
            checkpointer=AsyncPostgresSaver(pool),  # type:ignore[arg-type]
        )

        # The error is raised while iterating over the event stream.
        async for event in graph.astream_events(
            {"messages": [("human", search_params.question)]},
            config={"configurable": {"thread_id": conversation_id, "recursion_limit": 20}},
            stream_mode="values",
            version="v2",
        ):
            ...  # event handling omitted in the original report

Error Message and Stack Trace (if applicable)

INSERT INTO checkpoints ( thread_id, checkpoint_ns, checkpoint_id, parent_checkpoint_id, checkpoint, metadata ) 
VALUES ( ? ) ON CONFLICT ( thread_id, checkpoint_ns, checkpoint_id ) DO 
UPDATE SET checkpoint = EXCLUDED.checkpoint, metadata = EXCLUDED.metadata

psycopg.OperationalError: sending query and params failed: SSL error: bad length
 File "/app/app/search/ai_models.py", line 315, in chat
    async for event in graph.astream_events(
  File "/app/venv/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 1386, in astream_events
    async for event in event_stream:
  File "/app/venv/lib/python3.12/site-packages/langchain_core/tracers/event_stream.py", line 1012, in _astream_events_implementation_v2
    await task
  File "/app/venv/lib/python3.12/site-packages/langchain_core/tracers/event_stream.py", line 967, in consume_astream
    async for _ in event_streamer.tap_output_aiter(run_id, stream):
  File "/app/venv/lib/python3.12/site-packages/langchain_core/tracers/event_stream.py", line 203, in tap_output_aiter
    async for chunk in output:
  File "/app/venv/lib/python3.12/site-packages/langgraph/pregel/__init__.py", line 1832, in astream
    async with AsyncPregelLoop(
               ^^^^^^^^^^^^^^^^
  File "/app/venv/lib/python3.12/site-packages/langgraph/pregel/loop.py", line 1035, in __aexit__
    return await asyncio.shield(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/contextlib.py", line 754, in __aexit__
    raise exc_details[1]
  File "/usr/local/lib/python3.12/contextlib.py", line 737, in __aexit__
    cb_suppress = await cb(*exc_details)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/venv/lib/python3.12/site-packages/langgraph/pregel/executor.py", line 200, in __aexit__
    raise exc
  File "/app/venv/lib/python3.12/site-packages/langgraph/pregel/loop.py", line 957, in _checkpointer_put_after_previous
    await cast(BaseCheckpointSaver, self.checkpointer).aput(
  File "/app/venv/lib/python3.12/site-packages/langgraph/checkpoint/postgres/aio.py", line 270, in aput
    await cur.execute(
  File "/app/venv/lib/python3.12/site-packages/ddtrace/contrib/dbapi_async.py", line 136, in execute
    return await self._trace_method(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/venv/lib/python3.12/site-packages/ddtrace/contrib/dbapi_async.py", line 105, in _trace_method
    return await method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/venv/lib/python3.12/site-packages/psycopg/cursor_async.py", line 97, in execute
    raise ex.with_traceback(None)
psycopg.OperationalError: sending query and params failed: SSL error: bad length
SSL SYSCALL error: EOF detected

Description

I hit the following error with langgraph-checkpoint-postgres:

psycopg.OperationalError: sending query and params failed: SSL error: bad length
SSL SYSCALL error: EOF detected

NOTE: I have tried multiple versions of langgraph-checkpoint-postgres, i.e. 2.0.9, 2.0.11, 2.0.13 and 2.0.15.

System Info

langchain = "0.3.11"
langchain-community = "0.3.11"
langchain-experimental = "0.3.3"
langchain-openai = "0.2.12"
langchain-postgres = "0.0.12"
langgraph = "0.2.58"
psycopg = { extras = ["binary"], version = "3.2.3" }
psycopg-pool = "3.2.3"
sqlalchemy = { version = "2.0.36", extras = ["asyncio"] }
sqlmodel = "0.0.22"
asyncpg = "0.30.0"
langgraph-checkpoint-postgres = "2.0.15"

shubhamnegikellton avatar Mar 06 '25 12:03 shubhamnegikellton

I have the exact same issue.

Sometimes, and I can't figure out exactly when, so I can't reproduce it reliably, the following is logged:

2025-03-06 14:00:53 | W |              psycopg | error ignored terminating <psycopg.AsyncPipeline [BAD] at 0xffff482543e0>: the connection is lost
2025-03-06 14:00:53 | W |         psycopg.pool | discarding closed connection: <psycopg.AsyncConnection [BAD] at 0xffff6408b080>

And then there is a failure:

SSL error: bad length
SSL SYSCALL error: EOF detected
Traceback (most recent call last):
  File "/app/dojo/common/asyncio.py", line 39, in wrap_event_iterator_with_keep_alive
    event = event_task.result()
            ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/langchain_core/runnables/base.py", line 1389, in astream_events
    async for event in event_stream:
  File "/usr/local/lib/python3.12/site-packages/langchain_core/tracers/event_stream.py", line 1012, in _astream_events_implementation_v2
    await task
  File "/usr/local/lib/python3.12/site-packages/langchain_core/tracers/event_stream.py", line 967, in consume_astream
    async for _ in event_streamer.tap_output_aiter(run_id, stream):
  File "/usr/local/lib/python3.12/site-packages/langchain_core/tracers/event_stream.py", line 203, in tap_output_aiter
    async for chunk in output:
  File "/usr/local/lib/python3.12/site-packages/langgraph/pregel/__init__.py", line 2227, in astream
    async with AsyncPregelLoop(
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/langgraph/pregel/loop.py", line 1109, in __aexit__
    return await exit_task
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/contextlib.py", line 754, in __aexit__
    raise exc_details[1]
  File "/usr/local/lib/python3.12/contextlib.py", line 737, in __aexit__
    cb_suppress = await cb(*exc_details)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/langgraph/pregel/executor.py", line 206, in __aexit__
    raise exc
  File "/usr/local/lib/python3.12/site-packages/langgraph/pregel/loop.py", line 1023, in _checkpointer_put_after_previous
    await prev
  File "/usr/local/lib/python3.12/site-packages/langgraph/pregel/loop.py", line 1025, in _checkpointer_put_after_previous
    await cast(BaseCheckpointSaver, self.checkpointer).aput(
  File "/usr/local/lib/python3.12/site-packages/langgraph/checkpoint/postgres/aio.py", line 261, in aput
    await cur.executemany(
  File "/usr/local/lib/python3.12/site-packages/psycopg/cursor_async.py", line 132, in executemany
    raise ex.with_traceback(None)
psycopg.OperationalError: sending prepared query failed: SSL error: bad length
SSL SYSCALL error: EOF detected

antonioalegria avatar Mar 06 '25 14:03 antonioalegria

Any chance you might be running out of disk space? I saw this: https://stackoverflow.com/questions/63028407/psycopg2-databaseerror-ssl-error-bad-length. Not 100% sure it's relevant.

vbarda avatar Mar 06 '25 18:03 vbarda

At least on my side I don't think that's it. This only happens sometimes, not on all requests...

The log messages below always precede the psycopg.OperationalError. Those two messages appear first; then the next request that comes in blows up like this partway through processing, while it is storing the checkpoint.

2025-03-06 14:00:53 | W |              psycopg | error ignored terminating <psycopg.AsyncPipeline [BAD] at 0xffff482543e0>: the connection is lost
2025-03-06 14:00:53 | W |         psycopg.pool | discarding closed connection: <psycopg.AsyncConnection [BAD] at 0xffff6408b080>

If we could at least catch this error, we could mitigate it. As it stands, some requests fail because it blows up in the middle of processing, with no way to deal with it.
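
One possible mitigation for the "discarding closed connection" warnings quoted above, as a sketch only: psycopg_pool 3.2 and later accept a `check` callback that the pool runs each time it hands out a connection, so a connection that died while idle can be discarded before a checkpoint write uses it. The helper names `pool_options` and `make_pool` and the `max_lifetime` value are assumptions made for illustration, and nothing in this thread confirms that this resolves the error.

```python
from typing import Any, Callable, Dict, Optional


def pool_options(check: Optional[Callable[..., Any]] = None) -> Dict[str, Any]:
    """Keyword arguments for AsyncConnectionPool (values are illustrative)."""
    options: Dict[str, Any] = {
        "max_size": 20,
        # Same per-connection settings as in the original report.
        "kwargs": {"autocommit": True, "prepare_threshold": 0},
        # Recycle connections after 30 minutes instead of the default hour.
        "max_lifetime": 1800.0,
        # Do not open in the constructor; use `async with pool:` instead.
        "open": False,
    }
    if check is not None:
        # Called each time the pool hands a connection to a client.
        options["check"] = check
    return options


def make_pool(conninfo: str):
    # Imported lazily so pool_options() works without psycopg_pool installed.
    from psycopg_pool import AsyncConnectionPool

    return AsyncConnectionPool(
        conninfo, **pool_options(check=AsyncConnectionPool.check_connection)
    )
```

The returned pool would replace the `AsyncConnectionPool(...)` call in the original example, still entered with `async with`. The check adds a round trip every time a connection is taken from the pool, which is the trade-off for catching dead connections early.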

antonioalegria avatar Mar 06 '25 21:03 antonioalegria

@antonioalegria any chance you have really large messages in the ReAct agent? Potentially tool messages with large content?

vbarda avatar Mar 06 '25 23:03 vbarda

Yes, there is that chance, definitely.

antonioalegria avatar Mar 07 '25 08:03 antonioalegria

Yes, there is that chance, definitely.

I'm also exploring this idea.. great chat btw

cloudlessdreams avatar Mar 07 '25 11:03 cloudlessdreams

Do you need any more info?

antonioalegria avatar Mar 14 '25 14:03 antonioalegria

Are there updates on this issue? Thank you so much!

antonioalegria avatar Mar 27 '25 09:03 antonioalegria

We're occasionally facing the same issue and haven't been able to reproduce it locally. Last time the error occurred, AWS RDS monitoring showed that a single database connection dropped at the same moment. We're not sure if that's the cause or just a side effect, but thought it might be useful context.

There were 15GB of free disk space at the time. It happened on the first HumanMessage with the content "hi", so it's unlikely to be related to message size.

langgraph: 0.2.67
langgraph-checkpoint: 2.0.24
langgraph-checkpoint-postgres: 2.0.13
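
If the dropped connection on RDS is the cause rather than a side effect, one thing to try is enabling TCP keepalives on the client side. The sketch below appends libpq's `keepalives*` connection parameters to a key=value connection string. The function name `with_keepalives` and the timing values are assumptions, it does not handle URI-style connection strings, and it is untested against this issue.

```python
def with_keepalives(
    conninfo: str, idle: int = 30, interval: int = 10, count: int = 5
) -> str:
    """Append libpq TCP keepalive settings to a key=value conninfo string."""
    params = {
        "keepalives": 1,                  # enable client-side TCP keepalives
        "keepalives_idle": idle,          # seconds of inactivity before the first probe
        "keepalives_interval": interval,  # seconds between unanswered probes
        "keepalives_count": count,        # lost probes before the connection is considered dead
    }
    extra = " ".join(f"{key}={value}" for key, value in params.items())
    return f"{conninfo} {extra}".strip()
```

The result can be passed as `conninfo` to the connection pool in place of the original string.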

nadavperetz avatar Apr 10 '25 11:04 nadavperetz

That's quite interesting @nadavperetz .. thanks for clarifying the context size issue.

cloudlessdreams avatar Apr 10 '25 13:04 cloudlessdreams

Same error here, any ideas on causes or solutions?

jdg9vr avatar May 16 '25 14:05 jdg9vr

Did you try updating langgraph-checkpoint-postgres to the latest version?

Louis-Melliorat avatar May 21 '25 14:05 Louis-Melliorat

Any update or has anybody else been able to resolve this? Seeing this issue with the latest version as well.

theory2 avatar May 22 '25 10:05 theory2

Any update on this? I am experiencing the same issue. I will note that I recently started using a data lookup in a prompt that pulls in a large amount of text (100,000+ characters). I don't know if it's related, but the issue is intermittent. I get the same two errors that others have mentioned:

OperationalError sending query and params failed: SSL error: bad length
SSL SYSCALL error: EOF detected

Logged error:
error ignored terminating <psycopg.Pipeline [BAD] at 0x1f430b99d00>: the connection is lost
discarding closed connection: <psycopg.Connection [BAD] at 0x1f42668ae70>

TheTreeHacker avatar May 22 '25 18:05 TheTreeHacker

Additionally, if I remove the large prompt the issue seems to go away. Is there a limit on message size in the checkpoint system? Environment: langgraph==0.3.30 psycopg[binary,pool]==3.2.9 langgraph-checkpoint-postgres == 2.0.21
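
To check whether failures correlate with payload size, it may help to log how large the messages are before each run. This is a rough sketch with hypothetical helper names (`utf8_size`, `largest_messages`); it measures raw UTF-8 text size only and does not reproduce how the checkpointer actually serializes state, so treat the numbers as a lower bound.

```python
from typing import Iterable, List, Tuple


def utf8_size(text: str) -> int:
    """Size of the text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def largest_messages(
    messages: Iterable[Tuple[str, str]], top: int = 3
) -> List[Tuple[str, int]]:
    """Return the `top` largest (role, size_in_bytes) pairs, biggest first."""
    sized = [(role, utf8_size(content)) for role, content in messages]
    return sorted(sized, key=lambda item: item[1], reverse=True)[:top]
```

Logging the output of `largest_messages` next to the request ID would show whether the failing requests are the ones carrying the 100,000+ character prompt.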

TheTreeHacker avatar May 22 '25 18:05 TheTreeHacker

Additionally, if I remove the large prompt the issue seems to go away. Is there a limit on message size in the checkpoint system? Environment: langgraph==0.3.30 psycopg[binary,pool]==3.2.9 langgraph-checkpoint-postgres == 2.0.21

In my case, there is a large amount of data stored in the state and not so much in the prompt/messages. The confusing part is that the issue is intermittent. I'm still testing to confirm, but it seems less likely to occur when checkpoint.setup doesn't need to set up new tables.

theory2 avatar May 23 '25 06:05 theory2

@theory2 @TheTreeHacker is this occurring for you both when using a cloud-hosted postgres database? What about a local postgres database with docker? I found this was occurring using cloud databases (Azure and AWS), but not for local dbs.

jdg9vr avatar May 27 '25 18:05 jdg9vr

@jdg9vr Yes I am using a postgres db v15.12 in Azure. I haven't used a local version

TheTreeHacker avatar May 27 '25 18:05 TheTreeHacker

@theory2 @TheTreeHacker is this occurring for you both when using a cloud-hosted postgres database? What about a local postgres database with docker? I found this was occurring using cloud databases (Azure and AWS), but not for local dbs.

Yes, I'm seeing this with Postgres on RDS. Haven't tested local.

theory2 avatar May 27 '25 20:05 theory2

same issue

kashyap-aditya avatar May 30 '25 17:05 kashyap-aditya

Cloud database for me as well.

antonioalegria avatar Jun 04 '25 10:06 antonioalegria

This SSL error could well be masking other underlying issues, which would explain why it isn't necessarily raised locally. The SSL connection failure appears to be a symptom rather than the root cause.

cloudlessdreams avatar Jun 04 '25 10:06 cloudlessdreams

@vbarda Is anyone looking into this at Langchain?

TheTreeHacker avatar Jun 04 '25 11:06 TheTreeHacker

I am also experiencing this issue

savvaki avatar Jun 05 '25 08:06 savvaki

At a minimum, we need a workaround, or a way to recover from these errors without failing the whole request.
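
Until there is a fix, one stopgap is to retry the failing call at the application level. The sketch below is a generic async retry helper with exponential backoff; `retry_async` and its parameters are hypothetical names, not part of LangGraph or psycopg, and retrying re-runs whatever the wrapped call does, so events that were already streamed may be emitted again.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_async(
    make_call: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await make_call(), retrying when one of `retry_on` is raised."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff with jitter before the next attempt.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable")
```

In the setup from the original report, `make_call` would be a function that runs the whole `astream_events` loop and `retry_on` would be `(psycopg.OperationalError,)`. This does not address the root cause; it only keeps a single dropped connection from failing the request outright.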

antonioalegria avatar Jun 05 '25 14:06 antonioalegria

I'm experiencing the same issue in the cloud.

Joao-Tiago-Almeida avatar Jun 05 '25 15:06 Joao-Tiago-Almeida

I'm getting the same error. Any help would be greatly appreciated.

Prageethcs avatar Jun 17 '25 05:06 Prageethcs

Would love to get some update on this from the team - if it's being addressed, if there is a workaround, etc. Thank you!

antonioalegria avatar Jun 20 '25 11:06 antonioalegria

@antonioalegria I second this^

TheTreeHacker avatar Jun 20 '25 12:06 TheTreeHacker

Seconding this too!

Leo310 avatar Jun 23 '25 08:06 Leo310