pgcopydb pgcopydb stopped with `[TARGET 0] SSL error: bad length` failure during WAL replay

Hi, I'd like to report a failure seen when moving a large database (RDS pg14 to RDS pg14, client side: pgcopydb v0.17 on Ubuntu 22.04)

Invocation:

pgcopydb clone \
--table-jobs 4 \
--index-jobs 2 \
--restore-jobs 8 \
--split-tables-larger-than "25GB" \
--split-max-parts 4 \
--no-role-passwords \
--no-acl \
--no-tablespaces \
--skip-extensions \
--skip-ext-comments \
--skip-collations \
--skip-large-objects \
--skip-vacuum \
--slot-name migration_sre_storage \
--follow \
--verbose &>> logs.txt

Start:

2024-09-12 05:44:51.253 2090 INFO   main.c:136                Running pgcopydb version 0.17-1.pgdg22.04+1 from "/usr/bin/pgcopydb"

After about 5 hours running, it failed with the following output:

2024-09-12 10:37:11.880 2099 INFO   ld_apply.c:459            Replaying changes from file "/root/.local/share/pgcopydb/0000000100016A4800000002.sql"
2024-09-12 10:37:11.915 2099 ERROR  pgsql.c:2330              [TARGET 0] SSL error: bad length
2024-09-12 10:37:11.915 2099 ERROR  pgsql.c:2330              [TARGET 0] SSL SYSCALL error: EOF detected
2024-09-12 10:37:11.915 2099 ERROR  pgsql.c:4822              Failed to setup replication origin transaction at origin LSN 16A48/A08C730 and origin timestamp "2024-09-12 05:57:35.937263+0000"
2024-09-12 10:37:11.915 2099 ERROR  ld_apply.c:818            Failed to setup apply transaction, see above for details
2024-09-12 10:37:11.915 2099 ERROR  ld_apply.c:525            Failed to apply SQL from file "/root/.local/share/pgcopydb/0000000100016A4800000002.sql", see above for details
2024-09-12 10:37:11.915 2099 INFO   follow.c:824              Apply process has terminated
2024-09-12 10:37:11.999 2096 ERROR  follow.c:1057             Subprocess catchup with pid 2099 has exited with error code 12
2024-09-12 10:37:11.999 2096 NOTICE follow.c:1102             Process catchup has exited unexpectedly, and endpos is unset: terminating other processes
2024-09-12 10:37:11.999 2096 NOTICE follow.c:1151             kill -TERM 2097 (prefetch)
2024-09-12 10:37:11.999 2096 NOTICE follow.c:1151             kill -TERM 2098 (transform)
2024-09-12 10:37:11.999 2098 INFO   follow.c:771              Transform process has terminated

The [TARGET 0] in the first error message is suspicious, since instead of 0 we expect the PID of the target backend. The Postgres logs do not show anything at the time of the error, except FATAL: connection to client lost

The same error occurred two times in two independent runs. It did not occur when testing databases smaller or less active in the same environment.

Sep 26 '24 13:09 dverite

@dverite Could you please try the suggestion from https://github.com/dimitri/pgcopydb/issues/773#issuecomment-2139093365?

i.e Modify the tcp keep alive GUC on the target Postgres?

alter system set tcp_keepalives_count=60
alter system set tcp_keepalives_idle=10
alter system set tcp_keepalives_interval=10

alter system probably won't work in RDS, you may need to tweak it on respective instance's Parameter Group.

Oct 04 '24 03:10 arajkumar

This particular migration cannot be retried, it happened on production environment only.

According to the console output, pgcopydb adds keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60 to the URL of the connection.

Independently of that, the server-side parameters that were active were:

       Parameter        | Value 
-------------------------+-------
 tcp_keepalives_count    | 2
 tcp_keepalives_idle     | 300
 tcp_keepalives_interval | 30

Also we did not have any timeout mentioned in the error output, contrary to the case of issue #773

Oct 29 '24 16:10 dverite

Hi,

I am also facing similar issue when using pgcopydb v 0.17. I am getting SSL error: bad length and SSL SYSCALL error: Success when I execute pgcopydb stream set apply command. I tried to set tcp_keepalives_idle, tcp_keepalives_count, tcp_keepalives_interval to the values as mentioned above but still fails with same error.

My environment

Source: VM OS: RHEL8 PostgreSQL version : PostgreSQL v15.8 community edition SSL enabled

Target: VM OS: RHEL8 PostgreSQL version: PostgreSQL v15.8 community edition SSL enabled.

Please let me know any fix or modifications to be made for fixing this issue.

Dec 02 '24 06:12 ARUNTPGSQL

Hello,

we are also facing this issue when we reach the step replying ongoing changes from the source to the database. I was wondering if this might be related to the pipeline mode introduced by the great changes made by @arajkumar , when pcopydb established TLS connection to the target database.

As others already reported, changing keep alive GUC didn't fix the issue , and we were not able to identify the root cause of the connection being dropped. This seems to happen regardless the Postgres server version.

Did anyone find a work-around to this or a possible fix by any chance?

Feb 19 '25 09:02 alessio-form3

@alessio-form3 Is it possible to share the logs from the target instance during the problematic scenario?

Feb 19 '25 09:02 arajkumar

@alessio-form3 Is it possible to share the logs from the target instance during the problematic scenario?

I am not able to copy/paste the logs directly @arajkumar unfortuantely. If you mean the Postgres server logs, they were reporting "connection reset by peer" entries. On the pgcopydb side, errors were exactly like the ones posted in comments above.

I was wondering if the issue might be due to the libpq library itself? Is the version pin-pointed? Apart from the error itself, how would you consider adding the ability to choose, via configuration, whether to use pipeline mode or not?

Thank you.

Feb 19 '25 11:02 alessio-form3

Hello.

The same issue occurred in my environment. pgcopydb is 0.17, PostgreSQL is 16.4. And as expected, when not in pipeline mode, it does not occur even if it connect with SSL. Since it is an SSL error, I checked libpq's SSL transmission. One thing I found out was that libpq does not retry the SSL_write() function when sending with SSL. According to the openssl documentation SSL_write, if the return value of SSL_get_error() is SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE, it must be called again with the same data. In libpq's message sending function pqPutMsgEnd(PGconn *conn), if not all data has been sent and in non-blocking mode, it just returns, but in the libpq's exported API (e.g. PQsendQueryGuts() called by PQsendQueryParams()), pqPutMsgEnd() is called multiple times, so I think the sent data changes. So in the above situation, it needs to be retried with the same data, but it seems that the error occurs because the send data has changed. As a test, I tried to retry if pqsecure_write() returned 0 in pqSendSome(), and it ran in pipeline mode without errors. pqSendSome() is a function which called from pqPutMsgEnd(PGconn *conn) and pqsecure_write() is called from this. In pqsecure_write() SSL_write() is performed. I have attached a patch for PostgreSQL 16.4.

postgresql-16.4-fe-misc-patch.txt

Apr 27 '25 09:04 htgndbat95

Hi @htgndbat95 ; thanks a lot for your contribution and patch here, that's a lot of good work! Unfortunately we won't be able to push that patch to Postgres from this repository. Please consider sending an email to pgsql-bugs or using the bugs report form on the Postgres website to move forward with your patch for libpq here.

Apr 28 '25 12:04 dimitri

Thank you for checking. I have submitted a bug report to PostgreSQL about this.

Apr 29 '25 16:04 htgndbat95

libpq was fixed in the following commit: 137935b

Aug 20 '25 12:08 htgndbat95