pgcopydb stopped with `[TARGET 0] SSL error: bad length` failure during WAL replay
Hi, I'd like to report a failure seen when moving a large database (RDS pg14 to RDS pg14, client side: pgcopydb v0.17 on Ubuntu 22.04)
Invocation:
pgcopydb clone \
--table-jobs 4 \
--index-jobs 2 \
--restore-jobs 8 \
--split-tables-larger-than "25GB" \
--split-max-parts 4 \
--no-role-passwords \
--no-acl \
--no-tablespaces \
--skip-extensions \
--skip-ext-comments \
--skip-collations \
--skip-large-objects \
--skip-vacuum \
--slot-name migration_sre_storage \
--follow \
--verbose &>> logs.txt
Start:
2024-09-12 05:44:51.253 2090 INFO main.c:136 Running pgcopydb version 0.17-1.pgdg22.04+1 from "/usr/bin/pgcopydb"
After about 5 hours running, it failed with the following output:
2024-09-12 10:37:11.880 2099 INFO ld_apply.c:459 Replaying changes from file "/root/.local/share/pgcopydb/0000000100016A4800000002.sql"
2024-09-12 10:37:11.915 2099 ERROR pgsql.c:2330 [TARGET 0] SSL error: bad length
2024-09-12 10:37:11.915 2099 ERROR pgsql.c:2330 [TARGET 0] SSL SYSCALL error: EOF detected
2024-09-12 10:37:11.915 2099 ERROR pgsql.c:4822 Failed to setup replication origin transaction at origin LSN 16A48/A08C730 and origin timestamp "2024-09-12 05:57:35.937263+0000"
2024-09-12 10:37:11.915 2099 ERROR ld_apply.c:818 Failed to setup apply transaction, see above for details
2024-09-12 10:37:11.915 2099 ERROR ld_apply.c:525 Failed to apply SQL from file "/root/.local/share/pgcopydb/0000000100016A4800000002.sql", see above for details
2024-09-12 10:37:11.915 2099 INFO follow.c:824 Apply process has terminated
2024-09-12 10:37:11.999 2096 ERROR follow.c:1057 Subprocess catchup with pid 2099 has exited with error code 12
2024-09-12 10:37:11.999 2096 NOTICE follow.c:1102 Process catchup has exited unexpectedly, and endpos is unset: terminating other processes
2024-09-12 10:37:11.999 2096 NOTICE follow.c:1151 kill -TERM 2097 (prefetch)
2024-09-12 10:37:11.999 2096 NOTICE follow.c:1151 kill -TERM 2098 (transform)
2024-09-12 10:37:11.999 2098 INFO follow.c:771 Transform process has terminated
The [TARGET 0] in the first error message is suspicious, since instead of 0 we expect the PID of the target backend.
The Postgres logs do not show anything at the time of the error, except FATAL: connection to client lost
The same error occurred two times in two independent runs. It did not occur when testing databases smaller or less active in the same environment.
@dverite Could you please try the suggestion from https://github.com/dimitri/pgcopydb/issues/773#issuecomment-2139093365?
i.e Modify the tcp keep alive GUC on the target Postgres?
alter system set tcp_keepalives_count=60
alter system set tcp_keepalives_idle=10
alter system set tcp_keepalives_interval=10
alter system probably won't work in RDS, you may need to tweak it on respective instance's Parameter Group.
This particular migration cannot be retried, it happened on production environment only.
According to the console output, pgcopydb adds keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60 to the URL of the connection.
Independently of that, the server-side parameters that were active were:
Parameter | Value
-------------------------+-------
tcp_keepalives_count | 2
tcp_keepalives_idle | 300
tcp_keepalives_interval | 30
Also we did not have any timeout mentioned in the error output, contrary to the case of issue #773
Hi,
I am also facing similar issue when using pgcopydb v 0.17. I am getting SSL error: bad length and SSL SYSCALL error: Success when I execute pgcopydb stream set apply command. I tried to set tcp_keepalives_idle, tcp_keepalives_count, tcp_keepalives_interval to the values as mentioned above but still fails with same error.
My environment
Source: VM OS: RHEL8 PostgreSQL version : PostgreSQL v15.8 community edition SSL enabled
Target: VM OS: RHEL8 PostgreSQL version: PostgreSQL v15.8 community edition SSL enabled.
Please let me know any fix or modifications to be made for fixing this issue.
Hello,
we are also facing this issue when we reach the step replying ongoing changes from the source to the database.
I was wondering if this might be related to the pipeline mode introduced by the great changes made by @arajkumar , when pcopydb established TLS connection to the target database.
As others already reported, changing keep alive GUC didn't fix the issue , and we were not able to identify the root cause of the connection being dropped. This seems to happen regardless the Postgres server version.
Did anyone find a work-around to this or a possible fix by any chance?
@alessio-form3 Is it possible to share the logs from the target instance during the problematic scenario?
@alessio-form3 Is it possible to share the logs from the target instance during the problematic scenario?
I am not able to copy/paste the logs directly @arajkumar unfortuantely.
If you mean the Postgres server logs, they were reporting "connection reset by peer" entries. On the pgcopydb side, errors were exactly like the ones posted in comments above.
I was wondering if the issue might be due to the libpq library itself? Is the version pin-pointed?
Apart from the error itself, how would you consider adding the ability to choose, via configuration, whether to use pipeline mode or not?
Thank you.
Hello.
The same issue occurred in my environment. pgcopydb is 0.17, PostgreSQL is 16.4. And as expected, when not in pipeline mode, it does not occur even if it connect with SSL.
Since it is an SSL error, I checked libpq's SSL transmission. One thing I found out was that libpq does not retry the SSL_write() function when sending with SSL.
According to the openssl documentation SSL_write, if the return value of SSL_get_error() is SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE, it must be called again with the same data.
In libpq's message sending function pqPutMsgEnd(PGconn *conn), if not all data has been sent and in non-blocking mode, it just returns, but in the libpq's exported API (e.g. PQsendQueryGuts() called by PQsendQueryParams()), pqPutMsgEnd() is called multiple times, so I think the sent data changes.
So in the above situation, it needs to be retried with the same data, but it seems that the error occurs because the send data has changed.
As a test, I tried to retry if pqsecure_write() returned 0 in pqSendSome(), and it ran in pipeline mode without errors. pqSendSome() is a function which called from pqPutMsgEnd(PGconn *conn) and pqsecure_write() is called from this. In pqsecure_write() SSL_write() is performed.
I have attached a patch for PostgreSQL 16.4.
Hi @htgndbat95 ; thanks a lot for your contribution and patch here, that's a lot of good work! Unfortunately we won't be able to push that patch to Postgres from this repository. Please consider sending an email to pgsql-bugs or using the bugs report form on the Postgres website to move forward with your patch for libpq here.
Thank you for checking. I have submitted a bug report to PostgreSQL about this.
libpq was fixed in the following commit: 137935b