pgcopydb_v0.15: follow process with --resume doesn't terminate after command "pgcopydb stream sentinel set endpos"
Hi @dimitri
First of all, I would like to thank you for your help and patience.
You are right about https://github.com/dimitri/pgcopydb/issues/659. I do not need an additional session to create a snapshot. I looked at the postgres logs on pg11 (source). As I can see, the snapshot is created on pg11 and later deleted (not needed anymore). My mistake. I apologize for that.
One last question about the follow with --resume option.
Test scenarios for --resume:
$ pgcopydb version
08:25:54.529 3101459 INFO Running pgcopydb version 0.15.13.g2b4c98d from "/usr/pgsql-15/bin/pgcopydb"
pgcopydb version 0.15.13.g2b4c98d
compiled with PostgreSQL 15.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), 64-bit
compatible with Postgres 11, 12, 13, 14, 15, and 16
- Started pgcopydb follow
$ pgcopydb clone \
--follow \
--plugin test_decoding \
--dir "$PGCOPYDB_DIR" \
--fail-fast \
--skip-extensions \
--skip-collations \
--split-tables-larger-than 2GB \
--slot-name pgcopydb_pgbench_test_decoding \
--notice \
2>&1 | tee $PGCOPYDB_LOG_DIR/pgcopydb_clone_follow-split-tables-larger-than_start.log
CDC works.
Pressed C-c.
- Try with resume:
$ pgcopydb clone \
--follow \
--plugin test_decoding \
--dir "$PGCOPYDB_DIR" \
--fail-fast \
--skip-extensions \
--skip-collations \
--split-tables-larger-than 2GB \
--slot-name pgcopydb_pgbench_test_decoding \
--resume \
--not-consistent \
--notice \
2>&1 | tee $PGCOPYDB_LOG_DIR/pgcopydb_clone_follow-split-tables-larger-than_start_resume_second_try.log
CDC works.
- Stop replication (Set end position):
$ pgcopydb stream sentinel set endpos --current --dir "$PGCOPYDB_DIR"
08:12:52.880 3099514 INFO Running pgcopydb version 0.15.13.g2b4c98d from "/usr/pgsql-15/bin/pgcopydb"
08:12:52.881 3099514 INFO Using work dir "/transfer/pg11/pgbench"
08:12:52.913 3099514 INFO Fetched endpos 1D/FF1E1718 from source database
08:12:52.913 3099514 INFO pgcopydb sentinel endpos has been set to 1D/FF1E1718
$ pgcopydb stream sentinel get --dir "$PGCOPYDB_DIR"
08:13:21.220 3099732 INFO Running pgcopydb version 0.15.13.g2b4c98d from "/usr/pgsql-15/bin/pgcopydb"
08:13:21.220 3099732 INFO Using work dir "/transfer/pg11/pgbench"
startpos 1D/FF1A62B0
endpos 1D/FF1E1718
apply enabled
write_lsn 1D/FF1A62B0
flush_lsn 1D/FF1A62B0
replay_lsn 1D/FF1A62B0
BUT: the follow process with --resume from point 2. does not terminate!
Output from follow process:
...
2024-01-31 08:15:08.896 3099277 INFO follow.c:216 Current sentinel replay_lsn is 1D/FF1A62B0, endpos is 1D/FF1E1718
2024-01-31 08:15:08.896 3099277 INFO follow.c:359 Restarting logical decoding follower in replay mode
2024-01-31 08:15:08.897 3100710 NOTICE follow.c:894 Starting the prefetch sub-process
2024-01-31 08:15:08.898 3100711 NOTICE follow.c:894 Starting the transform sub-process
2024-01-31 08:15:08.898 3100712 NOTICE follow.c:894 Starting the catchup sub-process
2024-01-31 08:15:08.899 3100710 INFO ld_stream.c:2013 Resuming streaming from latest file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.json"
2024-01-31 08:15:08.899 3100710 INFO ld_stream.c:621 Streaming is setup to end at LSN 1D/FF1E1718
2024-01-31 08:15:08.899 3100710 INFO ld_stream.c:652 Resuming streaming at LSN 1D/FF1A62B0 from JSON file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.json"
2024-01-31 08:15:08.900 3100710 NOTICE ld_stream.c:658 Resume replication from latest message: {"action":"K","lsn":"1D/FF1A62B0","timestamp":"2024-01-31 07:47:18.314058+0000"}
2024-01-31 08:15:08.912 3100711 NOTICE ld_transform.c:212 stream_transform_resume: startpos 1D/FF1A62B0 endpos 1D/FF1E1718 write_lsn 1D/FF1E6E08 flush_lsn 1D/FF1A62B0 replay_lsn 1D/FF1A62B0
2024-01-31 08:15:08.912 3100711 INFO ld_transform.c:234 Transform process is setup to end at LSN 1D/FF1E1718
2024-01-31 08:15:08.917 3100710 NOTICE ld_stream.c:455 Connecting to logical decoding replication stream
2024-01-31 08:15:08.947 3100710 INFO pgsql.c:4511 Reported write_lsn 1D/FF1A62B0, flush_lsn 1D/FF1A62B0, replay_lsn 1D/FF1A62B0
2024-01-31 08:15:09.012 3100711 NOTICE ld_transform.c:292 Transforming from 1D/FF1A62B0 in "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:09.012 3100711 NOTICE ld_transform.c:885 Transforming JSON file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.json" into SQL file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:09.012 3100711 INFO ld_transform.c:1045 Transformed 15 JSON messages into SQL file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:09.014 3100712 INFO ld_apply.c:1245 Setting up previous LSN from replication origin "pgcopydb" progress at 1D/FF137CF0, overriding previous value 1D/FF1A62B0
2024-01-31 08:15:09.014 3100712 INFO ld_apply.c:256 Replaying changes from LSN 1D/FF137CF0 up to endpos LSN 1D/FF1E1718
2024-01-31 08:15:10.424 3100710 INFO pgsql.c:4511 Reported write_lsn 1D/FF1E6EF0, flush_lsn 1D/FF1A62B0, replay_lsn 1D/FF1A62B0
2024-01-31 08:15:10.425 3100710 INFO ld_stream.c:517 Streamed up to write_lsn 1D/FF1E6EF0, flush_lsn 1D/FF1A62B0, stopping: endpos is 1D/FF1E1718
2024-01-31 08:15:10.425 3100710 INFO follow.c:692 Prefetch process has terminated
2024-01-31 08:15:10.425 3100711 NOTICE ld_transform.c:180 Transformed 0 messages and 1 transactions
2024-01-31 08:15:10.425 3100711 INFO follow.c:750 Transform process has terminated
2024-01-31 08:15:10.439 3100712 NOTICE ld_replay.c:129 Replay process stopping
2024-01-31 08:15:10.439 3100712 INFO follow.c:804 Apply process has terminated
2024-01-31 08:15:10.550 3099277 NOTICE follow.c:1056 Subprocess prefetch with pid 3100710 has exited successfully
2024-01-31 08:15:10.550 3099277 NOTICE follow.c:1056 Subprocess transform with pid 3100711 has exited successfully
2024-01-31 08:15:10.550 3099277 NOTICE follow.c:1056 Subprocess catchup with pid 3100712 has exited successfully
2024-01-31 08:15:10.700 3099277 INFO follow.c:645 Subprocesses for prefetch, transform, and catchup have now all exited
2024-01-31 08:15:10.700 3099277 INFO follow.c:216 Current sentinel replay_lsn is 1D/FF1A62B0, endpos is 1D/FF1E1718
2024-01-31 08:15:10.700 3099277 INFO follow.c:416 Catching-up from existing on-disk files
2024-01-31 08:15:10.701 3100732 NOTICE follow.c:894 Starting the catchup sub-process
2024-01-31 08:15:10.716 3100732 INFO ld_apply.c:1245 Setting up previous LSN from replication origin "pgcopydb" progress at 1D/FF137CF0, overriding previous value 1D/FF1A62B0
2024-01-31 08:15:10.716 3100732 INFO ld_apply.c:256 Catchup-up with changes from LSN 1D/FF137CF0 up to endpos LSN 1D/FF1E1718
2024-01-31 08:15:10.716 3100732 INFO ld_apply.c:457 Replaying changes from file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:10.720 3100732 INFO ld_apply.c:139 Apply reached 1D/FF1A62B0 in "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:10.720 3100732 INFO ld_apply.c:157 Reached end of file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql" at 1D/FF1A62B0.
2024-01-31 08:15:10.720 3100732 INFO follow.c:823 Apply process has terminated
2024-01-31 08:15:10.851 3099277 NOTICE follow.c:1056 Subprocess catchup with pid 3100732 has exited successfully
2024-01-31 08:15:11.001 3099277 INFO follow.c:216 Current sentinel replay_lsn is 1D/FF1A62B0, endpos is 1D/FF1E1718
2024-01-31 08:15:11.001 3099277 INFO follow.c:359 Restarting logical decoding follower in catchup mode
2024-01-31 08:15:11.003 3100735 NOTICE follow.c:894 Starting the prefetch sub-process
2024-01-31 08:15:11.003 3100736 NOTICE follow.c:894 Starting the transform sub-process
2024-01-31 08:15:11.004 3100737 NOTICE follow.c:894 Starting the catchup sub-process
2024-01-31 08:15:11.004 3100735 INFO ld_stream.c:2013 Resuming streaming from latest file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.json"
2024-01-31 08:15:11.005 3100735 INFO ld_stream.c:621 Streaming is setup to end at LSN 1D/FF1E1718
2024-01-31 08:15:11.005 3100735 INFO ld_stream.c:652 Resuming streaming at LSN 1D/FF1A62B0 from JSON file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.json"
2024-01-31 08:15:11.005 3100735 NOTICE ld_stream.c:658 Resume replication from latest message: {"action":"K","lsn":"1D/FF1A62B0","timestamp":"2024-01-31 07:47:18.314058+0000"}
2024-01-31 08:15:11.022 3100735 NOTICE ld_stream.c:455 Connecting to logical decoding replication stream
2024-01-31 08:15:11.040 3100735 INFO pgsql.c:4511 Reported write_lsn 1D/FF1A62B0, flush_lsn 1D/FF1A62B0, replay_lsn 1D/FF1A62B0
2024-01-31 08:15:11.120 3100737 INFO ld_apply.c:1245 Setting up previous LSN from replication origin "pgcopydb" progress at 1D/FF137CF0, overriding previous value 1D/FF1A62B0
2024-01-31 08:15:11.120 3100737 INFO ld_apply.c:256 Catchup-up with changes from LSN 1D/FF137CF0 up to endpos LSN 1D/FF1E1718
2024-01-31 08:15:11.120 3100737 INFO ld_apply.c:457 Replaying changes from file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:11.125 3100737 INFO ld_apply.c:139 Apply reached 1D/FF1A62B0 in "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:11.125 3100737 INFO ld_apply.c:157 Reached end of file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql" at 1D/FF1A62B0.
2024-01-31 08:15:11.125 3100737 INFO follow.c:823 Apply process has terminated
...
Pressed C-c in follow session.
2024-01-31 08:15:11.153 3099277 NOTICE follow.c:1056 Subprocess catchup with pid 3100737 has exited successfully
Is this the correct behavior or do I need to complete the migration in a different way? Please advise. When I use only pgcopydb --follow without the --resume option, the follow process ends when the command "pgcopydb stream sentinel set endpos" is executed.
Hi branecc, I actually had the same issue, I reported it here: https://github.com/dimitri/pgcopydb/issues/643 I've had no comments on it, so I'm hoping if someone else reported it, the we might be able to get a fix 👍 The behaviour is the same as what happened in my scenario, and the output is the same.