pgcopydb icon indicating copy to clipboard operation
pgcopydb copied to clipboard

pgcopydb_v0.15: follow process with --resume doesn't terminate after command "pgcopydb stream sentinel set endpos"

Open branecc opened this issue 1 year ago • 1 comments

Hi @dimitri

First of all, I would like to thank you for your help and patience.

You are right about https://github.com/dimitri/pgcopydb/issues/659. I do not need an additional session to create a snapshot. I looked at the postgres logs on pg11 (source). As I can see, the snapshot is created on pg11 and later deleted (not needed anymore). My mistake. I apologize for that.

One last question about the follow with --resume option.

Test scenarios for --resume:

$ pgcopydb version
08:25:54.529 3101459 INFO   Running pgcopydb version 0.15.13.g2b4c98d from "/usr/pgsql-15/bin/pgcopydb"
pgcopydb version 0.15.13.g2b4c98d
compiled with PostgreSQL 15.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), 64-bit
compatible with Postgres 11, 12, 13, 14, 15, and 16
  1. Started pgcopydb follow
$ pgcopydb clone \
   --follow \
   --plugin test_decoding \
   --dir "$PGCOPYDB_DIR" \
   --fail-fast \
   --skip-extensions \
   --skip-collations \
   --split-tables-larger-than 2GB \
   --slot-name pgcopydb_pgbench_test_decoding \
   --notice \
   2>&1 | tee $PGCOPYDB_LOG_DIR/pgcopydb_clone_follow-split-tables-larger-than_start.log

CDC works.

Pressed C-c.

  1. Try with resume:
$ pgcopydb clone \
   --follow \
   --plugin test_decoding \
   --dir "$PGCOPYDB_DIR" \
   --fail-fast \
   --skip-extensions \
   --skip-collations \
   --split-tables-larger-than 2GB \
   --slot-name pgcopydb_pgbench_test_decoding \
   --resume \
   --not-consistent \
   --notice \
   2>&1 | tee $PGCOPYDB_LOG_DIR/pgcopydb_clone_follow-split-tables-larger-than_start_resume_second_try.log

CDC works.

  1. Stop replication (Set end position):
$ pgcopydb stream sentinel set endpos --current --dir "$PGCOPYDB_DIR"
08:12:52.880 3099514 INFO   Running pgcopydb version 0.15.13.g2b4c98d from "/usr/pgsql-15/bin/pgcopydb"
08:12:52.881 3099514 INFO   Using work dir "/transfer/pg11/pgbench"
08:12:52.913 3099514 INFO   Fetched endpos 1D/FF1E1718 from source database
08:12:52.913 3099514 INFO   pgcopydb sentinel endpos has been set to 1D/FF1E1718

$ pgcopydb stream sentinel get --dir "$PGCOPYDB_DIR"
08:13:21.220 3099732 INFO   Running pgcopydb version 0.15.13.g2b4c98d from "/usr/pgsql-15/bin/pgcopydb"
08:13:21.220 3099732 INFO   Using work dir "/transfer/pg11/pgbench"
startpos   1D/FF1A62B0
endpos     1D/FF1E1718
apply      enabled
write_lsn  1D/FF1A62B0
flush_lsn  1D/FF1A62B0
replay_lsn 1D/FF1A62B0

BUT: the follow process with --resume from point 2. does not terminate!

Output from follow process:

...
2024-01-31 08:15:08.896 3099277 INFO   follow.c:216              Current sentinel replay_lsn is 1D/FF1A62B0, endpos is 1D/FF1E1718
2024-01-31 08:15:08.896 3099277 INFO   follow.c:359              Restarting logical decoding follower in replay mode
2024-01-31 08:15:08.897 3100710 NOTICE follow.c:894              Starting the prefetch sub-process
2024-01-31 08:15:08.898 3100711 NOTICE follow.c:894              Starting the transform sub-process
2024-01-31 08:15:08.898 3100712 NOTICE follow.c:894              Starting the catchup sub-process
2024-01-31 08:15:08.899 3100710 INFO   ld_stream.c:2013          Resuming streaming from latest file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.json"
2024-01-31 08:15:08.899 3100710 INFO   ld_stream.c:621           Streaming is setup to end at LSN 1D/FF1E1718
2024-01-31 08:15:08.899 3100710 INFO   ld_stream.c:652           Resuming streaming at LSN 1D/FF1A62B0 from JSON file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.json"
2024-01-31 08:15:08.900 3100710 NOTICE ld_stream.c:658           Resume replication from latest message: {"action":"K","lsn":"1D/FF1A62B0","timestamp":"2024-01-31 07:47:18.314058+0000"}
2024-01-31 08:15:08.912 3100711 NOTICE ld_transform.c:212        stream_transform_resume: startpos 1D/FF1A62B0 endpos 1D/FF1E1718 write_lsn 1D/FF1E6E08 flush_lsn 1D/FF1A62B0 replay_lsn 1D/FF1A62B0
2024-01-31 08:15:08.912 3100711 INFO   ld_transform.c:234        Transform process is setup to end at LSN 1D/FF1E1718
2024-01-31 08:15:08.917 3100710 NOTICE ld_stream.c:455           Connecting to logical decoding replication stream
2024-01-31 08:15:08.947 3100710 INFO   pgsql.c:4511              Reported write_lsn 1D/FF1A62B0, flush_lsn 1D/FF1A62B0, replay_lsn 1D/FF1A62B0
2024-01-31 08:15:09.012 3100711 NOTICE ld_transform.c:292        Transforming from 1D/FF1A62B0 in "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:09.012 3100711 NOTICE ld_transform.c:885        Transforming JSON file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.json" into SQL file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:09.012 3100711 INFO   ld_transform.c:1045       Transformed 15 JSON messages into SQL file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:09.014 3100712 INFO   ld_apply.c:1245           Setting up previous LSN from replication origin "pgcopydb" progress at 1D/FF137CF0, overriding previous value 1D/FF1A62B0
2024-01-31 08:15:09.014 3100712 INFO   ld_apply.c:256            Replaying changes from LSN 1D/FF137CF0 up to endpos LSN 1D/FF1E1718
2024-01-31 08:15:10.424 3100710 INFO   pgsql.c:4511              Reported write_lsn 1D/FF1E6EF0, flush_lsn 1D/FF1A62B0, replay_lsn 1D/FF1A62B0
2024-01-31 08:15:10.425 3100710 INFO   ld_stream.c:517           Streamed up to write_lsn 1D/FF1E6EF0, flush_lsn 1D/FF1A62B0, stopping: endpos is 1D/FF1E1718
2024-01-31 08:15:10.425 3100710 INFO   follow.c:692              Prefetch process has terminated
2024-01-31 08:15:10.425 3100711 NOTICE ld_transform.c:180        Transformed 0 messages and 1 transactions
2024-01-31 08:15:10.425 3100711 INFO   follow.c:750              Transform process has terminated
2024-01-31 08:15:10.439 3100712 NOTICE ld_replay.c:129           Replay process stopping
2024-01-31 08:15:10.439 3100712 INFO   follow.c:804              Apply process has terminated
2024-01-31 08:15:10.550 3099277 NOTICE follow.c:1056             Subprocess prefetch with pid 3100710 has exited successfully
2024-01-31 08:15:10.550 3099277 NOTICE follow.c:1056             Subprocess transform with pid 3100711 has exited successfully
2024-01-31 08:15:10.550 3099277 NOTICE follow.c:1056             Subprocess catchup with pid 3100712 has exited successfully
2024-01-31 08:15:10.700 3099277 INFO   follow.c:645              Subprocesses for prefetch, transform, and catchup have now all exited
2024-01-31 08:15:10.700 3099277 INFO   follow.c:216              Current sentinel replay_lsn is 1D/FF1A62B0, endpos is 1D/FF1E1718
2024-01-31 08:15:10.700 3099277 INFO   follow.c:416              Catching-up from existing on-disk files
2024-01-31 08:15:10.701 3100732 NOTICE follow.c:894              Starting the catchup sub-process
2024-01-31 08:15:10.716 3100732 INFO   ld_apply.c:1245           Setting up previous LSN from replication origin "pgcopydb" progress at 1D/FF137CF0, overriding previous value 1D/FF1A62B0
2024-01-31 08:15:10.716 3100732 INFO   ld_apply.c:256            Catchup-up with changes from LSN 1D/FF137CF0 up to endpos LSN 1D/FF1E1718
2024-01-31 08:15:10.716 3100732 INFO   ld_apply.c:457            Replaying changes from file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:10.720 3100732 INFO   ld_apply.c:139            Apply reached 1D/FF1A62B0 in "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:10.720 3100732 INFO   ld_apply.c:157            Reached end of file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql" at 1D/FF1A62B0.
2024-01-31 08:15:10.720 3100732 INFO   follow.c:823              Apply process has terminated
2024-01-31 08:15:10.851 3099277 NOTICE follow.c:1056             Subprocess catchup with pid 3100732 has exited successfully
2024-01-31 08:15:11.001 3099277 INFO   follow.c:216              Current sentinel replay_lsn is 1D/FF1A62B0, endpos is 1D/FF1E1718
2024-01-31 08:15:11.001 3099277 INFO   follow.c:359              Restarting logical decoding follower in catchup mode
2024-01-31 08:15:11.003 3100735 NOTICE follow.c:894              Starting the prefetch sub-process
2024-01-31 08:15:11.003 3100736 NOTICE follow.c:894              Starting the transform sub-process
2024-01-31 08:15:11.004 3100737 NOTICE follow.c:894              Starting the catchup sub-process
2024-01-31 08:15:11.004 3100735 INFO   ld_stream.c:2013          Resuming streaming from latest file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.json"
2024-01-31 08:15:11.005 3100735 INFO   ld_stream.c:621           Streaming is setup to end at LSN 1D/FF1E1718
2024-01-31 08:15:11.005 3100735 INFO   ld_stream.c:652           Resuming streaming at LSN 1D/FF1A62B0 from JSON file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.json"
2024-01-31 08:15:11.005 3100735 NOTICE ld_stream.c:658           Resume replication from latest message: {"action":"K","lsn":"1D/FF1A62B0","timestamp":"2024-01-31 07:47:18.314058+0000"}
2024-01-31 08:15:11.022 3100735 NOTICE ld_stream.c:455           Connecting to logical decoding replication stream
2024-01-31 08:15:11.040 3100735 INFO   pgsql.c:4511              Reported write_lsn 1D/FF1A62B0, flush_lsn 1D/FF1A62B0, replay_lsn 1D/FF1A62B0
2024-01-31 08:15:11.120 3100737 INFO   ld_apply.c:1245           Setting up previous LSN from replication origin "pgcopydb" progress at 1D/FF137CF0, overriding previous value 1D/FF1A62B0
2024-01-31 08:15:11.120 3100737 INFO   ld_apply.c:256            Catchup-up with changes from LSN 1D/FF137CF0 up to endpos LSN 1D/FF1E1718
2024-01-31 08:15:11.120 3100737 INFO   ld_apply.c:457            Replaying changes from file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:11.125 3100737 INFO   ld_apply.c:139            Apply reached 1D/FF1A62B0 in "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql"
2024-01-31 08:15:11.125 3100737 INFO   ld_apply.c:157            Reached end of file "/transfer/pg11/pgbench/cdc/000000030000001D000000FF.sql" at 1D/FF1A62B0.
2024-01-31 08:15:11.125 3100737 INFO   follow.c:823              Apply process has terminated
...

Pressed C-c in follow session.

2024-01-31 08:15:11.153 3099277 NOTICE follow.c:1056 Subprocess catchup with pid 3100737 has exited successfully

Is this the correct behavior or do I need to complete the migration in a different way? Please advise. When I use only pgcopydb --follow without the --resume option, the follow process ends when the command "pgcopydb stream sentinel set endpos" is executed.

branecc avatar Feb 01 '24 09:02 branecc

Hi branecc, I actually had the same issue, I reported it here: https://github.com/dimitri/pgcopydb/issues/643 I've had no comments on it, so I'm hoping if someone else reported it, the we might be able to get a fix 👍 The behaviour is the same as what happened in my scenario, and the output is the same.

andres-chavez-bi avatar Feb 01 '24 10:02 andres-chavez-bi