pgcopydb icon indicating copy to clipboard operation
pgcopydb copied to clipboard

pgcopydb v0.14: ERRORS with pgcopydb clone --follow --plugin test_decoding

Open branecc opened this issue 2 years ago • 3 comments

Hi, great job!

But... I get ERRORS when testing pgcopydb clone --follow --plugin test_decoding By the way: pgcopydb clone --follow --plugin wal2json WORKS

Environment:

Source pg11 (database pgbench, used pgbench tool) Target pg15

$ pgcopydb --version
10:53:56 520220 INFO   Running pgcopydb version 0.14 from "/usr/pgsql-15/bin/pgcopydb"
pgcopydb version 0.14
compiled with PostgreSQL 15.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), 64-bit
compatible with Postgres 11, 12, 13, 14, 15, and 16

pgcopydb compiled from source https://github.com/dimitri/pgcopydb/archive/refs/tags/v0.14.tar.gz

Executed commands:

$ export PGCOPYDB_SOURCE_PGURI="postgres://repuser@drtest-pg1/pgbench"
$ export PGCOPYDB_TARGET_PGURI="postgres://repuser@pg4-test/pgbench"
$ export PGCOPYDB_DIR=/transfer/pg11/pgbench
$ export PGCOPYDB_LOG_DIR=/transfer/pg11/logs

$ pgcopydb clone \
   --follow \
   --dir "$PGCOPYDB_DIR" \
   --fail-fast \
   --slot-name pgcopydb_pgbench_test_encoding \
   --plugin test_decoding \
   2>&1 | tee $PGCOPYDB_LOG_DIR/pgcopydb_clone_follow_test_decoding_pgbench_pks_and_replica.log

Log file:

2024-01-08 09:57:24 516359 INFO   main.c:138                Running pgcopydb version 0.14 from "/usr/pgsql-15/bin/pgcopydb"
2024-01-08 09:57:24 516359 INFO   cli_common.c:1239         [SOURCE] Copying database from "postgres://repuser@drtest-pg1/pgbench?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
2024-01-08 09:57:24 516359 INFO   cli_common.c:1240         [TARGET] Copying database into "postgres://repuser@pg4-test/pgbench?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
2024-01-08 09:57:24 516359 INFO   pgsql.c:3683              Created logical replication slot "pgcopydb_pgbench_test_encoding" with plugin "test_decoding" at D/A09F2748 and exported snapshot 00000008-00008862-1
2024-01-08 09:57:24 516359 INFO   ld_stream.c:2544          create schema if not exists pgcopydb
2024-01-08 09:57:24 516359 INFO   ld_stream.c:2544          drop table if exists pgcopydb.sentinel
2024-01-08 09:57:24 516359 WARN   pgsql.c:826               NOTICE:  table "sentinel" does not exist, skipping
2024-01-08 09:57:24 516359 INFO   ld_stream.c:2544          create table pgcopydb.sentinel(startpos pg_lsn, endpos pg_lsn, apply bool,  write_lsn pg_lsn, flush_lsn pg_lsn, replay_lsn pg_lsn)
2024-01-08 09:57:24 516359 INFO   ld_stream.c:2453          Created logical replication origin "pgcopydb" at LSN D/A09F2748
2024-01-08 09:57:24 516363 INFO   cli_clone_follow.c:530    STEP 1: fetch source database tables, indexes, and sequences
2024-01-08 09:57:24 516367 INFO   ld_apply.c:355            Waiting until the pgcopydb sentinel apply is enabled
2024-01-08 09:57:24 516363 INFO   copydb_schema.c:143       Fetched information for 1 extensions
2024-01-08 09:57:24 516365 INFO   ld_stream.c:649           Resuming streaming at LSN D/A09F2748 from replication slot "pgcopydb_pgbench_test_encoding"
2024-01-08 09:57:24 516363 INFO   copydb_schema.c:440       Fetched information for 4 tables, with an estimated total of 1020  tuples and 130 MB
2024-01-08 09:57:24 516363 INFO   copydb_schema.c:466       Fetched information for 3 indexes
2024-01-08 09:57:24 516363 INFO   sequences.c:41            Fetching information for 0 sequences
2024-01-08 09:57:24 516363 INFO   cli_clone_follow.c:547    STEP 2: dump the source database schema (pre/post data)
2024-01-08 09:57:24 516363 INFO   pgcmd.c:473                /usr/pgsql-15/bin/pg_dump -Fc --snapshot 00000008-00008862-1 --section pre-data --file /transfer/pg11/pgbench/schema/pre.dump 'postgres://repuser@drtest-pg1/pgbench?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
2024-01-08 09:57:24 516365 INFO   pgsql.c:4400              Reported write_lsn D/A09F2748, flush_lsn D/A09F2748, replay_lsn 0/0
2024-01-08 09:57:25 516363 INFO   pgcmd.c:473                /usr/pgsql-15/bin/pg_dump -Fc --snapshot 00000008-00008862-1 --section post-data --file /transfer/pg11/pgbench/schema/post.dump 'postgres://repuser@drtest-pg1/pgbench?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
2024-01-08 09:57:25 516363 INFO   cli_clone_follow.c:559    STEP 3: restore the pre-data section to the target database
2024-01-08 09:57:25 516363 INFO   pgcmd.c:901                /usr/pgsql-15/bin/pg_restore --dbname 'postgres://repuser@pg4-test/pgbench?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --single-transaction --use-list /transfer/pg11/pgbench/schema/pre-filtered.list /transfer/pg11/pgbench/schema/pre.dump
2024-01-08 09:57:25 516363 INFO   indexes.c:30              STEP 6: starting 4 CREATE INDEX processes
2024-01-08 09:57:25 516363 INFO   indexes.c:31              STEP 7: constraints are built by the CREATE INDEX processes
2024-01-08 09:57:25 516376 INFO   table-data.c:420          STEP 4: starting 4 table data COPY processes
2024-01-08 09:57:25 516363 INFO   vacuum.c:32               STEP 8: starting 4 VACUUM processes
2024-01-08 09:57:25 516363 INFO   sequences.c:104           STEP 9: reset sequences values
2024-01-08 09:57:25 516389 INFO   blobs.c:189               STEP 5: starting 4 Large Objects workers
2024-01-08 09:57:25 516390 INFO   sequences.c:172           Reset sequences values on the target database
2024-01-08 09:57:25 516389 INFO   blobs.c:501               Added 0 large objects to the queue
2024-01-08 09:57:27 516363 INFO   cli_clone_follow.c:581    STEP 10: restore the post-data section to the target database
2024-01-08 09:57:27 516363 INFO   pgcmd.c:901                /usr/pgsql-15/bin/pg_restore --dbname 'postgres://repuser@pg4-test/pgbench?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --single-transaction --use-list /transfer/pg11/pgbench/schema/post-filtered.list /transfer/pg11/pgbench/schema/post.dump
2024-01-08 09:57:27 516363 INFO   cli_clone_follow.c:599    Updating the pgcopydb.sentinel to enable applying changes

  OID | Schema |             Name | copy duration | indexes | create index duration
------+--------+------------------+---------------+---------+----------------------
39023 | public | pgbench_accounts |         1s493 |       1 |                 269ms
39017 | public |  pgbench_history |         173ms |       0 |                   0ms
39026 | public | pgbench_branches |          16ms |       1 |                  24ms
39020 | public |  pgbench_tellers |          37ms |       1 |                  16ms


                                               Step   Connection    Duration   Concurrency
 --------------------------------------------------   ----------  ----------  ------------
                                        Dump Schema       source        39ms             1
   Catalog Queries (table ordering, filtering, etc)       source       314ms             1
                                     Prepare Schema       target        43ms             1
      COPY, INDEX, CONSTRAINTS, VACUUM (wall clock)         both       1s907         4 + 8
                                  COPY (cumulative)         both       1s719             4
                         Large Objects (cumulative)         both       156ms             4
             CREATE INDEX, CONSTRAINTS (cumulative)       target       309ms             4
                                    Finalize Schema       target       450ms             1
 --------------------------------------------------   ----------  ----------  ------------
                          Total Wall Clock Duration         both       2s783         4 + 8
 --------------------------------------------------   ----------  ----------  ------------

2024-01-08 09:57:27 516367 INFO   ld_apply.c:365            The pgcopydb sentinel has enabled applying changes
2024-01-08 09:57:27 516367 INFO   ld_apply.c:1277           Setting up previous LSN from replication origin "pgcopydb" progress at D/A09F2748
2024-01-08 09:57:27 516367 INFO   ld_apply.c:254            Catchup-up with changes from LSN D/A09F2748
2024-01-08 09:57:27 516367 INFO   ld_apply.c:93             File "/transfer/pg11/pgbench/cdc/000000030000000D000000A0.sql" does not exists yet, exit
2024-01-08 09:57:28 516365 ERROR  pgsql.c:1511              Postgres query was interrupted: update pgcopydb.sentinel set startpos = $2, write_lsn = $1, flush_lsn = $2 returning startpos, endpos, apply, write_lsn, flush_lsn, replay_lsn
2024-01-08 09:57:28 516365 ERROR  pgsql.c:5336              Failed to fetch pgcopydb.sentinel current values
2024-01-08 09:57:28 516365 INFO   pgsql.c:4400              Reported write_lsn D/A65E0FC8, flush_lsn D/A65E0FC8, replay_lsn 0/0
2024-01-08 09:57:28 516365 INFO   ld_stream.c:522           Streamed up to write_lsn D/A65E0FC8, flush_lsn D/A65E0FC8, stopping: endpos is 0/0
2024-01-08 09:57:28 516365 INFO   follow.c:674              Prefetch process has terminated
2024-01-08 09:57:28 516364 INFO   follow.c:616              Subprocesses for prefetch, transform, and catchup have now all exited
2024-01-08 09:57:28 516364 INFO   follow.c:426              Catching-up from existing on-disk files
2024-01-08 09:57:28 516428 INFO   ld_transform.c:983        Transformed 2 JSON messages into SQL file "/transfer/pg11/pgbench/cdc/000000030000000D000000A0.sql"
2024-01-08 09:57:28 516428 INFO   ld_transform.c:983        Transformed 1 JSON messages into SQL file "/transfer/pg11/pgbench/cdc/000000030000000D000000A6.sql"
2024-01-08 09:57:28 516430 INFO   ld_apply.c:1277           Setting up previous LSN from replication origin "pgcopydb" progress at D/A09F2748
2024-01-08 09:57:28 516430 INFO   ld_apply.c:254            Catchup-up with changes from LSN D/A09F2748
2024-01-08 09:57:28 516430 INFO   ld_apply.c:540            Replaying changes from file "/transfer/pg11/pgbench/cdc/000000030000000D000000A0.sql"
2024-01-08 09:57:28 516430 INFO   ld_apply.c:131            Apply reached D/A0A192A8 in "/transfer/pg11/pgbench/cdc/000000030000000D000000A0.sql"
2024-01-08 09:57:28 516430 INFO   ld_apply.c:540            Replaying changes from file "/transfer/pg11/pgbench/cdc/000000030000000D000000A6.sql"
2024-01-08 09:57:28 516430 INFO   ld_apply.c:131            Apply reached D/A65E0FC8 in "/transfer/pg11/pgbench/cdc/000000030000000D000000A6.sql"
2024-01-08 09:57:28 516430 INFO   ld_apply.c:149            Reached end of file "/transfer/pg11/pgbench/cdc/000000030000000D000000A6.sql" at D/A65E0FC8.
2024-01-08 09:57:29 516364 INFO   follow.c:230              Current sentinel replay_lsn is D/A65E0FC8
2024-01-08 09:57:29 516364 INFO   follow.c:369              Restarting logical decoding follower in replay mode
2024-01-08 09:57:29 516432 INFO   ld_stream.c:2014          Resuming streaming from latest file "/transfer/pg11/pgbench/cdc/000000030000000D000000A6.json"
2024-01-08 09:57:29 516433 INFO   ld_transform.c:983        Transformed 1 JSON messages into SQL file "/transfer/pg11/pgbench/cdc/000000030000000D000000A6.sql"
2024-01-08 09:57:29 516432 INFO   ld_stream.c:694           Resuming streaming at LSN D/A65E0FC8 from first message with that LSN read in JSON file "/transfer/pg11/pgbench/cdc/000000030000000D000000A6.json", line 0
2024-01-08 09:57:29 516434 INFO   ld_apply.c:1286           Setting up previous LSN from replication origin "pgcopydb" progress at D/A09F2748, overriding previous value D/A65E0FC8
2024-01-08 09:57:29 516434 INFO   ld_apply.c:254            Replaying changes from LSN D/A09F2748
2024-01-08 09:57:29 516432 INFO   pgsql.c:4400              Reported write_lsn D/A65E0FC8, flush_lsn D/A65E0FC8, replay_lsn D/A65E0FC8
2024-01-08 09:57:40 516432 INFO   pgsql.c:4400              Reported write_lsn D/A65E17D0, flush_lsn D/A65E0FC8, replay_lsn D/A65E0FC8
2024-01-08 09:57:50 516432 INFO   pgsql.c:4400              Reported write_lsn D/A65E1DA8, flush_lsn D/A65E13F8, replay_lsn D/A65E13F8

Compare data:

$ pgcopydb compare data --notice --dir "$PGCOPYDB_DIR"
10:01:30 516728 INFO   Running pgcopydb version 0.14 from "/usr/pgsql-15/bin/pgcopydb"
10:01:30 516728 NOTICE Using work dir "/transfer/pg11/pgbench"
10:01:30 516728 NOTICE Work directory "/transfer/pg11/pgbench" already exists
10:01:30 516728 INFO   A previous run has run through completion
10:01:30 516728 INFO   SOURCE: Connecting to "postgres://repuser@drtest-pg1/pgbench?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
10:01:30 516728 INFO   Fetched information for 4 tables, with an estimated total of 1020  tuples and 130 MB
10:01:30 516728 NOTICE Skipping target catalog preparation
10:01:30 516728 INFO   starting 4 table compare processes
10:01:30 516730 NOTICE Started table worker 516730 [516728]
10:01:30 516731 NOTICE Started table worker 516731 [516728]
10:01:30 516732 NOTICE Started table worker 516732 [516728]
10:01:30 516733 NOTICE Started table worker 516733 [516728]
10:01:30 516730 NOTICE public.pgbench_branches: 10 rows, checksum 7c520291-568e-c529-5e92-6aa7ff0de93
10:01:30 516733 NOTICE public.pgbench_history: 20000 rows, checksum 5229ca74-fa99-7982-0c41-45d50865d37
10:01:30 516731 NOTICE public.pgbench_tellers: 100 rows, checksum 05ce2173-65e2-6dd6-d7cc-c81d6e10bfb
10:01:30 516732 NOTICE public.pgbench_accounts: 1000000 rows, checksum ebf5d608-41d7-0821-f68e-85c41f201e8
10:01:30 516728 NOTICE Storing migration schema in JSON file "/transfer/pg11/pgbench/schema.json"
                    Table Name | ! |                      Source Checksum |                      Target Checksum
-------------------------------+---+--------------------------------------+-------------------------------------
       public.pgbench_accounts |   |  ebf5d608-41d7-0821-f68e-85c41f201e8 |  ebf5d608-41d7-0821-f68e-85c41f201e8
        public.pgbench_history |   |  5229ca74-fa99-7982-0c41-45d50865d37 |  5229ca74-fa99-7982-0c41-45d50865d37
       public.pgbench_branches |   |  7c520291-568e-c529-5e92-6aa7ff0de93 |  7c520291-568e-c529-5e92-6aa7ff0de93
        public.pgbench_tellers |   |  05ce2173-65e2-6dd6-d7cc-c81d6e10bfb |  05ce2173-65e2-6dd6-d7cc-c81d6e10bfb

Insert new data into pg11:

$ pgbench -c 1 -t 20000 --no-vacuum pgbench

Logs:

$ cat $PGCOPYDB_LOG_DIR/pgcopydb_clone_follow_test_decoding_pgbench_pks_and_replica.log
...
2024-01-08 10:07:52 516432 INFO   pgsql.c:4400              Reported write_lsn D/A662A7E0, flush_lsn D/A66291C8, replay_lsn D/A6628BB8
2024-01-08 10:08:02 516432 INFO   pgsql.c:4400              Reported write_lsn D/A662ADB8, flush_lsn D/A662A408, replay_lsn D/A662A408
2024-01-08 10:08:12 516433 ERROR  ld_test_decoding.c:979    Failed to parse decoding message for UPDATE on table public.pgbench_accounts which is not in our catalogs
2024-01-08 10:08:12 516433 ERROR  ld_test_decoding.c:501    Failed to parse UPDATE new-tuple columns for logical message table public.pgbench_accounts: UPDATE: aid[integer]:352867 bid[integer]:4 abalance[integer]:-3280 filler[character]:'                                                                                    '
2024-01-08 10:08:12 516433 ERROR  ld_test_decoding.c:272    Failed to parse test_decoding UPDATE message: table public.pgbench_accounts: UPDATE: aid[integer]:352867 bid[integer]:4 abalance[integer]:-3280 filler[character]:'                                                                                    '
2024-01-08 10:08:12 516433 ERROR  ld_transform.c:1236       Failed to parse test_decoding message, see above for details
2024-01-08 10:08:12 516433 ERROR  ld_transform.c:472        Failed to parse JSON message: {"action":"U","xid":"0","lsn":"D/A662B408","timestamp":"2024-01-08 10:08:12.672143+0000","message":"table public.pgbench_accounts: UPDATE: aid[integer]:352867 bid[integer]:4 abalance[integer]:-3280 filler[character]:'                                                                                    '"}
2024-01-08 10:08:12 516433 ERROR  ld_transform.c:100        Failed to transform JSON messages from input stream, see above for details
2024-01-08 10:08:12 516364 ERROR  follow.c:964              Subprocess transform with pid 516433 has exited with error code 12

Compare data:

$ pgcopydb compare data --notice --dir "$PGCOPYDB_DIR"
10:12:04 517455 INFO   Running pgcopydb version 0.14 from "/usr/pgsql-15/bin/pgcopydb"
10:12:04 517455 NOTICE Using work dir "/transfer/pg11/pgbench"
10:12:04 517455 NOTICE Work directory "/transfer/pg11/pgbench" already exists
10:12:04 517455 INFO   A previous run has run through completion
10:12:04 517455 INFO   SOURCE: Connecting to "postgres://repuser@drtest-pg1/pgbench?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
10:12:04 517455 INFO   Fetched information for 4 tables, with an estimated total of 1040  tuples and 132 MB
10:12:04 517455 NOTICE Skipping target catalog preparation
10:12:04 517455 INFO   starting 4 table compare processes
10:12:04 517457 NOTICE Started table worker 517457 [517455]
10:12:04 517458 NOTICE Started table worker 517458 [517455]
10:12:04 517459 NOTICE Started table worker 517459 [517455]
10:12:04 517460 NOTICE Started table worker 517460 [517455]
10:12:04 517459 ERROR  Table public.pgbench_tellers has checksum a07af71f-0ed1-fdee-17a6-fdfb1134ca7 on source, 05ce2173-65e2-6dd6-d7cc-c81d6e10bfb on target
10:12:04 517459 NOTICE public.pgbench_tellers: 100 rows, checksum a07af71f-0ed1-fdee-17a6-fdfb1134ca7
10:12:04 517457 ERROR  Table public.pgbench_history has 40000 rows on source, 20000 rows on target
10:12:04 517457 NOTICE public.pgbench_history: 40000 rows, checksum a7b60678-cd1b-1000-e4f3-4509feb7f6b
10:12:04 517458 ERROR  Table public.pgbench_branches has checksum 9ef9752b-5fef-195c-9546-9a18d8c3e72 on source, 7c520291-568e-c529-5e92-6aa7ff0de93 on target
10:12:04 517458 NOTICE public.pgbench_branches: 10 rows, checksum 9ef9752b-5fef-195c-9546-9a18d8c3e72
10:12:05 517460 ERROR  Table public.pgbench_accounts has checksum 19d9562c-545e-1a99-c7df-d718361eee8 on source, ebf5d608-41d7-0821-f68e-85c41f201e8 on target
10:12:05 517460 NOTICE public.pgbench_accounts: 1000000 rows, checksum 19d9562c-545e-1a99-c7df-d718361eee8
10:12:05 517455 NOTICE Storing migration schema in JSON file "/transfer/pg11/pgbench/schema.json"
                    Table Name | ! |                      Source Checksum |                      Target Checksum
-------------------------------+---+--------------------------------------+-------------------------------------
       public.pgbench_accounts | ! |  19d9562c-545e-1a99-c7df-d718361eee8 |  ebf5d608-41d7-0821-f68e-85c41f201e8
        public.pgbench_history | ! |  a7b60678-cd1b-1000-e4f3-4509feb7f6b |  5229ca74-fa99-7982-0c41-45d50865d37
       public.pgbench_branches | ! |  9ef9752b-5fef-195c-9546-9a18d8c3e72 |  7c520291-568e-c529-5e92-6aa7ff0de93
        public.pgbench_tellers | ! |  a07af71f-0ed1-fdee-17a6-fdfb1134ca7 |  05ce2173-65e2-6dd6-d7cc-c81d6e10bfb

branecc avatar Jan 08 '24 11:01 branecc

Hi @branecc ; what happens is that the wal2json output plugin includes enough information (pkey) with each message for the client (pgcopydb here) to re-build the SQL command whereas test_decoding does not, so the client program has to maintain that information itself.

Please try to build from current sources, where I believe thanks to maintaining a local cache of the source table catalogs on-disk (using SQLite), we should not see that problem anymore.

dimitri avatar Jan 10 '24 11:01 dimitri

Advice please!

Which output plugin is recommended: test_decoding or wal2json?

I am trying to migrate databases from pg11 to pg15. I tried native logical replication and had problems with errors like "ERROR: all replication slots are in use" HINT: Free one or increase max_replication_slots Increasing max_replication_slots requires a server restart. I can't afford that. Especially since you cannot know how many replication slots postgres will create. A new replication slot is created for each table. We have parallelism turned on. So we have one main and N temporary replication slots. It seems to me that postgres is not releasing replication slots correctly. As far as I could find on the internet, others had similar errors. The bug seems to be fixed with postgres version 14 (not tested), which is of no use to me.

I will try with pgcopydb. My database migration solution with native logical replication involves a lot of manual work. pgcopydb is much more sophisticated with good documentation.

So: test_decoding or wal2json?

branecc avatar Jan 10 '24 13:01 branecc

At the moment unfortunately neither test_decoding nor wal2json has the wanted feature where “it just works”. Consider building wal2json from sources to benefit from https://github.com/eulerto/wal2json/pull/255 and then use pgcopydb clone --follow --plugin wal2json --wal2json-numeric-as-string ... to avoid pitfalls with numbers representation in JSON. I believe that's the best way to do it at this time.

dimitri avatar Jan 10 '24 15:01 dimitri