End report after a db clone shows higher numbers than expected
While transferring a database using a build of the main branch, I noticed the end report showing inflated numbers. Here is the estimate logged at the beginning, which matches what we expect for our ~150 GB database:
2024-03-05 13:28:35.885 75 INFO copydb_schema.c:773 Table public.pgbench_accounts is 125 GB large, 2687 COPY processes will be used, partitioning on aid.
2024-03-05 13:28:35.886 75 INFO copydb_schema.c:665 Fetched information for 4 tables (including 1 tables split in 2687 partitions total), with an estimated total of 1000 million tuples and 125 GB on-disk
But as you can see below, the end report shows wildly inflated numbers, well into the terabytes.
For example, transmitted bytes for the first table shows 248 TB, and the Transfer column for COPY (cumulative) shows 328 TB.
Also, the copy duration for pgbench_accounts shows 2939 days, which looks like part of the same math problem.
These numbers only started appearing in 0.15 and later; this test was done on a recent main build: Running pgcopydb version 0.15.32.g67df98b from "/usr/local/bin/pgcopydb".
2024-03-05 17:32:00.891 75 INFO pgcmd.c:998 /usr/bin/pg_restore --dbname 'postgres://target@target_host:5432/d8v0iusebghvok?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --single-transaction --no-owner --no-acl --exclude-schema _heroku --use-list /tmp/pgcopydb/schema/post-filtered.list /tmp/pgcopydb/schema/post.dump
2024-03-05 17:32:00.968 75 INFO cli_clone_follow.c:617 All step are now done, 4h03m elapsed
2024-03-05 17:32:00.969 75 INFO summary.c:3152 Printing summary for 4 tables and 3 indexes
2024-03-05 17:32:02.017 75 NOTICE summary.c:2835 Storing migration summary in JSON file "/tmp/pgcopydb/summary.json"
OID | Schema | Name | Parts | copy duration | transmitted bytes | indexes | create index duration
------+--------+------------------+-------+---------------+-------------------+---------+----------------------
16530 | public | pgbench_accounts | 2687 | 2939d16h | 248 TB | 1 | 46m17s
16527 | public | pgbench_tellers | 1 | 4s375 | 1540 kB | 1 | 14s
16533 | public | pgbench_branches | 1 | 3s193 | 96 kB | 1 | 2s309
16524 | public | pgbench_history | 1 | 3s127 | 0 B | 0 | 0ms
Step Connection Duration Transfer Concurrency
-------------------------------------------------- ---------- ---------- ---------- ------------
Catalog Queries (table ordering, filtering, etc) source 1m05s 1
Dump Schema source 611ms 1
Prepare Schema target 120ms 1
COPY, INDEX, CONSTRAINTS, VACUUM (wall clock) both 4h03m 17
COPY (cumulative) both 1d02h 328 TB 8
CREATE INDEX (cumulative) target 46m31s 1
CONSTRAINTS (cumulative) target 2s001 1
VACUUM (cumulative) target 0ms 8
Reset Sequences both 29ms 1
Large Objects (cumulative) both 4
Finalize Schema both 98ms 1
-------------------------------------------------- ---------- ---------- ---------- ------------
Total Wall Clock Duration both 4h03m 22
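If my arithmetic is right, the inflated figures scale with the part count: 2687 parts × 125 GB = 335,875 GB, which displays as 328 TB in 1024-based units, i.e. the Transfer figure for COPY (cumulative). Likewise 2939d16h works out to roughly 26 hours per part across 2687 parts, which matches the 1d02h cumulative COPY duration, and 248 TB across 2687 parts is roughly 94 GB per part, plausibly the whole table's COPY wire size. So my guess is that each per-part COPY process registers totals for the entire table rather than for its own slice.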
Is anyone else seeing this issue?
To help bisect this, I created a couple of tests: one with a small amount of data, and one with enough data that the table gets split into parts.
Pgcopydb args used:
pgcopydb clone \
--drop-if-exists \
--dir /tmp/pgcopydb \
--fail-fast \
--filters filters.ini \
--index-jobs 1 \
--no-owner \
--no-acl \
--no-role-passwords \
--skip-ext-comments \
--skip-large-objects \
--skip-vacuum \
--split-tables-larger-than 50000000 \
--table-jobs 8
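(Note: --split-tables-larger-than 50000000 is 50,000,000 bytes, roughly 47 MB, which is why the scale-10 run below logs "Splitting source candidate tables larger than 47 MB" and splits the 128 MB pgbench_accounts table into 3 parts.)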
With:
pgbench -i -s 1
We got:
17:12:29.360 340 INFO Running pgcopydb version 0.15.32.g67df98b from "/usr/local/bin/pgcopydb"
17:12:29.387 340 INFO [SOURCE] Copying database from "postgres://source@source_host:5432/sourcedb?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
17:12:29.387 340 INFO [TARGET] Copying database into "postgres://target@target_host:5432/targetdb?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
17:12:29.516 340 INFO Using work dir "/tmp/pgcopydb"
17:12:29.520 366 INFO STEP 1: fetch source database tables, indexes, and sequences
17:12:29.538 366 INFO A previous run has run through completion
17:12:32.771 366 INFO Found 3 indexes (supporting 3 constraints) in the target database
17:12:32.787 366 INFO STEP 2: dump the source database schema (pre/post data)
17:12:32.791 366 INFO Skipping pg_dump --section=pre-data, as "/tmp/pgcopydb/schema/pre.dump" already exists
17:12:32.791 366 INFO Skipping pg_dump --section=post-data, as "/tmp/pgcopydb/schema/post.dump" already exists
17:12:32.794 366 INFO STEP 3: restore the pre-data section to the target database
17:12:33.662 366 INFO ALTER DATABASE "targetdb" SET "search_path" TO '$user', 'public', 'heroku_ext';
17:12:34.080 366 INFO Drop tables on the target database, per --drop-if-exists
17:12:34.978 366 INFO /usr/bin/pg_restore --dbname 'postgres://target@target_host:5432/targetdb?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --single-transaction --clean --if-exists --no-owner --no-acl --use-list /tmp/pgcopydb/schema/pre-filtered.list /tmp/pgcopydb/schema/pre.dump
17:12:40.427 366 INFO Skipping tables, indexes, and sequences, already done on a previous run
17:12:40.427 366 INFO STEP 10: restore the post-data section to the target database
17:12:40.564 366 INFO /usr/bin/pg_restore --dbname 'postgres://target@target_host:5432/targetdb?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --single-transaction --clean --if-exists --no-owner --no-acl --use-list /tmp/pgcopydb/schema/post-filtered.list /tmp/pgcopydb/schema/post.dump
17:12:44.167 366 INFO All step are now done, 14s elapsed
17:12:44.168 366 INFO Printing summary for 4 tables and 3 indexes
OID | Schema | Name | Parts | copy duration | transmitted bytes | indexes | create index duration
------+--------+------------------+-------+---------------+-------------------+---------+----------------------
17367 | public | pgbench_accounts | 1 | 5s690 | 9266 kB | 1 | 341ms
17364 | public | pgbench_tellers | 1 | 3s289 | 91 B | 1 | 279ms
17370 | public | pgbench_branches | 1 | 3s225 | 7 B | 1 | 2s040
17361 | public | pgbench_history | 1 | 5s111 | 0 B | 0 | 0ms
Step Connection Duration Transfer Concurrency
-------------------------------------------------- ---------- ---------- ---------- ------------
Catalog Queries (table ordering, filtering, etc) source 9d12h 1
Dump Schema source 9d12h 1
Prepare Schema target 9d12h 1
COPY, INDEX, CONSTRAINTS, VACUUM (wall clock) both 17
COPY (cumulative) both 17s 12 MB 8
CREATE INDEX (cumulative) target 2s199 1
CONSTRAINTS (cumulative) target 461ms 1
VACUUM (cumulative) target 0ms 8
Reset Sequences both 1s120 1
Large Objects (cumulative) both 4
Finalize Schema both 9d12h 1
-------------------------------------------------- ---------- ---------- ---------- ------------
Total Wall Clock Duration both 14s 22
The 12 MB transfer seems correct here. But several of the step durations show over 9 days.
Now with:
pgbench -i -s 10
We get a 128 MB table reported in the initial log for pgbench_accounts, but the report at the end shows roughly two to three times that amount (384 MB for COPY cumulative and 274 MB for transmitted bytes); notably, 384 MB is exactly 3 × 128 MB. So this seems to have something to do with going from 1 part to 3 here (see the sketch after the summary below):
17:12:53.810 413 INFO Running pgcopydb version 0.15.32.g67df98b from "/usr/local/bin/pgcopydb"
17:12:53.839 413 INFO [SOURCE] Copying database from "postgres://source@source_host:5432/sourcedb?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
17:12:53.840 413 INFO [TARGET] Copying database into "postgres://target@target_host:5432/targetdb?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
17:12:53.968 413 INFO Using work dir "/tmp/pgcopydb"
17:12:55.301 413 INFO Exported snapshot "00000003-022F8B98-1" from the source database
17:12:56.069 439 INFO STEP 1: fetch source database tables, indexes, and sequences
17:12:58.152 439 INFO Splitting source candidate tables larger than 47 MB
17:12:58.291 439 INFO Table public.pgbench_accounts is 128 MB large, 3 COPY processes will be used, partitioning on aid.
17:12:58.305 439 INFO Fetched information for 4 tables (including 1 tables split in 3 partitions total), with an estimated total of 1000 thousands tuples and 128 MB on-disk
17:12:58.438 439 INFO Fetched information for 3 indexes (supporting 3 constraints)
17:12:58.560 439 INFO Fetching information for 0 sequences
17:12:58.833 439 INFO Fetched information for 2 extensions
17:13:00.418 439 INFO Found 0 indexes (supporting 0 constraints) in the target database
17:13:00.421 439 INFO STEP 2: dump the source database schema (pre/post data)
17:13:00.423 439 INFO /usr/bin/pg_dump -Fc --snapshot 00000003-022F8B98-1 --section pre-data --exclude-schema heroku_ext --file /tmp/pgcopydb/schema/pre.dump 'postgres://source@source_host:5432/sourcedb?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
17:13:11.540 439 INFO /usr/bin/pg_dump -Fc --snapshot 00000003-022F8B98-1 --section post-data --exclude-schema heroku_ext --file /tmp/pgcopydb/schema/post.dump 'postgres://source@source_host:5432/sourcedb?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
17:13:23.025 439 INFO STEP 3: restore the pre-data section to the target database
17:13:23.900 439 INFO ALTER DATABASE "targetdb" SET "search_path" TO '$user', 'public', 'heroku_ext';
17:13:24.352 439 INFO Drop tables on the target database, per --drop-if-exists
17:13:25.227 439 INFO /usr/bin/pg_restore --dbname 'postgres://target@target_host:5432/targetdb?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --single-transaction --clean --if-exists --no-owner --no-acl --use-list /tmp/pgcopydb/schema/pre-filtered.list /tmp/pgcopydb/schema/pre.dump
17:13:30.439 485 INFO STEP 4: starting 8 table-data COPY processes
17:13:30.442 439 INFO Skipping large objects, per --skip-blobs
17:13:30.442 439 INFO STEP 9: reset sequences values
17:13:30.449 488 INFO STEP 8: skipping VACUUM jobs per --skip-vacuum
17:13:30.451 491 INFO Reset sequences values on the target database
17:13:30.453 486 INFO STEP 6: starting 1 CREATE INDEX processes
17:13:30.453 486 INFO STEP 7: constraints are built by the CREATE INDEX processes
17:14:22.023 439 INFO STEP 10: restore the post-data section to the target database
17:14:22.171 439 INFO /usr/bin/pg_restore --dbname 'postgres://target@target_host:5432/targetdb?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --single-transaction --clean --if-exists --no-owner --no-acl --use-list /tmp/pgcopydb/schema/post-filtered.list /tmp/pgcopydb/schema/post.dump
17:14:25.678 439 INFO All step are now done, 1m29s elapsed
17:14:25.678 439 INFO Printing summary for 4 tables and 3 indexes
OID | Schema | Name | Parts | copy duration | transmitted bytes | indexes | create index duration
------+--------+------------------+-------+---------------+-------------------+---------+----------------------
17465 | public | pgbench_accounts | 3 | 6m35s | 274 MB | 1 | 871ms
17462 | public | pgbench_tellers | 1 | 11s | 1002 B | 1 | 5s843
17468 | public | pgbench_branches | 1 | 11s | 71 B | 1 | 8s226
17459 | public | pgbench_history | 1 | 11s | 0 B | 0 | 0ms
Step Connection Duration Transfer Concurrency
-------------------------------------------------- ---------- ---------- ---------- ------------
Catalog Queries (table ordering, filtering, etc) source 4s242 1
Dump Schema source 22s 1
Prepare Schema target 7s393 1
COPY, INDEX, CONSTRAINTS, VACUUM (wall clock) both 51s 17
COPY (cumulative) both 2m45s 384 MB 8
CREATE INDEX (cumulative) target 6s897 1
CONSTRAINTS (cumulative) target 8s043 1
VACUUM (cumulative) target 0ms 8
Reset Sequences both 1s144 1
Large Objects (cumulative) both 4
Finalize Schema both 3s650 1
-------------------------------------------------- ---------- ---------- ---------- ------------
Total Wall Clock Duration both 1m29s 22
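To illustrate the pattern I think I'm seeing in the byte counts, here is a minimal standalone C sketch. It is hypothetical, not pgcopydb's actual code, just the accounting difference I suspect: if each per-part COPY process registers the whole table's size instead of its own slice, the summed total scales with the part count.
#include <stdint.h>
#include <stdio.h>

/* Hypothetical illustration, not pgcopydb code: compare summing per-slice
 * byte counts against registering the whole-table size once per part. */
#define PART_COUNT 3
#define TABLE_BYTES (128ULL * 1024 * 1024)  /* the 128 MB pgbench_accounts */

int main(void)
{
    uint64_t perSlice = 0;  /* each part registers only its own slice */
    uint64_t perTable = 0;  /* each part registers the full table size */

    for (int part = 0; part < PART_COUNT; part++)
    {
        /* give the first part the division remainder so slices sum exactly */
        uint64_t slice = TABLE_BYTES / PART_COUNT
            + (part == 0 ? TABLE_BYTES % PART_COUNT : 0);

        perSlice += slice;        /* expected: sums back to 128 MB */
        perTable += TABLE_BYTES;  /* suspected bug: 3 x 128 MB = 384 MB */
    }

    /* prints 128 MB vs 384 MB; 384 MB matches the COPY (cumulative)
     * Transfer figure in the summary above */
    printf("per-slice sum: %llu MB\n", (unsigned long long) (perSlice >> 20));
    printf("per-table sum: %llu MB\n", (unsigned long long) (perTable >> 20));

    return 0;
}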
However, the durations seem more accurate this time. Very strange.
I have figured out how to replicate the duration problem for "Catalog Queries (table ordering, filtering, etc)" and "Dump Schema". You'll notice that in my scale-1 run above I already had a pre.dump file, so it seems that re-running a clone over an existing work directory produces bogus durations.
In our real workflow we start with a snapshot, copy extensions, and then run the clone (we're on 0.14, where this works around the extension issues), so I think that's why we hit this bug.
First time:
18:16:40.017 65 INFO All step are now done, 46s elapsed
18:16:40.018 65 INFO Printing summary for 4 tables and 3 indexes
OID | Schema | Name | Parts | copy duration | transmitted bytes | indexes | create index duration
------+--------+------------------+-------+---------------+-------------------+---------+----------------------
17785 | public | pgbench_accounts | 1 | 5s521 | 9266 kB | 1 | 301ms
17782 | public | pgbench_tellers | 1 | 1s032 | 91 B | 1 | 4s118
17788 | public | pgbench_branches | 1 | 2s817 | 7 B | 1 | 312ms
17779 | public | pgbench_history | 1 | 2s804 | 0 B | 0 | 0ms
Step Connection Duration Transfer Concurrency
-------------------------------------------------- ---------- ---------- ---------- ------------
Catalog Queries (table ordering, filtering, etc) source 3s961 1
Dump Schema source 22s 1
Prepare Schema target 7s366 1
COPY, INDEX, CONSTRAINTS, VACUUM (wall clock) both 9s766 17
COPY (cumulative) both 12s 12 MB 8
CREATE INDEX (cumulative) target 2s272 1
CONSTRAINTS (cumulative) target 2s459 1
VACUUM (cumulative) target 0ms 8
Reset Sequences both 1s102 1
Large Objects (cumulative) both 4
Finalize Schema both 3s580 1
-------------------------------------------------- ---------- ---------- ---------- ------------
Total Wall Clock Duration both 46s 22
And the second time, same args as above with /tmp/pgcopydb still around:
18:16:51.213 173 INFO Printing summary for 4 tables and 3 indexes
OID | Schema | Name | Parts | copy duration | transmitted bytes | indexes | create index duration
------+--------+------------------+-------+---------------+-------------------+---------+----------------------
17785 | public | pgbench_accounts | 1 | 5s521 | 9266 kB | 1 | 301ms
17782 | public | pgbench_tellers | 1 | 1s032 | 91 B | 1 | 4s118
17788 | public | pgbench_branches | 1 | 2s817 | 7 B | 1 | 312ms
17779 | public | pgbench_history | 1 | 2s804 | 0 B | 0 | 0ms
Step Connection Duration Transfer Concurrency
-------------------------------------------------- ---------- ---------- ---------- ------------
Catalog Queries (table ordering, filtering, etc) source 9d13h 1
Dump Schema source 9d13h 1
Prepare Schema target 1
COPY, INDEX, CONSTRAINTS, VACUUM (wall clock) both 17
COPY (cumulative) both 12s 12 MB 8
CREATE INDEX (cumulative) target 2s272 1
CONSTRAINTS (cumulative) target 2s459 1
VACUUM (cumulative) target 0ms 8
Reset Sequences both 1s102 1
Large Objects (cumulative) both 4
Finalize Schema both 1
-------------------------------------------------- ---------- ---------- ---------- ------------
Total Wall Clock Duration both 3s219 22
This feels like 2 bugs; I'm happy to split it into separate issues if required.
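In case it helps whoever picks this up, my guess at the duration mechanism, as a hypothetical C sketch (the names are invented, this is not pgcopydb's code): steps skipped on a re-run never record fresh timestamps, so the summary apparently computes a duration against a stale or unset stored value. Guarding on the stored timestamps would avoid printing those multi-day figures.
#include <stdint.h>

/* Hypothetical sketch with invented names: only compute a step duration
 * when both timestamps were recorded during this invocation, so steps
 * skipped on a re-run report 0 instead of garbage. */
typedef struct StepTiming
{
    uint64_t startTimeMs;   /* 0 when the step never ran in this invocation */
    uint64_t doneTimeMs;    /* 0 when the step never finished */
} StepTiming;

static uint64_t
step_duration_ms(StepTiming *timing)
{
    if (timing->startTimeMs == 0 ||
        timing->doneTimeMs < timing->startTimeMs)
    {
        return 0;   /* skipped step, or stale/inconsistent timestamps */
    }
    return timing->doneTimeMs - timing->startTimeMs;
}

int main(void)
{
    StepTiming skipped = { 0, 0 };     /* step not run on this pass */
    StepTiming ran = { 1000, 3500 };   /* ran for 2.5 seconds */

    /* exit 0 when both cases behave as intended */
    return (step_duration_ms(&skipped) == 0 &&
            step_duration_ms(&ran) == 2500) ? 0 : 1;
}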
Hi @kbarber; thanks for the report! I think either the SQL queries that are used to compute the sums are wrong, or maybe the way we register the data in the SQLite catalogs is wrong... or maybe both? Hey @hanefi, do you think you could have a look maybe?