pglogical
Slow initial data sync problems
I am attempting to use pglogical on AWS RDS to migrate a 9.6 instance (pglogical 2.2.2) to 13.3 (pglogical 2.3.3). The provider database size (tables, indexes, etc) is around 2TB. We're seeing what we believe to be a very slow initial data sync between the two databases. In ~19 hours, only ~95GB has been synced between the two databases. We don't see any obvious bottlenecks here.
I/O on the provider is pretty low: https://user-images.githubusercontent.com/6271918/124607416-d3a25500-de3b-11eb-928d-8a30894baf14.png
Write I/O on the subscriber has been much higher, at a steady 40-50 MB/sec: https://user-images.githubusercontent.com/6271918/124607564-f92f5e80-de3b-11eb-8898-f38855143d87.png
This is interesting, because with sustained write I/O of 40-50 MB/sec, I would expect the subscriber to have loaded much more than 95GB in 19 hours. What am I missing?
CPU usage on the provider is very low, and somewhat high on the subscriber, but it's not pegged: https://user-images.githubusercontent.com/6271918/124607669-1106e280-de3c-11eb-9f5d-d2f46eae8cd9.png
The instance types for both the subscriber and provider are m6g.large which appear to be capable of MUCH more I/O.
Relevant Postgres (parameter group) settings:
max_wal_senders: 20
max_wal_size: 2048
max_replication_slots: 10
max_worker_processes: 8
Is this performance expected? Is there anything I can do to speed it up?
A 2 TB migration should be rehearsed before it is executed in production so you know how long it's going to take. It's hard to guess at what the RDS bottlenecks are. Have you tried asking AWS for production support?
Have you considered taking a snapshot first and parallel bulk loading it? Then you can catch up the PG13 snapshot with the incremental transactions that have occurred while the initialization has taken place. In my experience, planning and rehearsal are the way to avoid the dilemma you are in. Right now it's looking like you aren't going to know for sure for several weeks whether your first "trial" is going to work (it likely will not).
--Luss (Former Product Manager for Postgres at AWS)
Hi @luss! We met in NYC a few years ago at an AWS event when the RDS Postgres Customer Advisory Board was established.
Yes - we are rehearsing the upgrade now. This isn't being done in production. No, we haven't engaged RDS support yet, but we will.
Could you clarify the snapshot-workflow that you're referring to? I think you're describing something like this:
- Restore a snapshot of the provider (PG9) database to a new PG9 instance
- Upgrade the new PG9 instance to PG13 (using the "click" method via the AWS console/API/etc)
- Configure pglogical replication between the provider (PG9) and subscriber (now PG13) databases
At this point, are you saying that pglogical will sync any changes that have occurred since the snapshot was originally taken?
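For reference, a rough sketch of what that last step might look like in SQL; the node names, DSNs, credentials and subscription name below are placeholders, not values from this thread:
-- On the provider (PG 9.6), assuming the pglogical extension and a replication user already exist:
SELECT pglogical.create_node(
    node_name := 'provider_node',
    dsn := 'host=provider.example.com dbname=mydb user=repl_user password=secret'
);
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- On the pre-seeded subscriber (PG 13):
SELECT pglogical.create_node(
    node_name := 'subscriber_node',
    dsn := 'host=subscriber.example.com dbname=mydb user=repl_user password=secret'
);
SELECT pglogical.create_subscription(
    subscription_name := 'migration_sub',
    provider_dsn := 'host=provider.example.com dbname=mydb user=repl_user password=secret',
    synchronize_structure := false,
    -- skip the built-in COPY because the data was already seeded from the snapshot;
    -- note that replication then starts from the slot's creation point, so changes made
    -- between the snapshot and this call are not automatically applied, which is the
    -- crux of the question above
    synchronize_data := false
);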
That's funny... I knew your name was familiar, but, I wasn't sure from where.
In my testing, it seems like pglogical isn't aware of where it needs to begin replicating when you pre-seed data (using the snapshot method referenced above). It looks like pglogical is trying to replicate data from the provider that already exists on the subscriber (as a result of the seed):
2021-07-06 20:42:35 UTC:10.180.21.178(49174):postgres@mydb:[2834]:LOG: statement: COPY "public"."core_test" ("id","name","parent_id","low_price","mid_price","high_price","order","in_style_survey","display_name") FROM stdin
2021-07-06 20:42:36 UTC:10.180.21.178(49174):postgres@mydb:[2834]:ERROR: duplicate key value violates unique constraint "core_test_pkey"
2021-07-06 20:42:36 UTC:10.180.21.178(49174):postgres@mydb:[2834]:DETAIL: Key (id)=(90) already exists.
Am I missing something? Is it actually possible to pre-seed the subscriber with data and then ask pglogical to pick up replication in the right place?
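For anyone debugging the same thing, here are a few read-only queries that show what pglogical thinks the per-table sync state is; the subscription name is a placeholder, and pglogical.local_sync_status is an internal bookkeeping table, so treat its contents as informational only:
-- Overall subscription state (e.g. initializing vs. replicating):
SELECT * FROM pglogical.show_subscription_status('migration_sub');
-- Sync state for a single table in the subscription:
SELECT * FROM pglogical.show_subscription_table('migration_sub', 'public.core_test');
-- Internal per-table sync bookkeeping on the subscriber:
SELECT * FROM pglogical.local_sync_status;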
@joshuabaird I have a similar issue with Google Cloud SQL - I am trying to migrate a 2TB database and the initial sync is incredibly slow. I too am seeing slow network throughput of around 2 MB/s between two medium-spec Cloud SQL instances.
I know I haven't provided in-depth specs, etc., but I just wondered if you resolved the issue or took the alternative approach, i.e. pre-seeding?
I overcame some of my issues by taking a slightly different approach:
- After importing the schema into the subscriber, drop all of the foreign key constraints. This allows you to manually sync each table, which provides a few benefits: it's typically faster, and it's easier/quicker to recover from unexpected errors because you don't have to resync every table if you run into problems.
- Tweak your PG config on the provider and subscriber. Specifically, look into tuning max_wal_size, checkpoint_timeout and synchronous_commit. Excessive checkpoints are very expensive.
- Create a new subscription with synchronize_data := 'false'.
- Manually resync each table (see the sketch below):
select * from pglogical.alter_subscription_resynchronize_table('subscription_name', 'table_name');
Once you have synced all tables individually, you can add the foreign keys back to the tables. You may also try dropping indexes on the subscriber and re-creating them once all tables are in sync.
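To make that workflow a bit more concrete, here is a rough sketch of the drop-constraints, create-subscription and per-table resync steps; the subscription name, DSN, credentials and schema are placeholders, and the generated DDL should be reviewed before running it:
-- Generate DROP/ADD statements for every foreign key in the public schema, so the
-- constraints can be removed before the sync and restored afterwards:
SELECT format('ALTER TABLE %I.%I DROP CONSTRAINT %I;', n.nspname, c.relname, con.conname) AS drop_sql,
       format('ALTER TABLE %I.%I ADD CONSTRAINT %I %s;', n.nspname, c.relname, con.conname,
              pg_get_constraintdef(con.oid)) AS add_sql
FROM pg_constraint con
JOIN pg_class c ON c.oid = con.conrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE con.contype = 'f' AND n.nspname = 'public';

-- Create the subscription without the built-in initial COPY:
SELECT pglogical.create_subscription(
    subscription_name := 'migration_sub',
    provider_dsn := 'host=provider.example.com dbname=mydb user=repl_user password=secret',
    synchronize_structure := false,
    synchronize_data := false
);

-- Then copy one table at a time (repeat per table):
SELECT pglogical.alter_subscription_resynchronize_table('migration_sub', 'public.core_test');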
Also - make sure your instances are not I/O bound. We have found that some tables (for some unknown reason) are very I/O intensive and slow to sync, and others are fast.
Good luck!
Appreciate all the useful info here! I'm also hitting this issue where the initial sync is incredibly slow, and I was hoping to pre-seed the database with a snapshot or direct import to start.
I found some useful posts on starting from a snapshot on RDS, but I'm unfortunately on Google Cloud SQL and haven't found an obvious way to get the LSN, and I'm wary of data loss from this approach. I was hoping that by starting with a snapshot I'd be able to simply restart replication and have it begin in the correct place, but alas I ended up on this thread.
@sparkacus - were you able to successfully pre-seed from a snapshot on CloudSQL?
@joshuabaird if you don't mind sharing, what max_wal_size etc. tuning proved most successful for you? I've been attempting to tune these parameters, but am still seeing very slow throughput and high memory utilization at initialization.