pgsync icon indicating copy to clipboard operation
pgsync copied to clipboard

Skip duplicate upserts.

Open mmontagna opened this issue 3 years ago • 10 comments

When using the overwrite mode, pgsync does a blind upsert which ends up creating duplicate/new row versions for ALL rows involved in the sync regardless of whether or not any of the values in those rows have changed.

This is largely fine/functionally correct, but it's not ideal as it places additional write/transaction load on the receiving database which isn't strictly needed.

It might make sense to just make this behavior the default, but I haven't fully thought that through.

Any thoughts on the idea?

mmontagna avatar Sep 14 '21 00:09 mmontagna

Hey @mmontagna, thanks for the PR! It's an interesting idea. Have you tried benchmarking it to see how much impact it has on performance?

ankane avatar Sep 16 '21 01:09 ankane

@ankane I haven't run anything too scientific. I am happy to run more benchmarks, just let me know what you want to see. I'll outline roughly what I've done so far below.

When doing a sync from local postgres to local postgres it drops the time from ~8 seconds to overwrite a ~30MB table with 130k rows (~50 columns and 7 indices) from ~8 to ~2 seconds.

When doing remote to local syncs it doesn't matter that much since the time is largely dominated by network transfer speeds. All my tests here have come out a wash but I expect for users with faster internet connections there will likely be an improvement.

I've run some tests/monitored the target machines when syncing 2GB+ tables and while it doesn't save very much, if any immediate resource cost. It does prevent accumulating vacuum workload when running frequent syncs, which is really the motivating factor here.

eg

marcomontagna@Marco-C02D603CMD6R pgsync % time bundle exec pgsync something_locations --overwrite --from postgres://localhost/dev --to postgres://localhost/other_dev --overwrite-only-changed
From: dev on localhost:5432
To: other_dev on localhost:5432
⠇ something_locations
⠦ something_locations
✔ something_locations - 1.6s
Completed in 2.0s
bundle exec pgsync something_locations --overwrite --from  --to    0.67s user 0.25s system 38% cpu 2.395 total
marcomontagna@Marco-C02D603CMD6R pgsync % time bundle exec pgsync something_locations --overwrite --from postgres://localhost/dev --to postgres://localhost/other_dev
From: dev on localhost:5432
To: other_dev on localhost:5432
⠇ something_locations
⠹ something_locations
✔ something_locations - 8.4s
Completed in 8.9s
bundle exec pgsync something_locations --overwrite --from  --to   2.86s user 0.50s system 36% cpu 9.315 total

mmontagna avatar Sep 16 '21 02:09 mmontagna

Great, just tried it with pgbench locally and also see a significant improvement.

pgbench -i -s 10 pgsync_bench1
pgbench -i -s 10 pgsync_bench2
pgsync pgbench_accounts --from pgsync_bench1 --to pgsync_bench2 --overwrite

Latest version: ~21s This PR: ~8s

This shouldn't change the result of the sync, so let's add it without an option.

ankane avatar Sep 16 '21 06:09 ankane

Alright. I've updated the PR to add it without an option.

mmontagna avatar Sep 16 '21 16:09 mmontagna

Don't merge this yet. There might be a breakage here around json columns

mmontagna avatar Sep 16 '21 20:09 mmontagna

Pushed a commit to fix that case.

mmontagna avatar Sep 16 '21 22:09 mmontagna

Good catch. I think it'll be an issue with any column types that don't support the equality operator (including custom types). I also think there may be an issue with NULL values with the current query.

Due to this, I'm not sure it's worth the extra complexity to support.

ankane avatar Sep 20 '21 19:09 ankane

Fair enough. I'm going to keep fiddling with this since it has the potential to vastly improve the performance/speed of some of our use-cases, but I won't expect you to merge anything upstream.

You're right about the null issue, that won't effect correctness, I think, but does mean this is writing more than it needs to.

mmontagna avatar Sep 20 '21 23:09 mmontagna

Sounds good. fwiw, I think the null issue will affect correctness/cause it to sync less rows than desired, so you'll probably want to test different combinations of null in the source and destination.

ankane avatar Sep 21 '21 20:09 ankane

is there any update for this? :)

Natgho avatar May 28 '24 13:05 Natgho