pglogical icon indicating copy to clipboard operation
pglogical copied to clipboard

Delete Resolution: skip, then insert duplicate key

Open ujangja opened this issue 4 years ago • 7 comments

My config:

source pg9.4 postgresql94-pglogical-2.3.2-1.el6.x86_64 target pg12 postgresql12-pglogical-2.3.2-1.el7.x86_64

Have a problem only in1 table which quite busy (not so busy, its only keep delete and insert from app).

from 3 databases only on 1 database and 1 table only have this problem.

here is the log:

[199856] LOG: CONFLICT: remote DELETE on relation public.a_XXX replica identity index a_XXX_pk (tuple not found). Resolution: skip. [199856] DETAIL: remote tuple {created_by[int8]:(null) created_date[timestamp]:(null) updated_by[int8]:(null) updated_date[timestamp]:(null) device_name[varchar]:XXX5 msgid[bpchar]:12 device_or_descriptor[bpchar]:c status_data[varchar]:(null)} in xact origin=2,timestamp=2020-08-20 23:49:02.077458+08,commit_lsn=0/888F3FE0 [199856] CONTEXT: apply (unknown action) from remote relation public.a_XXX in commit before 3A7/888F3FE0, xid 930596307 committed at 2020-08-20 23:49:02.077458+08 (action #2) from node replorigin 2 [199856] ERROR: duplicate key value violates unique constraint "a_XXX_pk" [199856] DETAIL: Key (device_name, msgid, device_or_descriptor)=(XXX5, 12, c) already exists. [199856] CONTEXT: apply INSERT from remote relation public.a_XXX in commit before 3A7/888F4150, xid 930596308 committed at 2020-08-20 23:49:02.078221+08 (action #2) from node replorigin 2 [199856] LOG: apply worker [199856] at slot 3 generation 3627 exiting with error [20868] LOG: background worker "pglogical apply 16384:3307933473" (PID 199856) exited with exit code 1 [200183] LOG: starting apply for subscription sub_XXX_01 [200183] ERROR: duplicate key value violates unique constraint "a_XXX_pk" [200183] DETAIL: Key (device_name, msgid, device_or_descriptor)=(XXX5, 12, c) already exists. [200183] CONTEXT: apply INSERT from remote relation public.a_XXX in commit before 3A7/888F4150, xid 930596308 committed at 2020-08-20 23:49:02.078221+08 (action #2) from node replorigin 2 [200183] LOG: apply worker [200183] at slot 3 generation 3628 exiting with error [20868] LOG: background worker "pglogical apply 16384:3307933473" (PID 200183) exited with exit code 1

have set pglogical.conflict_log_level='debug5' but the verbosity didn't change.

am I missing something?

ujangja avatar Aug 20 '20 16:08 ujangja

I checked collation on database level and table & index columns , those are exactly identical which is en_US.UTF-8.

Reproduce using data copy from original table with its own replication set and subscriber, found that sometime delete skipped sometime delete was correct.

using psql by querying the records in the target, perfectly returned the data.

trying to check the hidden chars by several funxtions: convert, decode, md5 between source and target of the problem PK, the result exactly identical too.

so the question is how pglogical apply process can be different than what psql does?

ujangja avatar Aug 21 '20 07:08 ujangja

I checked collation on database level and table & index columns , those are exactly identical which is en_US.UTF-8.

Reproduce using data copy from original table with its own replication set and subscriber, found that sometime delete skipped sometime delete was correct.

using psql by querying the records in the target, perfectly returned the data.

trying to check the hidden chars by several funxtions: convert, decode, md5 between source and target of the problem PK, the result exactly identical too.

so the question is how pglogical apply process can be different than what psql does?

Did you ever resolve this issue - I'm experiencing something very similar, and found that certain key prefixes in certain circumstances seem to be problematic.

paulbandler avatar Nov 18 '20 16:11 paulbandler

I checked collation on database level and table & index columns , those are exactly identical which is en_US.UTF-8. Reproduce using data copy from original table with its own replication set and subscriber, found that sometime delete skipped sometime delete was correct. using psql by querying the records in the target, perfectly returned the data. trying to check the hidden chars by several funxtions: convert, decode, md5 between source and target of the problem PK, the result exactly identical too. so the question is how pglogical apply process can be different than what psql does?

Did you ever resolve this issue - I'm experiencing something very similar, and found that certain key prefixes in certain circumstances seem to be problematic.

resolved by downgrade teh target -:) , it seems pglogical on postgres12 had a bit different function so I was downgraded target db to 11.9 with pglogicaal 2.3.2-1; it amazingly thrown those intermittent errors.

ujangja avatar Nov 19 '20 02:11 ujangja

Thank you very much, down-grading to PostgreSQL 11 does indeed avoid the erratic problem I experienced where some update/delete fail on the downstream system. I also tried upgrading to Postgresql 13 but the failures occur there too.

So it appears there is a bug specific to Pglogical/Postgres versions 12 and 13.

For other’s information the problems I was experiencing were also compounded by:

  • including a redundant column in a replicated tables primary key
  • using row filtering without including the column being filtered on in the primary key - the work around for this is to have a separate replication set and separate subscription for DELETE that doesn’t filter them out. The side effect of this is that ordering isn’t preserved, so your application needs to ensure it doesn’t mix DELETE and INSERT or UPDATE in the same transaction.

On 19 Nov 2020, at 02:37, ujangja [email protected] wrote:

I checked collation on database level and table & index columns , those are exactly identical which is en_US.UTF-8. Reproduce using data copy from original table with its own replication set and subscriber, found that sometime delete skipped sometime delete was correct. using psql by querying the records in the target, perfectly returned the data. trying to check the hidden chars by several funxtions: convert, decode, md5 between source and target of the problem PK, the result exactly identical too. so the question is how pglogical apply process can be different than what psql does?

Did you ever resolve this issue - I'm experiencing something very similar, and found that certain key prefixes in certain circumstances seem to be problematic.

resolved by downgrade teh target -:) , it seems pglogical on postgres12 had a bit different function so I was downgraded target db to 11.9 with pglogicaal 2.3.2-1; it amazingly thrown those intermittent errors.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/2ndQuadrant/pglogical/issues/276#issuecomment-730087602, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACIFRG4Y6F6MJN72S7RS3LTSQSAF3ANCNFSM4QGJVIGA.

paulbandler avatar Nov 19 '20 21:11 paulbandler

We had similar issue. We found out that if we recreate the primary key on different columns excluding the text columns then replicating to PG 12 could work.

carolguo-dd avatar Sep 17 '21 03:09 carolguo-dd

This seems to be fixed in the commit f777b8bf49d3cf6f6e2ed0bf5447af34598d2ab6. Postgres 12 (postgres/postgres@5e1963fb764e9cc092e0f7b58b28985c311431d9) includes support for nondeterministic collations that requires modifications in pglogical code. pglogical 2.3.4 includes a fix for it.

eulerto avatar Sep 24 '21 13:09 eulerto

We are using AWS RDS that is on pglogical 2.3.2. Haven't tested if it works on pglogical 2.3.4

carolguo-dd avatar Nov 11 '21 20:11 carolguo-dd