
Adding documentation regarding the importance of a single time source…

Open gilesw opened this issue 10 years ago • 4 comments

… on conflict handlers

see https://github.com/2ndQuadrant/bdr/issues/109

gilesw avatar Jul 10 '15 10:07 gilesw

Hi,

On 2015-07-10 03:14:16 -0700, gilesw wrote:

  • Adding documentation regarding the importance of a single time source on conflict handlers
  • If you have a conflict when the time on two nodes is out of sync, the conflict may never be resolved, because the last update time will never match even after the handler has run. This will manifest itself as row updates only syncing in one direction.

That shouldn't actually happen. Out-of-sync clocks should result in the "wrong row" winning, but the conflict should nevertheless be resolved.
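To illustrate the point above, here is a minimal sketch of a last-update-wins rule under clock skew. It is not BDR's implementation: the `Version` class, the `last_update_wins` function, the node-name tie-break, and the timestamps are all hypothetical and chosen only for illustration. It shows a skewed clock letting the older row win, while both nodes still pick the same row.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Version:
    """One node's version of a row (hypothetical model, not a BDR structure)."""
    node: str
    value: str
    true_time: datetime    # when the update really happened
    clock_skew: timedelta  # how far this node's clock is off

    @property
    def commit_time(self) -> datetime:
        # The timestamp the node records, as seen through its skewed clock.
        return self.true_time + self.clock_skew


def last_update_wins(local: Version, remote: Version) -> Version:
    """Keep the version with the later recorded commit time.

    Ties are broken by node name (an assumption of this sketch) so that
    every node reaches the same decision.
    """
    return max(local, remote, key=lambda v: (v.commit_time, v.node))


t0 = datetime(2015, 7, 8, 15, 0, tzinfo=timezone.utc)

# Node "a" updates first, but its clock runs 45 minutes fast.
older = Version("a", "old", t0, timedelta(minutes=45))
# Node "b" updates 10 minutes later with a correct clock.
newer = Version("b", "new", t0 + timedelta(minutes=10), timedelta(0))

# Each node applies the same rule to the same pair of versions.
winner_on_a = last_update_wins(local=older, remote=newer)
winner_on_b = last_update_wins(local=newer, remote=older)

print(winner_on_a.value, winner_on_b.value)
```

In this sketch both nodes keep the row from node "a" even though it is the older update, so the result is wrong but consistent. One-directional syncing, as reported in this issue, would point to something beyond this simple rule.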

anarazel avatar Jul 10 '15 10:07 anarazel

Hi anarazel,

I've corrected the time source now, but the steps I used to create the conflict were:

For an update/update conflict, I powered down node a and updated the row on node b. Then I powered down node b, updated the row on node a, and powered node b back on.

conflict_id              | 860
local_node_sysid         | 6166345561721046825
local_conflict_xid       | 4990
local_conflict_lsn       | 0/CA06DD98
local_conflict_time      | 2015-07-08 16:30:29.276713+00
object_schema            | public
object_name              | table
remote_node_sysid        | 6166334043667378995
remote_txid              | 3114
remote_commit_time       | 2015-07-08 15:45:44.999168+00
remote_commit_lsn        | 1/4104FF98
conflict_type            | update_update
conflict_resolution      | last_update_wins_keep_local
local_tuple              | {"table_id":1452776,"last_update_id":"xxx","password":"obs","username":"final7","acc_id":1,"last_update_time":"2015-07-08T16:16:34.854137+00:00","make_public":false}
remote_tuple             | {"table_id":1452776,"last_update_id":"xxx","password":"obs","username":"final8","acc_id":1,"last_update_time":"2015-07-08T15:45:44.994954+00:00","make_public":false}
local_tuple_xmin         | 4988
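As a reading aid for the record above, the following sketch compares the `last_update_time` values stored in the two tuples. Only the two relevant fields are copied from the record. Note that `last_update_time` is an application column, and the thread does not say whether it matches the timestamp BDR itself compared, so treat the result as a hint rather than an explanation of the resolution.

```python
import json
from datetime import datetime

# Trimmed copies of local_tuple and remote_tuple from the conflict record.
local_tuple = json.loads(
    '{"username": "final7", "last_update_time": "2015-07-08T16:16:34.854137+00:00"}'
)
remote_tuple = json.loads(
    '{"username": "final8", "last_update_time": "2015-07-08T15:45:44.994954+00:00"}'
)


def updated_at(row: dict) -> datetime:
    """Parse the application-level last_update_time column."""
    return datetime.fromisoformat(row["last_update_time"])


# Positive delta means the local row carries the later application timestamp.
delta = updated_at(local_tuple) - updated_at(remote_tuple)
later = "local" if delta.total_seconds() > 0 else "remote"

print(f"{later} row is later by {delta}")
```

By this column the local row is roughly 31 minutes newer than the remote row, and the record shows the local row being kept (`last_update_wins_keep_local`).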

It had me stumped for a good while, which is why I submitted the issue for the doc update. As soon as the time source was corrected, though, syncing was bi-directional again.

I did try to do some more diagnosis by clearing out the conflict history so I could log each step, but I got into an infinite conflict loop. If you delete the conflict history on each node, you actually generate a delete/delete conflict. Do you want me to submit this as a bug, or is there a purge function that I'm missing?

gilesw avatar Jul 10 '15 10:07 gilesw

If you delete the conflict history on each node, you actually generate a delete/delete conflict

Hm. We don't replicate inserts into the conflict history table from the conflict tracking code, but maybe we don't filter out subsequent SQL-level update/delete on the table? I'll need to check. Creating new bug.

ringerc avatar Mar 18 '16 02:03 ringerc

It sounds like we need to reproduce this and fix the underlying bug with desynchronized time causing failure to resolve.

@gilesw Can you supply a more detailed set of steps to reproduce this? BDR setup commands, DDL, and the SQL run on each node to create the issue?

ringerc avatar Mar 18 '16 02:03 ringerc