kitchen_sync MySQL CI -> PostgreSQL collation-dependent varchar primary keys result in 'Duplicate key value violates unique constraint'

Syncing MySQL to PostgreSQL with Kitchen Sync 2.10.

The first sync is successful, on the consequent sync right after the first one (i.e. syncing the same data from the source) get the following error:

starting course_overviews_courseoverview
Error in the 'to' worker: ERROR:  duplicate key value violates unique constraint "course_overviews_courseoverview_pkey"
DETAIL:  Key (id)=(course-v1:xxx+xx2020+2020_T1) already exists.
INSERT INTO "db"."course_overviews_courseoverview" ("created", "modified", "version", "id", "_location", "display_name", "display_number_with_default", "display_org_with_default", "start"...
Error in the 'from' worker: Connection closed
Kitchen Syncing failed.

Apr 07 '21 17:04 easeev

Could you please paste in the full CREATE TABLE statements for both source and destination databases, and also check what character set and collation is default for both source and destination databases?

Apr 13 '21 09:04 willbryant

Also what command-line options are you using? I wouldn't expect it to try and put a table into the "db" schema by default.

Apr 13 '21 09:04 willbryant

Could you please paste in the full CREATE TABLE statements for both source and destination databases, and also check what character set and collation is default for both source and destination databases?

Hi

I'm working with easeev and can explain the issue.

Create table statement

MySQL:

CREATE TABLE `course_overviews_courseoverview` (
  `created` datetime(6) NOT NULL,
  `modified` datetime(6) NOT NULL,
  `version` int(11) NOT NULL,
  `id` varchar(255) NOT NULL,
  ... skipped a lot of fields ...
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

PostgreSQL (from KS log):

CREATE TABLE "course_overviews_courseoverview" (
  "created" timestamp without time zone NOT NULL,
  "modified" timestamp without time zone NOT NULL,
  "version" integer NOT NULL,
  "id" character varying(255) NOT NULL,
  ... skipped a lot of fields ...
  PRIMARY KEY("id"))

In both cases charset is UTF8 and id contains only ASCII chars.

Destination schema name in fact is not "db". I think Eugene named it "db" for shorthand.

P.S. As a workaround I'm calling "drop schema .. cascade; create schema ...;" before KS.

Sep 08 '21 13:09 EBCEEB

OK. Could you please check the collation as well as the character set? By default mysql's will be case insensitive and postgresql's will be case-sensitive, but you can change that default at database and/or table and/or column level.

Unfortunately KS does not know about these collations (because they're not portable) so the KS output won't show them to you.

Sep 09 '21 01:09 willbryant

MySQL:

| character_set_database | utf8            |
| collation_database     | utf8_general_ci |

Create table statement (see above) specifies only charset - utf8.

PostgreSQL database has charset UTF8 and collation/ctype en_US.UTF8. Table is created by KS without specific charset/collation (see above).

Sep 09 '21 06:09 EBCEEB

I've checked source (MySQL) table - there is no id's similar to the problematic one.

Sep 09 '21 06:09 EBCEEB

Weird. That's my main theory gone. I'll try and reproduce.

If you run with -v, does it seem like it errors straight away when it does that table, or are there several successful key ranges processed first?

Sep 09 '21 08:09 willbryant

Sorry, can't check. I've added "drop schema; create schema" on the target DB (PG) and and KS runs successfully.

I.e. sync succeeds if there is no tables/data in target DB and KS creates tables and fills it with data, but fails if data already there.

Sep 09 '21 10:09 EBCEEB

Sorry for the delay. I think I've finally figured out why this would happen.

The only way I know of to create this problem is if the two databases sort the primary key values differently.

For example, if MySQL sorted the rows ABCDEF but PostgreSQL sorted them ABCEDF, then if KS happens to choose key range C-E as one of the pieces of the table to sync, PostgreSQL would show that row D does not exist in that range of rows. At that point KS would retrieve it and insert it, which would fail like the error you have above.

This kind of problem is easy to create if you have different collations so in general syncing between tables with different charsets/collations in the PK is not supported.

Unfortunately, KS does not specify the charset or collation when it creates a table/column itself. I would like to add support for it, but ironically, the blocker is actually that those charset and collation names are not portable between databases, so I don't know how to go about converting them between MySQL and PostgreSQL. I could add support just when syncing to the same type of database, but that would not fix your case.

Anyway, I believe that the problem you have experienced is occurring because MySQL is defaulting to a case-insensitive collation. This means that 'A' and 'a' map to the same index in the collation order.

PostgreSQL on the other hand, defaults to case-sensitive collations. Therefore, 'A' and 'a' will have different indexes in the collation order.

This is enough to create the ABCDEF/ABCEFD problem I described above because MySQL will effectively move rows with one case to the order they would have if they had the other case.

In this case, I think the workaround is probably pretty simple, you could either change the source table's collation (for that one column, or the whole table if you like) to a case-sensitive collation. If you have no desire to treat keys with the same English characters but different case as the same key, I would suggest this is probably best anyway.

Or, you could do the opposite, and change the destination table's collation to a case-insensitive collation. It's unusual to see those in the PostgreSQL world, but they do exist... if you are using PG 12+ (before that we had to use citext, which I don't think I do anything intelligent with in KS yet - I will think about adding a special case for them).

What version of PostgreSQL are you using?

Sep 19 '21 00:09 willbryant

What version of PostgreSQL are you using?

We're using PostgreSQL 12 at destination

Sep 20 '21 14:09 EBCEEB

I had a bit of a go at making a PG12 collation that would match MySQL's, but I haven't found it. Creating a case-insensitive collation is easy enough but they don't sort the symbols into the same place, so that won't necessarily work for you. It can probably be done through the myriad ICU options, but I don't know how.

If you aren't specifically trying to support case-insensitivity with your primary key, my suggestion is that you to use a case-sensitive collation at the MySQL end for these key fields. This will take away the unusual sorting behavior, and then it should match PostgreSQL.

So if you are happy with that solution, you could try ALTER TABLE course_overviews_courseoverview CHANGE COLUMN id id varchar(255) NOT NULL COLLATE utf8_bin.

I guess I need to think about adding a special case in KS to detect the case when syncing tables with character primary keys and different collations at the two ends, since we can't reliably divide them into ranges consistently at the two ends. Ultimately I'd just have to clear and reload those tables, so no more efficient than what you're doing, but KS is meant to work out of the box even if there are a few tables that don't sync efficiently.

Sep 26 '21 03:09 willbryant

kitchen_sync kitchen_sync copied to clipboard

MySQL CI -> PostgreSQL collation-dependent varchar primary keys result in 'Duplicate key value violates unique constraint'

kitchen_sync
kitchen_sync copied to clipboard