airbyte icon indicating copy to clipboard operation
airbyte copied to clipboard

Source MySQL: support `utf8_bin` charset

Open alafanechere opened this issue 3 years ago • 5 comments

Environment

  • Airbyte version: 0.35.12
  • OS Version / Instance: example macOS, Windows 7/10, Ubuntu 18.04, GCP n2. , AWS EC2
  • Source Connector and version: Mysql 0.5.1
  • Severity: Medium
  • Step where error happened: Sync job

Current Behavior

Replication of columns that are using utf8_bin charset results in base64 encoded data.

Related slack conversation

Raw data example: Screenshot 2022-01-19 183339

Expected Behavior

Syncs should replicate decoded utf8 strings.

Logs

Source table schema:

CREATE TABLE `INV_LOT_LOC_ID` (
  `organizationId` varchar(20) COLLATE utf8_bin NOT NULL,
  `warehouseId` varchar(20) COLLATE utf8_bin NOT NULL,
  `locationId` varchar(60) COLLATE utf8_bin NOT NULL DEFAULT '',
  `lotNum` varchar(10) COLLATE utf8_bin NOT NULL,
  `traceId` varchar(30) COLLATE utf8_bin NOT NULL,
  `customerId` varchar(30) COLLATE utf8_bin NOT NULL,
  `sku` varchar(50) COLLATE utf8_bin NOT NULL,
  `qty` decimal(18,8) NOT NULL,
  `qtyAllocated` decimal(18,8) NOT NULL,
  `qtyRpIn` decimal(23,8) NOT NULL,
  `qtyRpOut` decimal(23,8) NOT NULL,
  `qtyMvIn` decimal(23,8) NOT NULL,
  `qtyMvOut` decimal(23,8) NOT NULL,
  `qtyOnHold` decimal(23,8) NOT NULL,
  `onHoldLocker` int(11) NOT NULL DEFAULT '0',
  `grossWeight` decimal(18,8) NOT NULL,
  `netWeight` decimal(18,8) NOT NULL,
  `cubic` decimal(18,8) NOT NULL,
  `price` decimal(24,8) DEFAULT NULL,
  `lpn` varchar(30) COLLATE utf8_bin DEFAULT NULL,
  `qtyPa` decimal(18,8) NOT NULL DEFAULT '0.00000000',
  `qcStatus` varchar(2) COLLATE utf8_bin DEFAULT NULL,
  `lastMaintenanceDate` timestamp NULL DEFAULT NULL,
  `noteText` mediumtext COLLATE utf8_bin,
  `udf01` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `udf02` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `udf03` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `udf04` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `udf05` varchar(500) COLLATE utf8_bin DEFAULT NULL,
  `currentVersion` int(11) NOT NULL DEFAULT '100',
  `oprSeqFlag` varchar(65) COLLATE utf8_bin NOT NULL DEFAULT '2016',
  `addWho` varchar(40) COLLATE utf8_bin DEFAULT NULL,
  `addTime` timestamp NULL DEFAULT NULL,
  `editWho` varchar(40) COLLATE utf8_bin DEFAULT NULL,
  `editTime` timestamp NULL DEFAULT NULL,
  `inLocTime` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`organizationId`,`warehouseId`,`locationId`,`lotNum`,`traceId`),
  KEY `auto_shard_key_organizationId` (`organizationId`),
  KEY `IDX_INV_LOT_LOC_ID_C` (`organizationId`,`warehouseId`,`locationId`,`sku`),
  KEY `IDX_INV_LOT_LOC_ID_QQQQQ` (`organizationId`,`warehouseId`,`qty`,`qtyPa`,`qtyMvIn`,`qtyRpIn`,`qtyAllocated`),
  KEY `IDX_LOTXLOCXID_SKU` (`organizationId`,`warehouseId`,`sku`),
  KEY `I_INV_LOT_LOC_ID_A` (`organizationId`,`warehouseId`,`customerId`,`sku`),
  KEY `I_INV_LOT_LOC_ID_B` (`organizationId`,`warehouseId`,`traceId`),
  KEY `I_INV_LOT_LOC_ID_C` (`organizationId`,`warehouseId`,`qty`,`qtyMvIn`,`qtyRpIn`),
  KEY `I_INV_LOT_LOC_ID_LOT` (`organizationId`,`warehouseId`,`lotNum`),
  KEY `I_INV_LOT_LOC_ID_LOC` (`locationId`),
  KEY `I_INV_LOT_LOC_ID_SKU` (`organizationId`,`customerId`,`sku`),
  KEY `I_INV_LOT_LOC_ID_LOT2` (`organizationId`,`lotNum`),
  KEY `I_INV_LOT_LOC_ID_LO` (`lotNum`,`organizationId`),
  KEY `I_INV_LOT_LOC_ID_SCW` (`sku`,`customerId`,`warehouseId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin

Steps to Reproduce

  1. Create a MySQL table with varchar(500) COLLATE utf8_bin DEFAULT NULL
  2. Replicate the table
  3. Check the output data

alafanechere avatar Jan 31 '22 20:01 alafanechere

@tuliren suggests that we are maybe using the JDBC Field#isBinary method in a wrong way.

alafanechere avatar Jan 31 '22 20:01 alafanechere

Same here when use utf8mb4_bin

Sample table:

CREATE TABLE `store` (
  `id` bigint NOT NULL,
  `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8

quannh-uet avatar Feb 20 '22 17:02 quannh-uet

I dont think this is a bug, this is expected behaviour, we handle binary data by converting it to base64 encoded string.

subodh1810 avatar Sep 13 '22 08:09 subodh1810

@bleonard @misteryeo any thoughts on @subodh1810 's comment above?

grishick avatar Sep 13 '22 21:09 grishick

Moving to "blocked" until we decide if we should change this behavior

grishick avatar Sep 13 '22 21:09 grishick

Let's leave this for now until we get more signal on this.

misteryeo avatar Sep 14 '22 22:09 misteryeo

@subodh1810 @grishick Not sure about this one. The column type isn't BINARY, it's VARCHAR. The encoding seems like it should not influence the resulting destination table. I would think our results would map to the average client browsing this database.

I haven't used this encoding before. When I do SELECT * FROM table does it return "regular" strings or something else.

bleonard avatar Sep 30 '22 20:09 bleonard

Ref: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-charsets.html

rodireich avatar Apr 24 '23 21:04 rodireich

@alafanechere , @bleonard Can you explain what the use case her (the linked slack thread is no longer available) I'm trying to understand what the right thing to do here would be. Specifically, why would someone encode their db using anything but standard, e.g utf8_bin and why would someone choose specifically a _bin encoding rather than just utfXX ?

rodireich avatar Apr 24 '23 21:04 rodireich

I have no idea, but I would think it would be fine as long as results are ordered in a predictable order.

bleonard avatar Apr 24 '23 21:04 bleonard

@rodireich I don't think I ever got the exact reason from the user why they picked this encoding. I think the user was not managing the DB so could not tell me extra details. It looked like a niche problem at the time. Maybe sharing clearly in doc what encoding we support and how this one is specifically handled could help (as @subodh1810 suggested).

alafanechere avatar Apr 25 '23 16:04 alafanechere

Thanks for the input.

I think what we can ensure is:

  1. If a non binary encoding such as UTF-16 is used - we need to ensure that destination side looks identical to source in whatever destination encoding is. That is if you have "abcd" on source it should look like "abcd" on destination. If you have 1234 on source it should remain 1234 on destination.
  2. If a binary encoding is used such as UTF16_bin - we need to ensure that on destination side we create a base64 equivalent, so a source "abcd" will show as "YWJjZA==".

The reason for encoding binary source charsets to base64 is to ensure that whatever is on source, which may include binary data that cannot be represented by a string, is preserved.

And to clearly document what encodings are supported and what the expected behavior is

rodireich avatar Apr 25 '23 16:04 rodireich

I think I see the problem here. In this case the data is not even encoded in anything binary. It's only the collation (sorting) order that is binary. We should treat it as a normal string.

rodireich avatar Apr 26 '23 16:04 rodireich

One thing I learned: charset may be defined on a server, db, table or column level. Any combination is acceptable. e.g. table encoded with charset X with a column encoded with Y.

rodireich avatar Apr 26 '23 16:04 rodireich

We have informed technical support about this change. Baking the message in for a few days. Estimated time to release: May 18

prateekmukhedkar avatar May 10 '23 17:05 prateekmukhedkar

This is now getting released.

rodireich avatar May 24 '23 14:05 rodireich