airbyte
airbyte copied to clipboard
Source MySQL: support `utf8_bin` charset
Environment
- Airbyte version: 0.35.12
- OS Version / Instance: example macOS, Windows 7/10, Ubuntu 18.04, GCP n2. , AWS EC2
- Source Connector and version: Mysql 0.5.1
- Severity: Medium
- Step where error happened: Sync job
Current Behavior
Replication of columns that are using utf8_bin charset results in base64 encoded data.
Raw data example:

Expected Behavior
Syncs should replicate decoded utf8 strings.
Logs
Source table schema:
CREATE TABLE `INV_LOT_LOC_ID` (
`organizationId` varchar(20) COLLATE utf8_bin NOT NULL,
`warehouseId` varchar(20) COLLATE utf8_bin NOT NULL,
`locationId` varchar(60) COLLATE utf8_bin NOT NULL DEFAULT '',
`lotNum` varchar(10) COLLATE utf8_bin NOT NULL,
`traceId` varchar(30) COLLATE utf8_bin NOT NULL,
`customerId` varchar(30) COLLATE utf8_bin NOT NULL,
`sku` varchar(50) COLLATE utf8_bin NOT NULL,
`qty` decimal(18,8) NOT NULL,
`qtyAllocated` decimal(18,8) NOT NULL,
`qtyRpIn` decimal(23,8) NOT NULL,
`qtyRpOut` decimal(23,8) NOT NULL,
`qtyMvIn` decimal(23,8) NOT NULL,
`qtyMvOut` decimal(23,8) NOT NULL,
`qtyOnHold` decimal(23,8) NOT NULL,
`onHoldLocker` int(11) NOT NULL DEFAULT '0',
`grossWeight` decimal(18,8) NOT NULL,
`netWeight` decimal(18,8) NOT NULL,
`cubic` decimal(18,8) NOT NULL,
`price` decimal(24,8) DEFAULT NULL,
`lpn` varchar(30) COLLATE utf8_bin DEFAULT NULL,
`qtyPa` decimal(18,8) NOT NULL DEFAULT '0.00000000',
`qcStatus` varchar(2) COLLATE utf8_bin DEFAULT NULL,
`lastMaintenanceDate` timestamp NULL DEFAULT NULL,
`noteText` mediumtext COLLATE utf8_bin,
`udf01` varchar(500) COLLATE utf8_bin DEFAULT NULL,
`udf02` varchar(500) COLLATE utf8_bin DEFAULT NULL,
`udf03` varchar(500) COLLATE utf8_bin DEFAULT NULL,
`udf04` varchar(500) COLLATE utf8_bin DEFAULT NULL,
`udf05` varchar(500) COLLATE utf8_bin DEFAULT NULL,
`currentVersion` int(11) NOT NULL DEFAULT '100',
`oprSeqFlag` varchar(65) COLLATE utf8_bin NOT NULL DEFAULT '2016',
`addWho` varchar(40) COLLATE utf8_bin DEFAULT NULL,
`addTime` timestamp NULL DEFAULT NULL,
`editWho` varchar(40) COLLATE utf8_bin DEFAULT NULL,
`editTime` timestamp NULL DEFAULT NULL,
`inLocTime` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`organizationId`,`warehouseId`,`locationId`,`lotNum`,`traceId`),
KEY `auto_shard_key_organizationId` (`organizationId`),
KEY `IDX_INV_LOT_LOC_ID_C` (`organizationId`,`warehouseId`,`locationId`,`sku`),
KEY `IDX_INV_LOT_LOC_ID_QQQQQ` (`organizationId`,`warehouseId`,`qty`,`qtyPa`,`qtyMvIn`,`qtyRpIn`,`qtyAllocated`),
KEY `IDX_LOTXLOCXID_SKU` (`organizationId`,`warehouseId`,`sku`),
KEY `I_INV_LOT_LOC_ID_A` (`organizationId`,`warehouseId`,`customerId`,`sku`),
KEY `I_INV_LOT_LOC_ID_B` (`organizationId`,`warehouseId`,`traceId`),
KEY `I_INV_LOT_LOC_ID_C` (`organizationId`,`warehouseId`,`qty`,`qtyMvIn`,`qtyRpIn`),
KEY `I_INV_LOT_LOC_ID_LOT` (`organizationId`,`warehouseId`,`lotNum`),
KEY `I_INV_LOT_LOC_ID_LOC` (`locationId`),
KEY `I_INV_LOT_LOC_ID_SKU` (`organizationId`,`customerId`,`sku`),
KEY `I_INV_LOT_LOC_ID_LOT2` (`organizationId`,`lotNum`),
KEY `I_INV_LOT_LOC_ID_LO` (`lotNum`,`organizationId`),
KEY `I_INV_LOT_LOC_ID_SCW` (`sku`,`customerId`,`warehouseId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
Steps to Reproduce
- Create a MySQL table with
varchar(500) COLLATE utf8_bin DEFAULT NULL - Replicate the table
- Check the output data
@tuliren suggests that we are maybe using the JDBC Field#isBinary method in a wrong way.
Same here when use utf8mb4_bin
Sample table:
CREATE TABLE `store` (
`id` bigint NOT NULL,
`name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL,
PRIMARY KEY (`id`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8
I dont think this is a bug, this is expected behaviour, we handle binary data by converting it to base64 encoded string.
@bleonard @misteryeo any thoughts on @subodh1810 's comment above?
Moving to "blocked" until we decide if we should change this behavior
Let's leave this for now until we get more signal on this.
@subodh1810 @grishick Not sure about this one. The column type isn't BINARY, it's VARCHAR. The encoding seems like it should not influence the resulting destination table. I would think our results would map to the average client browsing this database.
I haven't used this encoding before. When I do SELECT * FROM table does it return "regular" strings or something else.
Ref: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-charsets.html
@alafanechere , @bleonard Can you explain what the use case her (the linked slack thread is no longer available) I'm trying to understand what the right thing to do here would be. Specifically, why would someone encode their db using anything but standard, e.g utf8_bin and why would someone choose specifically a _bin encoding rather than just utfXX ?
I have no idea, but I would think it would be fine as long as results are ordered in a predictable order.
@rodireich I don't think I ever got the exact reason from the user why they picked this encoding. I think the user was not managing the DB so could not tell me extra details. It looked like a niche problem at the time. Maybe sharing clearly in doc what encoding we support and how this one is specifically handled could help (as @subodh1810 suggested).
Thanks for the input.
I think what we can ensure is:
- If a non binary encoding such as UTF-16 is used - we need to ensure that destination side looks identical to source in whatever destination encoding is. That is if you have "abcd" on source it should look like "abcd" on destination. If you have 1234 on source it should remain 1234 on destination.
- If a binary encoding is used such as UTF16_bin - we need to ensure that on destination side we create a base64 equivalent, so a source "abcd" will show as "YWJjZA==".
The reason for encoding binary source charsets to base64 is to ensure that whatever is on source, which may include binary data that cannot be represented by a string, is preserved.
And to clearly document what encodings are supported and what the expected behavior is
I think I see the problem here. In this case the data is not even encoded in anything binary. It's only the collation (sorting) order that is binary. We should treat it as a normal string.
One thing I learned: charset may be defined on a server, db, table or column level. Any combination is acceptable. e.g. table encoded with charset X with a column encoded with Y.
We have informed technical support about this change. Baking the message in for a few days. Estimated time to release: May 18
This is now getting released.