gh-ost gh-ost fails to appply binlog event for latin1 character set with Error 1366

We have a table with latin1 character set. We insert a row with multibyte data into the table using latin1 connection, while gh-ost runs.Then gh-ost fails with Error 1366 Incorrect string value: '\xEF\xBF\xBD' for column 'pad' at row 1.

Executed command

gh-ost --chunk-size=1000 --max-lag-millis=30000 --host='127.0.0.1' --port=3307 --user=test --password=test --database='test' --table='sbtest3' --alter="ENGINE=InnoDB;" --allow-on-master  --cut-over=atomic --assume-rbr --serve-socket-file=/tmp/59e1b122-a587-4e45-a692-ea5d2380d1e9.sock -skip-foreign-key-checks -debug --execute 1>/tmp/59e1b122-a587-4e45-a692-ea5d2380d1e9.log 2>&1 &

While gh-ost running, we insert a row into table.

set names latin1; insert into sbtest3 values(1, '1', '1,', '威');

Then gh-ost applying binlog into _gho table , error 1366 happens.

 Error 1366: Incorrect string value: '\xEF\xBF\xBD' for column 'pad' at row 1; query=
                        replace /* gh-ost `test`.`_sbtest3_gho` */ into
                                `test`.`_sbtest3_gho`
                                        (`id`, `k`, `c`, `pad`)
                                values
                                        (?, ?, ?, ?)

gh-ost version

gh-ost --version
1.1.4

Aug 03 '22 00:08 wangzihuacool

After analyzing and testing, I found that the problem is encoding.NewDecoder() in types.go

if encoding, ok := charsetEncodingMap[this.Charset]; ok {
			arg, _ = encoding.NewDecoder().String(s)
}

So, I added debug before and after this code to print the bytes corresponding to the characters，output below：

2022/08/03 14:44:24 ariginal column data:  威
2022/08/03 14:44:24 decoded column data:  å¨�
2022/08/03 14:44:24 bytes column data before decoded:  [229 168 129]  
2022/08/03 14:44:24 bytes column data after decoded:  [195 165 194 168 239 191 189]
2022-08-03 14:44:24 ERROR Error 1366: Incorrect string value: '\xEF\xBF\xBD' for column 'pad' at row 1; query=
                        replace /* gh-ost `test`.`_sbtest3_gho` */ into
                                `test`.`_sbtest3_gho`
                                        (`id`, `k`, `c`, `pad`)
                                values
                                        (?, ?, ?, ?)
                ; args=[1 1 1 å¨�]

As a comparison, Column data in table

mysql> select pad, hex(pad) from sbtest3 where c=1;
+-----+----------+
| pad | hex(pad) |
+-----+----------+
| 威  | E5A881   |
+-----+----------+

As we can see, bytes before decoded: [229 168 129] , that's "E5A881" bytes after decoded: [195 165 194 168 239 191 189] , that's "C3A5C2A8EFBFBD" obviously, bytes are not equal before and after Decoder().

A posible solution : Since Latin1 is a single-byte encoding, and all 256 values of 1 byte are fully occupied, theoretically, any encoded value can be stored in the Latin1 field.We convert the data of character type into bytes, and then write the single-byte characters of latin1 to the table, so there is no problem of transcoding

Aug 03 '22 07:08 wangzihuacool

:wave: @wangzihuacool thanks for the issue and PR!

Since Latin1 is a single-byte encoding, and all 256 values of 1 byte are fully occupied, theoretically, any encoded value can be stored in the Latin1 field.We convert the data of character type into bytes, and then write the single-byte characters of latin1 to the table, so there is no problem of transcoding

The problem with this approach is that it depends on undocumented MySQL behaviour that could change in a future release.

I think the underlying issue is storing multi-byte characters using single-byte encoding, which isn't what latin1 is meant to be used for, so I don't think it is a good idea to make this change to gh-ost. I think the correct solution is to either use an appropriate multi-byte characterset or varbinary (but I know it might not be straightforward to change the characterset!).

This probably isn't the answer you were hoping for, but I hope this makes sense! 🙂

Aug 08 '22 17:08 dm-2

I'm sorry, my previous description about latin1 was a bit misleading.

Actually, the main change here is to convert the column value parsed by the binlog event into binary strings, instead of utf8 encoded unicode. We know that MySQL does not encode binary strings. when the Applier connects to DB and writes binary strings to the column of the ghost table, there's no need to consider the character set. That is, the connection character set which defined in connection.go can be latin1, utf8, or any other value.

In this case, character set of the connection does not matter.

Aug 09 '22 08:08 wangzihuacool

@dm-2 yes, the underlying issue is storing multi-byte characters using single-byte encoding. It may also occur in other scenarios, such as storing three-byte characters using two-byte encoding like gbk. We can't prevent these situations in advance.

As a popular online schema migration solution, I think gh-ost can make small changes to accommodate these scenarios. For comparison, pt-ost does not involve character encoding conversion and will not cause Error 1366 Incorrect string value problem.

In addition, gh-ost uses binary string to process the value of the column corresponding to the unique key too.

insert /* gh-ost `test`.`sbtest3` */ ignore into `test`.`_sbtest3_gho` (`id`, `k`, `c`, `pad`)
(select `id`, `k`, `c`, `pad` from `test`.`sbtest3` force index (`PRIMARY`)
where (((`id` > _binary'198000')) and ((`id` < _binary'199000') or ((`id` = _binary'199000')))) lock in share mode

replace /* gh-ost `test`.`_sbtest3_gho` */ into
  `test`.`_sbtest3_gho` (`id`, `k`, `c`, `pad`)
values
  (1111122, 0, _binary'威', _binary'')

Aug 09 '22 08:08 wangzihuacool

@wangzihuacool sorry about the delay getting back to you! I've had a bit more of a think about this, and I think it makes sense for gh-ost to trust the bytes that it reads from the binlogs and insert them as bytes into the *_gho tables without doing any characterset conversion (unless the schema change being done is a characterset conversion) 👍

Sep 06 '22 10:09 dm-2

Merged #1158

Thanks @wangzihuacool for your contribution! 🙇

Sep 06 '22 10:09 dm-2

gh-ost gh-ost copied to clipboard

gh-ost fails to appply binlog event for latin1 character set with Error 1366

gh-ost
gh-ost copied to clipboard