mycli icon indicating copy to clipboard operation
mycli copied to clipboard

Encoding issue with some UUIDs

Open leehagoodjames opened this issue 3 years ago • 6 comments

I am using mycli to export a MySQL database to a CSV. A small portion of UUIDs are being malformed, which entirely breaks our use case.

Repro steps

Within MySQL shell (using mysql -uroot -ppassword -h127.0.0.1)

CREATE DATABASE IF NOT EXISTS encoding_fun;
USE encoding_fun;
CREATE TABLE temp (
    id  BINARY(16) PRIMARY KEY
);
INSERT INTO temp (id) VALUES ( 0xE4A82330EF764A4F9E66168B94DE85C9);
INSERT INTO temp (id) VALUES ( 0x446709595B4447CFB62E772B5A32535C); # This is the problematic id
INSERT INTO temp (id) VALUES ( 0xB697075A243211ECB1B7000D3A843826);

Display the results within the mysql shell:

mysql> SELECT id, BIN_TO_UUID(id) FROM temp;
+------------------------------------+--------------------------------------+
| id                                 | BIN_TO_UUID(id)                      |
+------------------------------------+--------------------------------------+
| 0x446709595B4447CFB62E772B5A32535C | 44670959-5b44-47cf-b62e-772b5a32535c |  // This is the problematic id
| 0xB697075A243211ECB1B7000D3A843826 | b697075a-2432-11ec-b1b7-000d3a843826 |
| 0xE4A82330EF764A4F9E66168B94DE85C9 | e4a82330-ef76-4a4f-9e66-168b94de85c9 |
+------------------------------------+--------------------------------------+

Using mycli to export data, we can see one id is malformed`.

> mycli -uroot -ppassword -h127.0.0.1 --database encoding_fun --csv -e "SELECT * FROM temp"
id
Dg	Y[DG϶.w+Z2S\                            // The mistake is this first row
0xb697075a243211ecb1b7000d3a843826
0xe4a82330ef764a4f9e66168b94de85c9

Notes / versions / etc

> mycli --version
Version: 1.24.1
> locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

leehagoodjames avatar Oct 29 '21 16:10 leehagoodjames

Hi!

Encodings are tricky, but what's clear here is that the id value in question is being rendered as UTF-8 text.

You should be able to simplify and narrow things down by doing the SELECT on a literal:

mycli -uroot -ppassword --database encoding_fun --csv -e 'SELECT 0x446709595B4447CFB62E772B5A32535C'

MySQL also has something called "introducers" which tell it how to interpret the literal:

mycli -uroot -ppassword --database encoding_fun --csv -e 'SELECT _utf8mb4 0x446709595B4447CFB62E772B5A32535C'
mycli -uroot -ppassword --database encoding_fun --csv -e 'SELECT _binary 0x446709595B4447CFB62E772B5A32535C'

Prediction: the _binary introducer will make the issue disappear on the literal.

Similarly, SELECTing BINARY(id) should make the issue disappear when querying the table.

A couple of places to look for clues: compare the output of show variables like "%char%" in both the mycli and mysql clients. And look for default-character-set in ~/.my.cnf.

rolandwalker avatar Oct 29 '21 17:10 rolandwalker

Hi @rolandwalker, yes they are!

Here are the results from the SELECTs that you mentioned:

➜ mycli -uroot -ppassword -h127.0.0.1 --database encoding_fun --csv -e 'SELECT 0x446709595B4447CFB62E772B5A32535C'
"0x446709595B4447CFB62E772B5A32535C"
"Dg	Y[DG϶.w+Z2S\"

➜ mycli -uroot -ppassword -h127.0.0.1 --database encoding_fun --csv -e 'SELECT _utf8mb4 0x446709595B4447CFB62E772B5A32535C'
"_utf8mb4 0x446709595B4447CFB62E772B5A32535C"
"Dg	Y[DG϶.w+Z2S\"

➜ mycli -uroot -ppassword -h127.0.0.1 --database encoding_fun --csv -e 'SELECT _binary 0x446709595B4447CFB62E772B5A32535C'
"_binary 0x446709595B4447CFB62E772B5A32535C"
"Dg	Y[DG϶.w+Z2S\"

Additionally show variables like %char% in mycli shell

> show variables like "%char%"
+--------------------------+--------------------------------+
| Variable_name            | Value                          |
+--------------------------+--------------------------------+
| character_set_client     | utf8mb3                        |
| character_set_connection | utf8mb3                        |
| character_set_database   | utf8mb4                        |
| character_set_filesystem | binary                         |
| character_set_results    | utf8mb3                        |
| character_set_server     | utf8mb4                        |
| character_set_system     | utf8mb3                        |
| character_sets_dir       | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+

And show variables like %char% in mysql shell

mysql> show variables like "%char%";
+--------------------------+--------------------------------+
| Variable_name            | Value                          |
+--------------------------+--------------------------------+
| character_set_client     | utf8mb4                        |
| character_set_connection | utf8mb4                        |
| character_set_database   | utf8mb4                        |
| character_set_filesystem | binary                         |
| character_set_results    | utf8mb4                        |
| character_set_server     | utf8mb4                        |
| character_set_system     | utf8mb3                        |
| character_sets_dir       | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
8 rows in set (0.01 sec)

I updated mycli variables to match mysql, so now show variables like %char% in mycli shell yields:

> show variables like "%char%";
+--------------------------+--------------------------------+
| Variable_name            | Value                          |
+--------------------------+--------------------------------+
| character_set_client     | utf8mb4                        |
| character_set_connection | utf8mb4                        |
| character_set_database   | utf8mb4                        |
| character_set_filesystem | binary                         |
| character_set_results    | utf8mb4                        |
| character_set_server     | utf8mb4                        |
| character_set_system     | utf8mb3                        |
| character_sets_dir       | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+

However, even after these changes that id is still not represented well. See

 root@localhost:encoding_fun> SELECT * FROM temp;
+------------------------------------+
| id                                 |
+------------------------------------+
| Dg    Y[DG϶.w+Z2S\                 |
| 0xb697075a243211ecb1b7000d3a843826 |
| 0xe4a82330ef764a4f9e66168b94de85c9 |
+------------------------------------+

Also, BINARY(id) does not seem to have an effect. See:

n> SELECT BINARY(id) FROM temp;
+------------------------------------+
| BINARY(id)                         |
+------------------------------------+
| Dg    Y[DG϶.w+Z2S\                 |
| 0xb697075a243211ecb1b7000d3a843826 |
| 0xe4a82330ef764a4f9e66168b94de85c9 |
+------------------------------------+

I have a similar solution to BINARY(id), which is to use BIN_TO_UUID(id), but this is unfit for my use case which is intended to work for many tables and to be agnostic to their schemas. See:

> SELECT BIN_TO_UUID(id) FROM temp;
+--------------------------------------+
| BIN_TO_UUID(id)                      |
+--------------------------------------+
| 44670959-5b44-47cf-b62e-772b5a32535c |
| b697075a-2432-11ec-b1b7-000d3a843826 |
| e4a82330-ef76-4a4f-9e66-168b94de85c9 |
+--------------------------------------+

Per ~/.my.cnf, I do not have that file locally and am using default configs.

Are you able to reproduce this locally?

leehagoodjames avatar Oct 29 '21 21:10 leehagoodjames

Informative.

Sadly, I am not able to reproduce. But your results with the literal convince me that mycli is at least part of the problem.

rolandwalker avatar Oct 30 '21 13:10 rolandwalker

@rolandwalker would you please share what happens when you run these? Perhaps that could help identify what is different between our setups

leehagoodjames avatar Nov 03 '21 19:11 leehagoodjames

@rolandwalker would you please share what happens when you run these steps? If you are getting something different, it may help to isolate the issue.

leehagoodjames avatar Dec 13 '21 16:12 leehagoodjames

Update - I have tracked the issue to Python. The encoding issue occurs when Python performs the conversion from the hex value to bytes

>>> bytes.fromhex('E4A82330EF764A4F9E66168B94DE85C9')
b'\xe4\xa8#0\xefvJO\x9ef\x16\x8b\x94\xde\x85\xc9'
>>> bytes.fromhex('446709595B4447CFB62E772B5A32535C')
b'Dg\tY[DG\xcf\xb6.w+Z2S\\'

leehagoodjames avatar Dec 13 '21 17:12 leehagoodjames