mycli
mycli copied to clipboard
Encoding issue with some UUIDs
I am using mycli
to export a MySQL database to a CSV. A small portion of UUIDs are being malformed, which entirely breaks our use case.
Repro steps
Within MySQL shell (using mysql -uroot -ppassword -h127.0.0.1
)
CREATE DATABASE IF NOT EXISTS encoding_fun;
USE encoding_fun;
CREATE TABLE temp (
id BINARY(16) PRIMARY KEY
);
INSERT INTO temp (id) VALUES ( 0xE4A82330EF764A4F9E66168B94DE85C9);
INSERT INTO temp (id) VALUES ( 0x446709595B4447CFB62E772B5A32535C); # This is the problematic id
INSERT INTO temp (id) VALUES ( 0xB697075A243211ECB1B7000D3A843826);
Display the results within the mysql
shell:
mysql> SELECT id, BIN_TO_UUID(id) FROM temp;
+------------------------------------+--------------------------------------+
| id | BIN_TO_UUID(id) |
+------------------------------------+--------------------------------------+
| 0x446709595B4447CFB62E772B5A32535C | 44670959-5b44-47cf-b62e-772b5a32535c | // This is the problematic id
| 0xB697075A243211ECB1B7000D3A843826 | b697075a-2432-11ec-b1b7-000d3a843826 |
| 0xE4A82330EF764A4F9E66168B94DE85C9 | e4a82330-ef76-4a4f-9e66-168b94de85c9 |
+------------------------------------+--------------------------------------+
Using mycli
to export data, we can see one id
is malformed`.
> mycli -uroot -ppassword -h127.0.0.1 --database encoding_fun --csv -e "SELECT * FROM temp"
id
Dg Y[DG϶.w+Z2S\ // The mistake is this first row
0xb697075a243211ecb1b7000d3a843826
0xe4a82330ef764a4f9e66168b94de85c9
Notes / versions / etc
> mycli --version
Version: 1.24.1
> locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Hi!
Encodings are tricky, but what's clear here is that the id value in question is being rendered as UTF-8 text.
You should be able to simplify and narrow things down by doing the SELECT
on a literal:
mycli -uroot -ppassword --database encoding_fun --csv -e 'SELECT 0x446709595B4447CFB62E772B5A32535C'
MySQL also has something called "introducers" which tell it how to interpret the literal:
mycli -uroot -ppassword --database encoding_fun --csv -e 'SELECT _utf8mb4 0x446709595B4447CFB62E772B5A32535C'
mycli -uroot -ppassword --database encoding_fun --csv -e 'SELECT _binary 0x446709595B4447CFB62E772B5A32535C'
Prediction: the _binary
introducer will make the issue disappear on the literal.
Similarly, SELECTing BINARY(id)
should make the issue disappear when querying the table.
A couple of places to look for clues: compare the output of show variables like "%char%"
in both the mycli and mysql clients. And look for default-character-set
in ~/.my.cnf
.
Hi @rolandwalker, yes they are!
Here are the results from the SELECTs
that you mentioned:
➜ mycli -uroot -ppassword -h127.0.0.1 --database encoding_fun --csv -e 'SELECT 0x446709595B4447CFB62E772B5A32535C'
"0x446709595B4447CFB62E772B5A32535C"
"Dg Y[DG϶.w+Z2S\"
➜ mycli -uroot -ppassword -h127.0.0.1 --database encoding_fun --csv -e 'SELECT _utf8mb4 0x446709595B4447CFB62E772B5A32535C'
"_utf8mb4 0x446709595B4447CFB62E772B5A32535C"
"Dg Y[DG϶.w+Z2S\"
➜ mycli -uroot -ppassword -h127.0.0.1 --database encoding_fun --csv -e 'SELECT _binary 0x446709595B4447CFB62E772B5A32535C'
"_binary 0x446709595B4447CFB62E772B5A32535C"
"Dg Y[DG϶.w+Z2S\"
Additionally show variables like %char%
in mycli
shell
> show variables like "%char%"
+--------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------+
| character_set_client | utf8mb3 |
| character_set_connection | utf8mb3 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb3 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
And show variables like %char%
in mysql
shell
mysql> show variables like "%char%";
+--------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
8 rows in set (0.01 sec)
I updated mycli
variables to match mysql
, so now show variables like %char%
in mycli
shell yields:
> show variables like "%char%";
+--------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /usr/share/mysql-8.0/charsets/ |
+--------------------------+--------------------------------+
However, even after these changes that id
is still not represented well. See
root@localhost:encoding_fun> SELECT * FROM temp;
+------------------------------------+
| id |
+------------------------------------+
| Dg Y[DG϶.w+Z2S\ |
| 0xb697075a243211ecb1b7000d3a843826 |
| 0xe4a82330ef764a4f9e66168b94de85c9 |
+------------------------------------+
Also, BINARY(id)
does not seem to have an effect. See:
n> SELECT BINARY(id) FROM temp;
+------------------------------------+
| BINARY(id) |
+------------------------------------+
| Dg Y[DG϶.w+Z2S\ |
| 0xb697075a243211ecb1b7000d3a843826 |
| 0xe4a82330ef764a4f9e66168b94de85c9 |
+------------------------------------+
I have a similar solution to BINARY(id)
, which is to use BIN_TO_UUID(id)
, but this is unfit for my use case which is intended to work for many tables and to be agnostic to their schemas. See:
> SELECT BIN_TO_UUID(id) FROM temp;
+--------------------------------------+
| BIN_TO_UUID(id) |
+--------------------------------------+
| 44670959-5b44-47cf-b62e-772b5a32535c |
| b697075a-2432-11ec-b1b7-000d3a843826 |
| e4a82330-ef76-4a4f-9e66-168b94de85c9 |
+--------------------------------------+
Per ~/.my.cnf
, I do not have that file locally and am using default configs.
Are you able to reproduce this locally?
Informative.
Sadly, I am not able to reproduce. But your results with the literal convince me that mycli is at least part of the problem.
@rolandwalker would you please share what happens when you run these? Perhaps that could help identify what is different between our setups
@rolandwalker would you please share what happens when you run these steps? If you are getting something different, it may help to isolate the issue.
Update - I have tracked the issue to Python. The encoding issue occurs when Python performs the conversion from the hex value to bytes
>>> bytes.fromhex('E4A82330EF764A4F9E66168B94DE85C9')
b'\xe4\xa8#0\xefvJO\x9ef\x16\x8b\x94\xde\x85\xc9'
>>> bytes.fromhex('446709595B4447CFB62E772B5A32535C')
b'Dg\tY[DG\xcf\xb6.w+Z2S\\'