
Invalid string for charset `utf8mb4`

aschnappauf opened this issue 10 months ago • 12 comments

Feel free to adjust the title to something more meaningful

I've got a table in the following form:

CREATE TABLE Products(
  id INTEGER PRIMARY KEY,
  name TEXT
);

We forcefully inserted strings with the wrong encoding (latin1) into this table, so we get the following output:

SELECT id, HEX(name) FROM Products;

/*
+----+------------------+
| id |     HEX(name)    |
+----+------------------+
| 1  | 446F6C744C6162AE |  => 'DoltLab®' encoded in latin1
+----+------------------+
*/

Querying

A regular SELECT does not work (unsurprisingly):

SELECT name FROM Products;

/* invalid string for charset utf8mb4: '[44 111 108 116 76 97 98 174]' */

Conversion sort of works, but not as intended:

SELECT CONVERT(name USING latin1) FROM Products;

/*
+----------------------------+
| CONVERT(name USING latin1) |
+----------------------------+
| DoltLab?                   |
+----------------------------+
*/

Converting to binary behaves roughly as expected:

SELECT CONVERT(name USING binary) FROM Products;

/*
+----------------------------+
| CONVERT(name USING binary) |
+----------------------------+
| DoltLab�                  |
+----------------------------+
*/

Forcing binary interpretation before converting to latin1 does not improve results:

SELECT CONVERT(CONVERT(name USING binary) USING latin1) FROM Products;

/*
+--------------------------------------------------+
| CONVERT(CONVERT(name USING binary) USING latin1) |
+--------------------------------------------------+
| DoltLab?                                         |
+--------------------------------------------------+
*/

A plain CAST refuses to do anything:

SELECT CAST(name AS CHAR) FROM Products;

/*
+--------------------+
| CAST(name AS CHAR) |
+--------------------+
| NULL               |
+--------------------+
*/

Casting to BINARY yields the same result as CONVERT(name USING binary). Converting a binary-cast string also yields the same result as the CONVERT-CONVERT strategy.
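
For reference, the two queries described here (not spelled out above) would presumably be:

SELECT CAST(name AS BINARY) FROM Products;

SELECT CONVERT(CAST(name AS BINARY) USING latin1) FROM Products;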

Replacing the faulty bytes also does not work.

SELECT REPLACE(CAST(name AS BINARY), x'AE', x'2B') FROM Products;

/* invalid string for charset utf8mb4: '[44 111 108 116 76 97 98 174]' */

Version

Tested using Dolt 1.49.3

aschnappauf avatar Feb 26 '25 14:02 aschnappauf

I was finally able to replicate the issue here, and I think I've found a way to fix it.

First, we should convert the column to a BLOB, as the conversion does not verify the integrity of each string.

ALTER TABLE Products MODIFY name BLOB;

From here, we are free to modify the strings, so we can replace the bad string with the appropriate string for utf8mb4 encoding.

UPDATE Products SET name = UNHEX('446F6C744C6162C2AE') WHERE name = UNHEX('446F6C744C6162AE');

Once the strings in question have been replaced, we'll convert back to our TEXT type.

ALTER TABLE Products MODIFY name TEXT;

Our standard SELECT statement now works as expected:

SELECT name FROM Products;
/*
+----------+
|   name   |
+----------+
| DoltLab® |
+----------+
*/

Let me know if this resolves the issue for you!

Hydrocharged avatar Feb 27 '25 13:02 Hydrocharged

It's worth mentioning that we don't have to use UNHEX for the UPDATE, but it makes it much clearer which values are being modified (especially since the column holds binary data while in BLOB form).
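
For example, the same UPDATE written with hex literals instead of UNHEX (same values, just a different spelling of the binary data) would presumably be:

UPDATE Products SET name = x'446F6C744C6162C2AE' WHERE name = x'446F6C744C6162AE';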

Also, if converting the column back to its standard TEXT type throws additional errors (since all strings are verified as valid during the conversion), you can easily visualize the different strings using a query such as:

SELECT CONVERT(name USING utf8mb4), HEX(name) FROM Products;

Any strings with the placeholder can be modified through an UPDATE.
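
For instance, using the row from this issue (a sketch assuming the bad row is id 1):

UPDATE Products SET name = UNHEX('446F6C744C6162C2AE') WHERE id = 1;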

Hydrocharged avatar Feb 27 '25 13:02 Hydrocharged

For me, the ALTER TABLE throws an error for invalid strings; apparently it does check integrity.

aschnappauf avatar Feb 28 '25 11:02 aschnappauf

I wonder if there's some other invalid state, since I made sure it would work on my repro beforehand. You're specifically referring to this ALTER, correct?

ALTER TABLE Products MODIFY name BLOB;

I doubt this will work, but we should still try it for the sake of thoroughness. What if we use VARBINARY instead, so it's:

ALTER TABLE Products MODIFY name VARBINARY(16000);

If this still does not work, would you mind pushing your repository to DoltHub? Or some sanitized subset that may remove any private information, but still exhibits the problem? It would help with debugging immensely.

Hydrocharged avatar Feb 28 '25 23:02 Hydrocharged

@aschnappauf , thanks for using Dolt! We'd love to learn more about your use case. Feel free to email me or swing by our Discord if you want to share.

bpf120 avatar Mar 04 '25 21:03 bpf120

> I wonder if there's some other invalid state, since I made sure it would work on my repro beforehand. You're specifically referring to this ALTER, correct?
>
> ALTER TABLE Products MODIFY name BLOB;
>
> I doubt this will work, but we should still try it for the sake of thoroughness. What if we use VARBINARY instead, so it's:
>
> ALTER TABLE Products MODIFY name VARBINARY(16000);
>
> If this still does not work, would you mind pushing your repository to DoltHub? Or some sanitized subset that may remove any private information, but still exhibits the problem? It would help with debugging immensely.

The VARBINARY still throws an error. I also noticed that I actually have multiple such columns; maybe during ALTER TABLE all other columns are checked for correctness?

I will try to generate a sanitized subset; I don't know if I can both strip protected information and keep the erroneous behavior.

aschnappauf avatar Mar 07 '25 10:03 aschnappauf

Having multiple columns may make a difference, since I was only testing with one column containing invalid data. I'll try this out and see.

Hydrocharged avatar Mar 10 '25 14:03 Hydrocharged

I've been able to confirm that having multiple columns with bad data prevents the solution I posted above from working. In simplified terms, when we modify a column, rather than validating just the changed column, we validate the entire row. The validation happens on a common code path that tries to prevent any kind of invalid data from being written (which somehow didn't catch the original data insertion), so the fix isn't as straightforward as disabling row validation for ALTER COLUMN specifically.

I'm looking into the best way forward for you to be able to fix your invalid rows. I really appreciate your patience as we work through this.

Hydrocharged avatar Mar 11 '25 13:03 Hydrocharged

If a solution comes with significant complexity, I can also re-export the data from our old database. Not optimal, but it needs to be done at some point anyway. However, it may be beneficial for anyone else who runs into a similar issue to have some way to resolve it.

One possible way would be the following:

  1. Convert all affected columns to hex and insert separator to identify bytes
    UPDATE table SET col = REGEXP_REPLACE(HEX(col), '(..)', '$1,'); 
    
    (It took me quite some time to notice that capture groups are referenced with $n and not \n, because MariaDB uses the latter and the Dolt docs don't mention it.) The inserted commas handle cases where adjacent bytes happen to produce a hex sequence that looks like a wrong byte (e.g. "1AE2" contains the substring "AE", but it doesn't contain the byte 0xAE).
  2. Replace all problematic characters (here '®' from latin1 'AE' to utf8mb4 'C2AE')
    UPDATE table SET col = REPLACE(col, 'AE,', 'C2,AE,');
    
  3. Remove the commas and return to string
    UPDATE table SET col = UNHEX(REPLACE(col, ',', ''));
    

Quite involved, but it does work, even with multiple erroneous columns. Unfortunately, one needs to know every wrong character in the dataset and look up and replace its correct encoding separately (so step two may have to be executed multiple times).
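
Applied to the Products table from this issue, the three steps would look roughly like this (a sketch, assuming 0xAE is the only wrong byte and the data contains no already-correct C2AE sequences):

UPDATE Products SET name = REGEXP_REPLACE(HEX(name), '(..)', '$1,');
UPDATE Products SET name = REPLACE(name, 'AE,', 'C2,AE,');
UPDATE Products SET name = UNHEX(REPLACE(name, ',', ''));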

aschnappauf avatar Mar 12 '25 09:03 aschnappauf

Thank you for sharing your workaround!

With the column conversion via ALTER TABLE, one would still need to know all of the wrong characters, so your workaround should be functionally equivalent for most cases. So for you, it seems like re-exporting would be the best solution. I'd love to dig into how this erroneous insertion occurred in the first place, as the best solution would be to prevent these errors from happening, but that's easier said than done since we have nothing to really go off of (and my attempts at tracing potential code paths did not bear any fruit).

Regarding how we handle capture groups, this is actually a quirk of our regex library. MariaDB uses PCRE, while MySQL uses ICU. To mirror MySQL, we added a way to use ICU from Go, so that we could use it for Dolt, hence the divergence between the two.
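
As a concrete illustration, the byte-separator pattern from your workaround with the $n-style reference that Dolt's ICU-backed regex expects:

SELECT REGEXP_REPLACE('1AE2', '(..)', '$1,');

/* expected result: '1A,E2,' */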

Hydrocharged avatar Mar 12 '25 12:03 Hydrocharged

I used an internal tool that connects to the MySQL server and pushes data to it. Apparently, this happened without defining a character encoding during connection establishment, and Dolt seemingly does not verify/convert incoming data in that case. If it matters, I used the FireDAC library and libmysql.dll from the MySQL website. When specifying CharacterEncoding=utf8mb4 on connect, everything works fine. I assume the strings are then converted on the client side beforehand.
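
For anyone hitting something similar, it may be worth checking what character set the session actually negotiated (assuming Dolt exposes the usual MySQL session variables and supports SET NAMES):

SHOW VARIABLES LIKE 'character_set%';

SET NAMES utf8mb4;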

aschnappauf avatar Mar 14 '25 09:03 aschnappauf

Interesting, thank you for the information. I'll dig into this a bit more and see what I can find. It may take a while, as a few other high-priority tasks have come up recently that I'm working through, but I'll keep you updated on this issue when I have something!

Hydrocharged avatar Mar 17 '25 20:03 Hydrocharged