mysql2sqlite icon indicating copy to clipboard operation
mysql2sqlite copied to clipboard

Some practice about blob and Chinese character.

Open crupest opened this issue 5 years ago • 26 comments

I open this issue to talk about my practice about blob and Chinese character handling during the migration. I'm not quite familiar with the character handling in mysql and this migration script but I do meet some problem when migration. I think this message might help others who meet the same question like me.

I run the dump with --compact --skip-extended-insert but the text column, which contains Chinese characters and emoji, in the generated dump is garbled and migration fails. The character set of mysql server is utf8mb4. Then I figure out it and try with additional --hex-blob --default-character-set=utf8mb4. Then the text column works right. However the blob column then become 0xaaaaaaa..., which is very long and sqlite can't read them. Then I manually replace them with x'aaaaaa...', which is the blob literal in sqlite. Until now, everything works well and I run the migration script successfully.

Update 1

For curiosity, I run the script without --hex-blob and inspect the generated sql. The Chinese characters and emoji are not garbled but the blob seems also be translated as if it were text. So my question is the script can't deal with blob?

crupest avatar Feb 21 '20 02:02 crupest

So my question is the script can't deal with blob?

Actually you're right, mysql2sqlite doesn't have any special measures to work with blobs in a guaranteed correct way.

I think your solution(s) is something we could adopt in this script (e.g. --hex-blob came out of several other discussions in this tracker lately). Would you be willing to make a PR (changing README.md as well as mysql2sqlite)?

dumblob avatar Feb 21 '20 08:02 dumblob

Sorry, I'm not quite familiar with shell script so I can't edit the script. But I do can help update the README.md to help more people if you want.

crupest avatar Feb 21 '20 10:02 crupest

Ok, let's do at least the README :wink:. Btw. you wrote

However the blob column then become 0xaaaaaaa..., which is very long and sqlite can't read them.

What does it mean sqlite can't read them? What's the exact error message and return code (echo $?)?

dumblob avatar Feb 21 '20 10:02 dumblob

ubuntu@VM-0-7-ubuntu:~/mysql2sqlite/mysql2sqlite-master$ ./mysql2sqlite timeline_mysql2.sql | sqlite3 temp.db
memory
90: WARN Hex number trimmed (length longer than 16 chars).
91: WARN Hex number trimmed (length longer than 16 chars).

Sorry for being unclear. It seems it's error message of mysql2sqlite. It trims the extra content beyond 16 chars. I'm not quite sure.

crupest avatar Feb 21 '20 10:02 crupest

It trims the extra content beyond 16 chars.

It does. That's because it thinks it's a number and sqlite supports numbers just up to 16 significant digits.

Could you provide the SQL code you're using as input for mysql2sqlite? I'd try to do some changes to avoid this script to think it's a number if it's actually a binary blob...

dumblob avatar Feb 21 '20 11:02 dumblob

Yes. I only provide what I think is valuable. If you need any other info, please tell me.

INSERT INTO `user_avatars` VALUES (1,0x89504E470D0A1A0A000...E44AE426082,'image/png','KeYEM97d7nWjNC5aRndxoGimlvQ=','2019-12-04 22:06:27.885391',2);

This is the insert line in sql dump file. And the second column is just a blob. In which the ... is a very long sequence consists of [0-9A-F] (aka, hex digit). And my suggestion is just to remove the prefix 0x and use single quote ' to surround them and add a prefix x or X. Then it becomes x'89504E470D0A1A0A000...E44AE426082', which is legal sqlite blob literal as the docs says:

BLOB literals are string literals containing hexadecimal data and preceded by a single "x" or "X" character. Example: X'53514C697465'

BTW, as far as I'm concerned, you can distinguish blob and text by the fact that text is still string like 'hello' but blob is hex number. But I don't know the exact behavior in case that other format might also be used by the dump program.

crupest avatar Feb 21 '20 12:02 crupest

I've looked at sqlite and mysql and mariadb docs and actually it shall be doable like follows:

  1. remove completely the check for too long hexadecimal integers (and blindly trust the input as sqlite won't produce any warning if the integer got converted to float - see below)
  2. change readme mysqldump command to always use --hex-blob, but state, that even without --hex-blob everything should work quite well (--hex-blob affects BINARY, VARBINARY, the BLOB types, and BIT while this script already converts e.g. BIT to an integer which will be further needed only if --hex-blob won't be used)
  3. convert 0xABCD to x'ABCD' and 0xABC to x'0ABC' (see http://download.nust.na/pub6/mysql/doc/refman/5.1/en/hexadecimal-values.html for odd number of hex chars)

I think we can sacrifice (1) because if we wouldn't do that, we would need to make this script way smarter that it is currently by keeping track which tables have which types in which columns and then at each insert check for that. That's a lot of work with negligible gain (sqlite will fully automatically, but irreversibly, convert too big integers to floats and treat them like floats disregarding what the column type says though there are some smart conversions to integer back for such auto-converted floats-wanna-be-ints if utterly needed).

I won't have time for this now, so if anyone wants to contribute, feel free to do it :wink: (I think it's suitable for newcomers as well).

dumblob avatar Feb 21 '20 16:02 dumblob

I think one of the key points is whether the mysql dump will present values of other type in the form of hex number, aka 0xaaaa. If it doesn't, then it is safe to convert all of them into hex blob literal aka x'aaaa'. Because under my tests, sqlite will not convert hex blob literal to number automatically and it will be saved as blob even if the column is of type INTEGER, 'REALorNUMERIC`. So if only blob is presented as hex number by mysql dump, it will be much easy to write the script, or values of other type will be messed up unless we detect the column type in a smarter way.

@dumblob If you think it's ok to convert all hex number into hex blob literal, I can learn the shell script and awk and try to make a pull request. (I think it will not be that difficult and it is good to learn a new skill.😁) Or if you have other thoughts, don't hesitate to tell me!

crupest avatar Feb 22 '20 10:02 crupest

I think one of the key points is whether the mysql dump will present values of other type in the form of hex number, aka 0xaaaa.

Well, I have hopes that mysqldump will never produce hexadecimal output unless explicitly forced to either by --hex-blob or other explicit option (which I don't know of, but it might exist) fully independent from the underlying DB schema and/or DB settings.

Could you prove this (in)valid by looking at the mysqldump source code of MariaDB or by asking on MariaDB mailing lists? I'm currently out of time (sorry for my late responses).

If you think it's ok to convert all hex number into hex blob literal

I actually don't think it's OK, but I still believe there is a chance that mysqldump won't produce any hexadecimal notation unless explicitly told to - but let's wait for the source code analysis and/or feedback from MariaDB community as outlined above.

dumblob avatar Feb 25 '20 17:02 dumblob

I've sent a post about this problem at MySQL forum here waiting for reply.

crupest avatar Feb 27 '20 16:02 crupest

I just stumbled over this problem, and I was really relieved to find this thread.
Thanks to dumblob's post, I fixed the problem locally in my repo, by replacing the block for INSERT statements by the following code :

# print all INSERT lines
( /^ *\(/ && /\) *[,;] *$/ ) || /^(INSERT|insert|REPLACE|replace)/ {
  prev = ""

  # first replace \\ by \_ that mysqldump never generates to deal with
  # sequnces like \\n that should be translated into \n, not \<LF>.
  # After we convert all escapes we replace \_ by backslashes.
  gsub( /\\\\/, "\\_" )

  # empty strings are delimited by double quote rather than single quotes
  gsub( /''/, "\"\"" )
  # single quotes are escaped by another single quote
  gsub( /\\'/, "''" )
  gsub( /\\n/, "\n" )
  gsub( /\\r/, "\r" )
  gsub( /\\"/, "\"" )
  gsub( /\\\032/, "\032" )  # substitute char

  gsub( /\\_/, "\\" )

  # get content of VALUES statement
  if( match( $0, /VALUES \((.*)\);$/ ) ) {
    valuesStart = RSTART
    remaining = substr($0, RSTART+8, RLENGTH-10 )
    
    # get each individual value
    hexIssue = 0
    insValues = ""
    while( match( remaining, /(^[^',]+(,|$))|(^'([^']|'')+'(,|$))/ ) ) {
      if( RLENGTH == length(remaining) ) {
        # it is the last value, so we don't add a comma after
        valLen = RLENGTH
        comma = ""
      }
      else {
        valLen = RLENGTH-1
        comma = ","
      }

      # now we have the value that will be inserted!
      val = substr(remaining, RSTART, valLen)

      # truncate remaining to keep only unprocessed part
      remaining = substr(remaining, RSTART+RLENGTH)

      # convert hex number to hex string if needed
      if( match( val, /^0x[0-9a-fA-F]+$/ ) ) {
        hexIssue = 1
        hexStart = RSTART+2
        hexLen = RLENGTH-2
        zeroFiller = hexLen % 2 == 1 ? "0" : ""
        val = "x'" zeroFiller substr( val, hexStart, hexLen ) "'"
      }

      # concat the modified value to the previous ones
      insValues = insValues val comma
    }
    # replace VALUES statement with modified statement
    $0 = substr( $0, 0, valuesStart-1 ) "VALUES (" insValues ");" 
    
    if( hexIssue ){
      printerr( \
        NR ": WARN Hex number converted to hex string (see mysqldump --hex-blob option)." )
    }
  }

  print
  next
}

I am not much knowledgeable about awk, and even less about Sqlite format, so I won't propose a PR, but maybe this could help someone later on.

Thanks for your work guys ! mysql2sqlite rocks 👍

Edit 21/03/2020: I was a bit too naive when proposing this patch the first time ... I had problems with base64 encoded values and other stuff. Sorry. I updated the code above to be more robust.

vdechef avatar Mar 19 '20 20:03 vdechef

@vdechef thanks for implementing it! Before I'll merge it, I'd definitely like to know the answer to the question @crupest posted to MySQL forum.

dumblob avatar Mar 19 '20 20:03 dumblob

@dumblob while waiting for experts to answer his question, I checked the code of mysqldump to see if there are other cases than blobs that use hexadecimal notation.
I got the code from MariaDB repository : https://github.com/MariaDB/server/blob/10.5/client/mysqldump.c

I found 2 usages for "0x" :

  • blobs when using --hex-blob option (at line 4007 and line 4083)
  • options values for SELECT...INTO statement (through calls to add_load_option at line 3825)

As far as I understand SELECT...INTO documentation and mysqldump code, and given that the code to parse blobs only apply to INSERT and REPLACE statements, I guess we can safely assume that there won't be hexadecimal fields inside of the VALUES (...) part of those statements except for the blobs.

Please don't be mistaken, my goal is not to hurry you to merge the patch (I have a local copy in my repo that works for my use cases), I only want to help.

vdechef avatar Mar 21 '20 14:03 vdechef

I've taken a look at your updated patch and first it surprised me a lot, but after short investigation of what mysqldump actually dumps I can't see any simpler robust solution.

I think you can definitely make a PR out of it (you made a working patch, so you should get the credit for it). In that case we'd need some "unit" tests (I still couldn't find the time to consolidate and make them work - as commit policy), then I'd add a comment like # as of 2020-03-21 mysqldump --hex-blob is the sole way to get hexadecimal syntax in the whole INSERT/REPLACE output except for substrings of e.g. base64 etc. and last we also need to test it with more awk versions (e.g. macOS awk, FreeBSD awk, GNU awk).

dumblob avatar Mar 21 '20 15:03 dumblob

However, I don't think we can get the answer only by search 0x. Is it possible that someone insert some value as hex and the dump just copy the input so there is hex in the output even if there is no explicit 0x in code?

That's definitely true, but the goal of this script is to be fully compatible just with mysqldump with a specific set of arguments - i.e. without any postprocessing.

On the other hand I don't want anyone to get an sqlite DB without knowing our conversion script did something potentially very dangerous - therefore the clear message for each line where this conversion happened.

I'll look into the code later when I get spare time.

Always welcome - keep us posted.

dumblob avatar Mar 21 '20 15:03 dumblob

@dumblob In fact, I was talking about the mysqldump code and whether it will generate hex in other situations.😅 But I was just afraid my opinion was kind of immature and would be wrong, so I deleted it hastily and I can't restore it. Sorry.😅 It's lucky that you have made a reference copy.

Here is some findings of mysqldump

At line 3973

is_blob= (opt_hex_blob && field->charsetnr == 63 &&
                  (field->type == MYSQL_TYPE_BIT ||
                   field->type == MYSQL_TYPE_STRING ||
                   field->type == MYSQL_TYPE_VAR_STRING ||
                   field->type == MYSQL_TYPE_VARCHAR ||
                   field->type == MYSQL_TYPE_BLOB ||
                   field->type == MYSQL_TYPE_LONG_BLOB ||
                   field->type == MYSQL_TYPE_MEDIUM_BLOB ||
                   field->type == MYSQL_TYPE_TINY_BLOB ||
                   field->type == MYSQL_TYPE_GEOMETRY)) ? 1 : 0;

and at line 4062 !(field->flags & NUM_FLAG). If both is true, then at line 4081, hex blob is written.

Data are handled between approximately 4062 and 4113.

I'll look into how other types of data are dealt with tomorrow!

Best regards!

crupest avatar Mar 21 '20 16:03 crupest

@crupest You are right, I also though about data being inserted as hex by someone, then dumped by mysqldump. But my guess was that those values would be dumped as decimal integers, or hex strings, because I don't see any kind of hex formating in the printf() calls done by mysqldump. The patch I made only cares about INSERT VALUES (xxxxxx) statements, and convert only hex values that are present in xxxxxx statements, ignoring the strings inside xxxxxx.

This leads to 2 remarks :

  • we will need to check if blobs can be written into mysqldump output in other locations than insert/replace statements
  • we will have to continue your work on verifying each type of value that may be dumped as hex, and this can be painful :(

vdechef avatar Mar 21 '20 17:03 vdechef

I read the codes between line 4062 and 4113 and here is my conclusion about how mysqldump columns:

  • For non-number, if is_blob is true and hex-blob is specified, hex for row[i] is ouput, otherwise call unescape on row[i] to output.
  • For number, if it is NaN or infinite, NULL is output, if it is decimal, row[i] surrounded by ' is output, otherwise row[i] is output.

What is row[i]? At line 3940 row= mysql_fetch_row(res) and mysql_fetch_row is a C library function in mysql, which returns MYSQL_ROW. You can find info about it here. It is an byte string array with each element containing the column data. However I couldn't find the format in which different types of data is stored in the byte string. One of the problems I thought is that if user use hex to insert a number, will mysqldump output a hex in the byte string?

So I did a simple test. Under mysql 8.0.19 on Window x64, I insert a int with hex and run mysqldump. It ouputs non-hex.

That's all I've found.

Best regards.

crupest avatar Mar 22 '20 14:03 crupest

I did some research in MySQL data types documentation, and also some tests to validate that what I understood from documentation was right.
Here are my findings.

Available data types in MySQL :

  • Numeric
    • INTEGER, INT, SMALLINT, TINYINT, MEDIUMINT, BIGINT : exported as integers, in base 10
    • DECIMAL: exported as numbers, in base 10
    • FLOAT, DOUBLE : exported as numbers, in base 10
    • BIT : exported as BLOB
    • => all numeric values are exported in base 10 by mysql and mysqldump. The only exception is BIT, that is exported as BLOB, so everything is OK
  • Date and time
    • existing types : DATE, TIME, DATETIME, TIMESTAMP, and YEAR
    • all are inserted as string, or as number like 20200322 for 22th of march 2020 (hex values are not accepted)
    • the only exception is YEAR, that can be inserted as hex. But as explained above mysql will consider it internally as a numeric value and lose details about the format it was inserted with
    • => no risk of hex value starting with 0x, so it is OK
  • String
    • BINARY, VARBINARY, BLOB : stored and exported as BLOB
    • CHAR, VARCHAR, TEXT, ENUM, and SET : stored and exported as string
    • NB: using charset_binary for text strings is equivalent to using BINARY/VARBINARY/BLOB
    • => only 2 options for export (string and blob) so this is OK
  • Spatial
    • stored and exported as BLOB
    • => this is OK
  • JSON
    • stored in a binary format internally by MySQL
    • inserted as string : INSERT INTO t1 VALUES('{"key1": "value1", "key2": "value2"}');
    • => could not find any reference to JSON in mysqldump, but given that it is inserted as text it is OK

@crupest The numbers provided by user as hex will be stored in binary format internally by mysql. So the format in which it was inserted will not be kept, and we don't need to worry about this. Once a value is stored as numeric in database, it won't be dumped as hexadecimal but as number in base 10

vdechef avatar Mar 22 '20 17:03 vdechef

Some thoughts about corner cases :

  • what about extended inserts options ? It will break mysql2sqlite script, shouldn't we warn about it ?
  • what about INSERT DELAYED ? It seems like this is not supported by mysqldump, but I am not sure about it

vdechef avatar Mar 22 '20 20:03 vdechef

@vdechef one more thing - currently the script supports conversion of BIT to a number. According to what you found out, we can remove this conversion from this script as all BITs will be output only as hex, right?

dumblob avatar Mar 23 '20 10:03 dumblob

@dumblob As long as you use --hex-blob, yes, it will be output as hex.

vdechef avatar Mar 23 '20 10:03 vdechef

what about extended inserts options ? It will break mysql2sqlite script, shouldn't we warn about it ?

Such cases should be solved by disabling them by an argument to mysqldump if possible and without side effects. Thoughts?

what about INSERT DELAYED? It seems like this is not supported by mysqldump, but I am not sure about it

Same as above - could we disable that with an argument if mysqldump suports that?

As long as you use --hex-blob, yes, it will be output as hex.

Thinking about it further, I'd leave it in the script if someone really can't or doesn't want to use the --hex-blob option, then BITs will get converted. But we should probably write to the source code, that it's not recommended and one should use --hex-blob.

dumblob avatar Mar 23 '20 10:03 dumblob

Thinking about it further, I'd leave it in the script if someone really can't or doesn't want to use the --hex-blob option, then BITs will get converted. But we should probably write to the source code, that it's not recommended and one should use --hex-blob.

In fact, I misunderstood your point : I didn't realize that mysql2sqlite convert BIT to integer not only in the INSERT statement, but also in the table CREATE statement. So either mysql2sqlite has to convert BIT to BLOB in the CREATE statement, or it needs to convert the dumped value from blob (hex or string) to integer.

IMO the second solution is not good, because it is more complex to implement, and it does not reflect the BIT type correctly (BIT is meant to be a bit field, not an integer).

If we use BLOB for BIT, user can always call hex() to get a printable representation of the field value :

SELECT hex(col_bit) FROM mytable;

vdechef avatar Mar 23 '20 12:03 vdechef

what about extended inserts options ? It will break mysql2sqlite script, shouldn't we warn about it ?

Such cases should be solved by disabling them by an argument to mysqldump if possible and without side effects. Thoughts?

Seems legit :)
So mysql2sqlite will have to force usage of --skip-extended-insert option of mysqldump

what about INSERT DELAYED? It seems like this is not supported by mysqldump, but I am not sure about it

Same as above - could we disable that with an argument if mysqldump suports that?

Same as above: mysql2sqlite will have to forbid usage of --delayed option of mysqldump

vdechef avatar Mar 23 '20 12:03 vdechef

In fact, I misunderstood your point : I didn't realize that mysql2sqlite convert BIT to integer not only in the INSERT statement, but also in the table CREATE statement. So either mysql2sqlite has to convert BIT to BLOB in the CREATE statement, or it needs to convert the dumped value from blob (hex or string) to integer.

IMO the second solution is not good, because it is more complex to implement, and it does not reflect the BIT type correctly (BIT is meant to be a bit field, not an integer).

Let's do that. Please add comment, that we treat BIT fields as hex instead of integer, because it doesn't have the flaw of forced truncation (without truncation sqlite would silently convert the integer to float internally for storage and thus lose the bit information) and it's easy to use as blob anyway.

Seems legit :) So mysql2sqlite will have to force usage of --skip-extended-insert option of mysqldump

Same as above: mysql2sqlite will have to forbid usage of --delayed option of mysqldump

I'm glad you agree. This is also exactly the way it's been until now - require specific set of mysqldump arguments else the scenario is not supported.

dumblob avatar Mar 23 '20 14:03 dumblob