
Failed to output Chinese emoji to MySQL

Open YunLongND opened this issue 2 years ago • 13 comments

My MySQL database uses the utf8mb4 character set, and MySQL's temporary table also uses utf8mb4. When I insert data into the MySQL temporary table, I get the following error: org.embulk.exec.PartialExecutionException: java.lang.RuntimeException: java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\xA4\xA3' for column ......

Can someone help me solve this problem? Thanks.

YunLongND avatar Apr 29 '22 09:04 YunLongND

Hello, @YunLongND

  • MySQL: 8.0.28
  • embulk: 0.9.24
  • embulk-output-mysql: 0.10.2

I could insert the following emoji into the MySQL database without any special configuration. ~~(I created the table using the mysql command)~~ I also tested without creating the table beforehand.

Could you give me more details about your environment (configuration, database version, and so on)?

CREATE TABLE `example` (
  `id` bigint DEFAULT NULL,
  `account` bigint DEFAULT NULL,
  `time` timestamp NULL DEFAULT NULL,
  `purchase` timestamp NULL DEFAULT NULL,
  `comment` text COLLATE utf8mb4_general_ci
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,🍺🍻🍣

in:
  type: file
  path_prefix: sample_
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out:
  type: mysql
  host: localhost
  mode: insert
  user: user
  password: password
  database: embulk_test
  table: example

mysql> show variables like '%char%';
+--------------------------+--------------------------------------------------------+
| Variable_name            | Value                                                  |
+--------------------------+--------------------------------------------------------+
| character_set_client     | utf8mb4                                                |
| character_set_connection | utf8mb4                                                |
| character_set_database   | utf8mb4                                                |
| character_set_filesystem | binary                                                 |
| character_set_results    | utf8mb4                                                |
| character_set_server     | utf8mb4                                                |
| character_set_system     | utf8mb3                                                |
| character_sets_dir       | /usr/local/Cellar/mysql/8.0.28_1/share/mysql/charsets/ |
+--------------------------+--------------------------------------------------------+
8 rows in set (0.00 sec)

hiroyuki-sato avatar May 06 '22 13:05 hiroyuki-sato

@hiroyuki-sato Thank you very much for your attention to my question. My configuration information is as follows:

  • mysql version: 5.7.35
  • embulk: 0.9.23
  • embulk-output-mysql: 0.10.2

in:
  type: redshift
  host: redshift_host
  user: redshift_user
  password: redshift_password
  database: redshift_database
  port: 5439
  fetch_rows: 1000
  query: select * from schema.table
  options: {useUnicode: true, characterEncoding: utf8}
out:
  type: mysql
  host: mysql_host
  port: 3306
  user: mysql_user
  password: mysql_password
  database: mysql_database
  schema: mysql_schema
  table: mysql_table
  create_table_option: DEFAULT CHARSET=utf8mb4
  options: {loglevel: 2, useUnicode: true, characterEncoding: utf8}
  mode: truncate_insert
  node_count: 16

I noticed that when I use mode truncate_insert to transfer data from Redshift to MySQL, Embulk first creates a temporary table in MySQL, stores the data in that temporary table, and then transfers the data from the temporary table to the target table. My Chinese emoji were stored successfully in Redshift, the temporary table was created successfully, and its character set is utf8mb4. However, when I insert the emoji data from Redshift into the MySQL temporary table, the above error occurs. If you need more information, please let me know.

YunLongND avatar May 07 '22 06:05 YunLongND

Hello, @YunLongND

  • Did you try to insert data into the mysql_table using the mysql command? (e.g. `insert into mysql_table values(1,"🍻🍣🍺");`)
  • Did you try to insert data into another table with mode: insert?

I got the same error message when the target table did not support utf8mb4. In my case, I changed the database character set and recreated the table. After that, I succeeded in inserting the data.

truncate_insert does not recreate the target table. Did you check the table schema using show create table mysql_table?
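For example, the parts to check in the output are the table's default charset and any per-column charset (mysql_table is the table name from your configuration; the exact output will differ):

SHOW CREATE TABLE mysql_table;
-- the printed CREATE TABLE statement should end with something like
--   ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ...
-- and no column should carry a narrower explicit CHARACTER SET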

I inserted the same data using mode: truncate_insert.

hiroyuki-sato avatar May 07 '22 14:05 hiroyuki-sato

@hiroyuki-sato I have checked that both my temporary table and my target table use utf8mb4. Inserting emoji into the table directly with SQL also succeeds. Based on the above communication, we cannot confirm where the problem is, so if there is no better suggestion on this issue, let's leave it here for now. I found a somewhat similar problem in another embulk project; I don't know if it's related. https://github.com/embulk/embulk-input-s3/issues/75 Thank you again for your efforts.

YunLongND avatar May 07 '22 23:05 YunLongND

Hello, @YunLongND

I succeeded in inserting emoji with the following environment and configuration.

  • embulk 0.9.24
  • embulk-output-mysql: 0.10.2
  • MySQL: 5.7.38

[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

[client]
default-character-set=utf8mb4

id,comment
2,embulk jruby
4,🍺🍻🍣

in:
  type: file
  path_prefix: sample_2
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: comment, type: string}
out:
  type: mysql
  host: 127.0.0.1
  mode: truncate_insert
  user: user
  password: password
  database: embulk_test
  table: example

docker-compose.yml

   version: '3'

   services:
     db:
       image: mysql:5.7
       container_name: mysql_container
       environment:
         MYSQL_ROOT_PASSWORD: password
         MYSQL_DATABASE: embulk_test
         MYSQL_USER: user
         MYSQL_PASSWORD: password
       volumes:
       - ./docker/db/data:/var/lib/mysql
       - ./docker/db/my.cnf:/etc/mysql/conf.d/my.cnf
       restart: always
       ports:
       - 3306:3306

hiroyuki-sato avatar May 09 '22 14:05 hiroyuki-sato

I had the exact same problem, and reading the documentation, it seems the problem cannot be solved only by setting connector options.

https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html

For Connector/J 8.0.12 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for characterEncoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3.

setting

{ useUnicode: true, characterEncoding: UTF-8 }

with mysql-connector-java:5.1.44 would use utf8mb3.

https://stackoverflow.com/questions/44591895/utf8mb4-in-mysql-workbench-and-jdbc

Starting from MySQL Connector/J 5.1.47,

When UTF-8 is used for characterEncoding in the connection string, it maps to the MySQL character set name utf8mb4.

You can check docs here

I think upgrading the driver version to 5.1.47+ or 8.0.13+ can solve the issue.

The current workaround is to set my.cnf correctly, as the documentation says.

rajyan avatar Aug 05 '22 02:08 rajyan

Hello, @rajyan

You can change Connector/J to 5.1.47+ (not Connector/J 8.x) using the driver_path option. Could you try it? Could you also provide a reproducing configuration/steps like this?

I think 🍣 is a four-byte UTF-8 character.

echo -n '🍣' | od -t x1
0000000    f0  9f  8d  a3
0000004

About updating the driver.

https://github.com/embulk/embulk-input-jdbc/pull/237#issuecomment-1202003618

We'll eventually want to update JDBC drivers, but actually, JDBC drivers often have silent incompatibility between versions. We have hesitated to update the "default" JDBC driver there, then. (E.g. default options, ...) Instead, you should be able to switch the JDBC driver version with the driver_path option by yourself, without rebuilding the plugin. Please try that for a while.

See also: https://github.com/embulk/embulk-input-jdbc/issues/238
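A minimal sketch of what switching the driver could look like in the out section (the jar location and most values here are illustrative; driver_path and options are the settings being discussed):

out:
  type: mysql
  driver_path: /path/to/mysql-connector-java-5.1.49.jar
  host: mysql_host
  port: 3306
  user: mysql_user
  password: mysql_password
  database: mysql_database
  table: mysql_table
  mode: truncate_insert
  options: {useUnicode: true, characterEncoding: UTF-8}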

hiroyuki-sato avatar Aug 05 '22 03:08 hiroyuki-sato

Hi, @hiroyuki-sato

Thank you for noticing driver_path option!

I created a reproducible repo for the error. https://github.com/rajyan/embulk-mysql-utf8

The error occurred when setting options: { characterEncoding: UTF-8 }, ~~and using Connector/J 5.1.49 didn't help. (maybe we can't change the charset to utf8mb4 via a connector option?)~~ ~~Only setting mysqld --character-set-server=utf8mb4 --collation-server=utf8mb4_general_ci solved the issue.~~ edit: updated the repro correctly

rajyan avatar Aug 05 '22 11:08 rajyan

Sorry, I understood my problem now, and ~~maybe it was not related.~~ The Unicode error happens when inserting into the intermediate table (because the default charset is not utf8mb4, the columns of the intermediate table are not utf8mb4).
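A quick illustration of that behavior (the table name is hypothetical): a table created without an explicit charset inherits the database default, so its text columns can end up as utf8/utf8mb3 even when the target table itself is utf8mb4.

-- hypothetical table in a database whose default charset is not utf8mb4
CREATE TABLE example_intermediate (comment TEXT);
SHOW CREATE TABLE example_intermediate;  -- DEFAULT CHARSET follows the database default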

rajyan avatar Aug 05 '22 11:08 rajyan

Updated the repro by adding create_table_option: DEFAULT CHARSET=utf8mb4 https://github.com/rajyan/embulk-mysql-utf8/commit/80880d887fc6dd7b9a613e2645289dc03d30eacd

I could confirm that the error only happens with the current default connector, and upgrading Connector/J to 5.1.49 solves the error. ~~Although the inserted 4-byte chars are broken even when using 5.1.49 ...~~

rajyan avatar Aug 05 '22 12:08 rajyan

Although the inserted 4-byte chars are broken even when using 5.1.49 ...

Sorry, never mind. It was just my mysql client.

I can confirm that upgrading Connector/J to 5.1.49 solves the error! 😄 https://github.com/rajyan/embulk-mysql-utf8/commit/0b32bcf8bfee08249b6c2433870ed626fa794854

rajyan avatar Aug 05 '22 12:08 rajyan

Thank you for https://github.com/embulk/embulk-input-jdbc/issues/238 and https://github.com/embulk/embulk-input-jdbc/pull/237#issuecomment-1202003618

Because there are several simple workarounds/solutions, like using driver_path or setting the server/database default charsets, I'm no longer really blocked by this issue 👍

rajyan avatar Aug 05 '22 13:08 rajyan

Summary of the problem

You'll get

org.embulk.exec.PartialExecutionException: java.lang.RuntimeException: java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\xA4\xA3' for column ......

if the MySQL database's default charset is not set to utf8mb4.

This error occurs even if you set

  • the table and column charset to utf8mb4
  • options: { useUnicode: true, characterEncoding: UTF-8 }

because Connector/J before version 5.1.47 uses utf8 (utf8mb3) as the default for the connection when options: { useUnicode: true, characterEncoding: UTF-8 } is set.

The current solutions are:

  • set the MySQL database default charset to utf8mb4 (see the sketch below), or
  • install a driver newer than 5.1.47 and point to it with driver_path
    • in those versions the connector maps characterEncoding: UTF-8 to utf8mb4, as explained above
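A minimal sketch of the first option (the database name is illustrative; this changes the default for tables created afterwards, including the intermediate tables Embulk creates, while existing tables keep their charset). Alternatively, set character-set-server=utf8mb4 in my.cnf as shown earlier in this thread.

ALTER DATABASE mysql_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;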

rajyan avatar Aug 26 '22 01:08 rajyan