Unicode/Chinese replication issue.
Describe the bug
This happens when I replicate a table containing Chinese characters from PostgreSQL to MSSQL. The verbose log is included below.
What I saw in PostgreSQL:
id col_chi
1 陳大文
2 吳小明
What I saw after replicating to MSSQL with ReplicaDB:
id col_chi
1 ???
2 ???
If I instead run an INSERT ... SELECT from PostgreSQL through a linked server on MSSQL, the characters are preserved:
insert into test_chinese_1 select * from openquery(link_postgres, 'select * from public.test_chinese_1;')
id col_chi
1 陳大文
2 吳小明
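Since the linked-server path keeps the characters, the sink column itself can evidently store them, so the loss presumably happens somewhere along the JDBC path. The following is a minimal Python sketch (not ReplicaDB code, and not a confirmed diagnosis) of how one lossy charset conversion turns each Chinese character into a literal `?`. The helper name `lossy_roundtrip` and the choice of `latin-1` as the lossy charset are assumptions made for illustration only.

```python
def lossy_roundtrip(text: str, encoding: str) -> str:
    """Encode text into `encoding`, substituting '?' for every character
    the charset cannot represent, then decode the bytes back to a string."""
    return text.encode(encoding, errors="replace").decode(encoding)


if __name__ == "__main__":
    for name in ("陳大文", "吳小明"):
        # latin-1 has no CJK characters, so each one is replaced by '?'.
        # utf-8 can represent them, so the value survives unchanged.
        print(name, "->", lossy_roundtrip(name, "latin-1"),
              "|", lossy_roundtrip(name, "utf-8"))
```

If something similar happens in the replication path, the sink would receive real question marks rather than mis-decoded bytes, which would be consistent with the `???` shown above.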
Verbose log
[~]$ /replicadb/bin/replicadb -v --options-file /replicadb/conf/_replicadb_bis.conf --source-table public.test_chinese_1 --sink-table dbo.test_chinese_1 --mode complete-atomic
Picked up JAVA_TOOL_OPTIONS: -Djdbc.drivers=org.postgresql.Driver -Dfile.encoding=UTF8 -Dclient.encoding.override=GBK -Dpostgresql.enable_sspi=true -Duser.language=zh -Duser.country=TW
2023-08-01 10:17:04,871 INFO ReplicaDB:63 Running ReplicaDB version: 0.15.0
2023-08-01 10:17:04,877 INFO ReplicaDB:66 Setting verbose mode INFO
2023-08-01 10:17:05,268 INFO SQLServerManager:133 Creating staging table with this command: SELECT * INTO staging.test_chinese_1repdb010 FROM dbo.test_chinese_1 WHERE 0 = 1
2023-08-01 10:17:05,272 INFO SqlManager:388 Atomic and asynchronous deletion of all data from the sink table with this command: DELETE FROM dbo.test_chinese_1
2023-08-01 10:17:05,274 INFO ReplicaTask:35 Starting TaskId-0
2023-08-01 10:17:05,443 INFO SqlManager:128 TaskId-0: Executing SQL statement: SELECT * FROM public.test_chinese_1 OFFSET ?
2023-08-01 10:17:05,451 INFO SqlManager:148 TaskId-0: With args: 0,
2023-08-01 10:17:05,524 WARN ConnManager:188 Options source-columns and sink-columns are null, getting from Source ResultSetMetaData: id,col_chi
2023-08-01 10:17:05,524 INFO ReplicaTask:67 A total of 0 rows processed by task 0
2023-08-01 10:17:05,526 INFO ReplicaDB:120 Waiting for the asynchronous task to be completed...
2023-08-01 10:17:05,526 INFO SQLServerManager:50 IF OBJECTPROPERTY(OBJECT_ID('dbo.test_chinese_1'), 'TableHasIdentity') = 1 SET IDENTITY_INSERT dbo.test_chinese_1 ON
2023-08-01 10:17:05,526 INFO SqlManager:430 Inserting data from staging table to sink table within a transaction: INSERT INTO dbo.test_chinese_1 (id,col_chi) SELECT id,col_chi FROM staging.test_chinese_1repdb010
2023-08-01 10:17:05,528 INFO SQLServerManager:50 IF OBJECTPROPERTY(OBJECT_ID('dbo.test_chinese_1'), 'TableHasIdentity') = 1 SET IDENTITY_INSERT dbo.test_chinese_1 OFF
2023-08-01 10:17:05,529 INFO SqlManager:462 Dropping staging table with this command: DROP TABLE staging.test_chinese_1repdb010
2023-08-01 10:17:05,531 INFO ReplicaDB:54 Total process time: 668ms
To Reproduce
Steps to reproduce the behaviour:
- Source table DDL (PostgreSQL):
create table public.test_chinese_1 (
  id serial,
  col_chi CHARACTER VARYING(10)
);
insert into public.test_chinese_1 (col_chi) values ('陳大文');
insert into public.test_chinese_1 (col_chi) values ('吳小明');
- Sink table DDL (MSSQL):
create table test_chinese_1 (
  id int,
  col_chi nvarchar(10)
)
- ReplicaDB configuration (options file):
jobs=1
fetch.size=1000
source.connect=jdbc:postgresql://postger_server:5432/nih?useUnicode=yes&characterEncoding=UTF8
source.user=
source.password=
# source.connect.parameter.useUnicode=yes
# source.connect.parameter.characterEncoding=UTF8
sink.connect=jdbc:sqlserver://mssql_server:1433;database=DM_GM_BIS_REPL;useUnicode=true;characterEncoding=UTF8
sink.user=
sink.password=
# sink.connect.parameter.[parameter_name]=parameter_value
# sink.connect.parameter.useUnicode=true
# sink.connect.parameter.characterEncoding=UTF8
Expected behavior
The Chinese characters should arrive intact in the MSSQL sink table. I tried several ways of forcing JDBC to use Unicode with UTF8, Big5, and GBK encodings, but I am still unable to keep the Chinese characters.
Additional context
I also tried setting JAVA_TOOL_OPTIONS at the OS level:
export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8 -Dclient.encoding.override=GBK"
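As a sanity check on the encodings tried above, the Python sketch below tests whether the two sample values can be represented at all in UTF-8, Big5, and GBK. It uses Python's built-in codecs, not the JVM's, so it only illustrates the character sets themselves. The helper names `can_encode` and `encoding_report`, and the extra `latin-1` entry added for contrast, are assumptions and not part of the original setup.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Sample values taken from the source table in this report.
SAMPLES = ["陳大文", "吳小明"]
# The three encodings tried in the report, plus latin-1 for contrast (assumption).
ENCODINGS = ["utf-8", "big5", "gbk", "latin-1"]


def encoding_report(samples, encodings):
    """Map each encoding to whether it can represent all sample strings."""
    return {enc: all(can_encode(s, enc) for s in samples) for enc in encodings}


if __name__ == "__main__":
    for enc, ok in encoding_report(SAMPLES, ENCODINGS).items():
        print(f"{enc}: {'ok' if ok else 'cannot represent the samples'}")
```

All three encodings tried can carry these characters, which suggests (without proving it) that choosing among UTF8, Big5, and GBK is not what decides the outcome, and that some other conversion step in the path is the lossy one.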