seatunnel

MySQL CDC: duplicate synced data

Open HSLife1991 opened this issue 1 year ago • 2 comments

I am syncing MySQL data to Hive with the MySQL CDC connector.

1. The initial sync copied all the data, and the Hive table was correct.
2. I then changed some rows in the source MySQL table and removed some records.
3. After the existing MySQL rows were changed, the destination Hive table contains duplicate records.

HSLife1991, Apr 17 '24 13:04

With the default format, MySQL CDC generates two records when an upstream row is updated: one delete record and one insert record. That may be why your data is duplicated. If you change the format to compatible_debezium_json, it generates only one update record. You can switch the sink to Console to check the result.

In your case the destination is Hive, which does not support update or delete operations. CDC will also generate a lot of small files, so this may not be a good idea.
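
To illustrate how the two points above could combine into duplicates, here is a minimal Python sketch. It is not SeaTunnel code: `ChangeEvent`, `apply_append_only`, and `apply_with_deletes` are hypothetical names, and the assumption that an append-only sink simply skips delete records is mine, not something confirmed for the Hive sink.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    # Hypothetical event model: kind is "insert" or "delete".
    # These labels are illustrative, not SeaTunnel's actual row kinds.
    kind: str
    row: tuple  # (id, name)


def apply_append_only(events):
    """Simulate a sink that can only append rows.

    Assumption: delete records are skipped because the sink
    cannot remove rows that were already written.
    """
    table = []
    for event in events:
        if event.kind == "insert":
            table.append(event.row)
    return table


def apply_with_deletes(events):
    """Simulate a sink that honours both insert and delete records."""
    table = []
    for event in events:
        if event.kind == "insert":
            table.append(event.row)
        elif event.kind == "delete":
            table.remove(event.row)
    return table


# Initial snapshot of the source table.
initial = [
    ChangeEvent("insert", (1, "alice")),
    ChangeEvent("insert", (2, "bob")),
]

# One upstream UPDATE on id=1, emitted as a delete plus an insert.
update_as_pair = [
    ChangeEvent("delete", (1, "alice")),
    ChangeEvent("insert", (1, "alicia")),
]

events = initial + update_as_pair

append_only_result = apply_append_only(events)
with_deletes_result = apply_with_deletes(events)

print("append-only sink:", append_only_result)
print("sink with deletes:", with_deletes_result)
```

In this sketch the append-only sink ends up with two rows for id 1 (the old and the new value), while the sink that applies deletes keeps one. Checking the real event stream with the Console sink, as suggested above, would show whether this is what happens in your job.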

liunaijie, Apr 18 '24 06:04

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot], May 19 '24 00:05