
[Feature]Support CDC

Open CalvinKirs opened this issue 2 years ago • 9 comments

Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real-time to a downstream process or system.

CDC is mainly implemented in two ways: query-based and binlog-based. MySQL has the binlog (binary log) to record users' changes to the database, so one of the simplest and most efficient CDC implementations can be built on the binlog, and many open-source MySQL CDC implementations already work this way out of the box. The binlog is not the only way to implement CDC (at least for MySQL); even database triggers can provide similar functionality, but they fall short in terms of efficiency and impact on the database.

Typically, after a CDC tool captures changes to a database, it publishes the change events to a message queue for consumers. Debezium, for example, persists MySQL changes (it also supports PostgreSQL, MongoDB, etc.) to Kafka; by subscribing to the events in Kafka, we can read the change content and implement the functionality we need.
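To make the above concrete, here is a minimal sketch of interpreting one Debezium change event. It assumes the standard Debezium envelope fields (`before`, `after`, `op`, `source`); a real consumer would poll these records from a Kafka topic, but the event is inlined here for illustration.

```python
import json

# One Debezium-style change event (normally the value of a Kafka record).
event = json.loads("""
{
  "before": null,
  "after": {"id": 1, "name": "alice"},
  "source": {"db": "inventory", "table": "users"},
  "op": "c",
  "ts_ms": 1660000000000
}
""")

# Debezium op codes: c=create, u=update, d=delete, r=snapshot read.
OPS = {"c": "INSERT", "u": "UPDATE", "d": "DELETE", "r": "SNAPSHOT_READ"}

def describe(evt):
    """Turn a Debezium envelope into a (kind, table, row) summary."""
    kind = OPS.get(evt["op"], "UNKNOWN")
    table = f'{evt["source"]["db"]}.{evt["source"]["table"]}'
    # For deletes "after" is null, so fall back to the "before" image.
    row = evt["after"] if evt["after"] is not None else evt["before"]
    return kind, table, row

print(describe(event))  # ('INSERT', 'inventory.users', {'id': 1, 'name': 'alice'})
```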

As a data synchronization tool, I think we need to support CDC as a feature, and I want to hear from you all how it could be implemented in SeaTunnel.

CalvinKirs avatar Aug 10 '22 08:08 CalvinKirs

I'm going to use Debezium to get the incremental real-time binlog, pass it to the API of the new Connector, save the source state, and read it back.

2013650523 avatar Aug 12 '22 08:08 2013650523

> I'm going to use Debezium to get the incremental real-time binlog, pass it to the API of the new Connector, save the source state, and read it back.

Any detailed designs can be put here, we'll discuss them first.

CalvinKirs avatar Aug 12 '22 09:08 CalvinKirs

We can implement this based on Flink CDC, configure it through conf, and dynamically generate DDL SQL for Source and Sink.
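For illustration, a hypothetical SeaTunnel-style conf for such a CDC pipeline might look like the sketch below. The `MySQL-CDC` block name and its option keys are assumptions for the sake of the example, not a finalized connector API:

```hocon
env {
  job.mode = "STREAMING"
}

source {
  # Hypothetical CDC source block; option names are illustrative only
  MySQL-CDC {
    hostname = "localhost"
    port = 3306
    username = "cdc_user"
    password = "cdc_pass"
    database-name = "inventory"
    table-name = "inventory.users"
  }
}

sink {
  Console {}
}
```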

guanboo avatar Aug 12 '22 18:08 guanboo

What is the community's plan for achieving CDC in pure Java, without relying on any framework such as Flink?

TyrantLucifer avatar Aug 14 '22 05:08 TyrantLucifer

> We can implement this based on Flink CDC, configure it through conf, and dynamically generate DDL SQL for Source and Sink.

Flink CDC is based on Flink, and our connectors need to be independent of the engine.

CalvinKirs avatar Aug 14 '22 10:08 CalvinKirs

We can implement it based on Debezium and Netflix's DBLog parallel algorithm

ashulin avatar Aug 19 '22 12:08 ashulin

Implement CDC data synchronization to Hudi based on Debezium Server:

https://github.com/apache/hudi/issues/6853

@CalvinKirs

melin avatar Oct 08 '22 09:10 melin

> We can implement it based on Debezium and Netflix's DBLog parallel algorithm

  1. Supports multi-table and sharded sources (easy configuration)
  2. Supports parallel reading of historical data (fast synchronization, even for large tables with billions of rows)
  3. Supports reading incremental data (CDC)
  4. Supports heartbeat detection (metrics, low-traffic tables)
  5. Supports dynamically adding new tables (easier to operate and maintain)
  6. Supports schema evolution (DDL)
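The parallel historical read in the list above relies on DBLog's watermark trick: a chunk of rows is selected between a low and a high watermark written to the change log, and any chunk row that also changed inside that window is dropped in favor of the fresher log event. A minimal sketch of that merge step, with illustrative names (`ChangeEvent`, `merge_chunk_with_log`) that are not SeaTunnel API:

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    key: int            # primary key the event touches
    payload: str        # new row image
    watermark: str = "" # non-empty only for watermark marker events

def merge_chunk_with_log(log_events, chunk_rows, low_wm, high_wm):
    """Emit log events in order; snapshot rows changed between the low and
    high watermarks are dropped, because the log event supersedes the
    (possibly stale) chunk read."""
    chunk = dict(chunk_rows)   # key -> snapshot row image
    output = []
    in_window = False
    for ev in log_events:
        if ev.watermark == low_wm:
            in_window = True
            continue
        if ev.watermark == high_wm:
            break
        output.append(("log", ev.key, ev.payload))
        if in_window and ev.key in chunk:
            del chunk[ev.key]  # superseded by a fresher log event
    for key, payload in sorted(chunk.items()):
        output.append(("snapshot", key, payload))
    return output

# The chunk was read between the watermarks; key 2 changed concurrently,
# so its stale snapshot image is discarded in favor of the log event.
events = [
    ChangeEvent(0, "", watermark="lw-1"),
    ChangeEvent(2, "v2-new"),
    ChangeEvent(0, "", watermark="hw-1"),
]
chunk = {1: "v1", 2: "v2-stale"}
print(merge_chunk_with_log(events, chunk.items(), "lw-1", "hw-1"))
# [('log', 2, 'v2-new'), ('snapshot', 1, 'v1')]
```

Because each chunk is bounded by its own watermark pair, many chunks can be read in parallel without ever locking the source table.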

ashulin avatar Oct 12 '22 02:10 ashulin

I have several opinions on the necessity and implementation of CDC:

Necessity: for a new-generation data integration platform, CDC is very necessary, because in actual use the enterprise demand for change stream capture is gradually increasing. The rapid development of Flink CDC is a good example. If SeaTunnel does not strengthen its ability to process CDC, users will face the problem of needing additional CDC tools, such as Canal, Debezium, or Flink CDC, after using SeaTunnel to process batch data.

In addition, the change stream capture underneath Flink CDC also uses Debezium.

Several implementation concerns: engine before connector. The common databases with the most demand, such as MySQL and Oracle, have roughly the same CDC implementation schemes, all based on change logs. For example, Flink's Oracle CDC uses Debezium for change capture, and Debezium in turn uses a LogMiner-based solution; StreamSets' handling of Oracle is likewise based on LogMiner. Therefore, the priority of change capture itself should be lowered, and the design of the processing engine should be considered first. This includes the unified handling of CDC processing concerns such as consistency, checkpoint/resume, batch-stream integration, fault tolerance, and failover. In this part, Flink CDC is worth referencing.

Since SeaTunnel 2.3.0, it has had its own computing engine, which is the cornerstone of processing CDC (most of the time, when the outlet of the CDC stream is a single thread, the processing does not need to be distributed, so it does not have to rely on computing engines such as Flink or Spark).

Format compatibility. In most cases, once SeaTunnel has CDC processing capabilities, it will need to process messages sent to Kafka by other CDC tools. Therefore, compatibility with common formats, such as the Flink CDC / Debezium format, should also be considered. In other words, SeaTunnel must decide whether to design its own format independently to ensure rapid development, or stay compatible with the common component formats on the market; this is also worth considering.
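As a sketch of the format-compatibility point, the snippet below normalizes change messages from two common wire formats, the Debezium envelope and Canal JSON, into one internal `(op, table, row)` record. The field names follow the public formats; the `normalize()` helper itself is hypothetical, not SeaTunnel code:

```python
def normalize(msg: dict):
    """Map a Debezium or Canal JSON change message to (op, table, row) tuples."""
    if "op" in msg:  # Debezium envelope: op is c/u/d/r, single row per message
        op = {"c": "INSERT", "u": "UPDATE", "d": "DELETE", "r": "READ"}[msg["op"]]
        table = f'{msg["source"]["db"]}.{msg["source"]["table"]}'
        row = msg["after"] if msg["after"] is not None else msg["before"]
        return [(op, table, row)]
    # Canal JSON: "type" is INSERT/UPDATE/DELETE, "data" is a list of rows
    table = f'{msg["database"]}.{msg["table"]}'
    return [(msg["type"], table, row) for row in msg["data"]]

debezium = {"op": "u", "source": {"db": "d", "table": "t"},
            "before": {"id": 1, "v": 0}, "after": {"id": 1, "v": 9}}
canal = {"database": "d", "table": "t", "type": "DELETE",
         "data": [{"id": 1, "v": 9}]}
print(normalize(debezium))  # [('UPDATE', 'd.t', {'id': 1, 'v': 9})]
print(normalize(canal))     # [('DELETE', 'd.t', {'id': 1, 'v': 9})]
```

Supporting such a normalization layer would let SeaTunnel consume Kafka topics produced by existing CDC tools regardless of which one wrote them.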

cason0126 avatar Nov 10 '22 15:11 cason0126

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Mar 24 '23 00:03 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar Mar 31 '23 00:03 github-actions[bot]