seatunnel icon indicating copy to clipboard operation
seatunnel copied to clipboard

[feature][CDC] The basic implementation of the CDC source Reader in the snapshot phase

Open ashulin opened this issue 2 years ago • 0 comments

Search before asking

  • [X] I had searched in the feature and found no similar feature requirement.

Description

This is a subtask of https://github.com/apache/incubator-seatunnel/issues/3175 to track completion.

Snapshot phase

snapshot-phase

The enumerator generates multiple SnapshotSplits of a table and assigns them to the reader

//  pseudo-code. 
public class SnapshotSplit implements SourceSplit {
    private final String splitId;
    private final TableId tableId;
    private final SeaTunnelRowType splitKeyType;
    private final Object splitStart;
    private final Object splitEnd;
}

When a SnapshotSplit reading is completed, the reader reports the high watermark of the split to the enumerator, When all SnapshotSplits report high watermark, the enumerator enters the incremental phase.

//  pseudo-code. 
public class CompletedSnapshotSplitReportEvent implements SourceEvent {
    private final String splitId;
    private final Offset highWatermark;
}

Snapshot phase - SnapshotSplit read flow

snapshot-read

There are 4 steps:

  1. log low watermark: get current log offset before reading snapshot data.

  2. read SnapshotSplit data: Read the range data belonging to the split

    • case 1: step 1 & 2 cannot be atomized (MySQL)

    Because we can't add table locks, and we can't add interval locks based on the low watermark, steps 1 & 2 are not atomic.

    • Exactly-once: use memory table to hold history data & filter the log data from the low to high watermark

    • At-least-once: direct output data & use low watermark instead of high watermark

    • case 2: step 1 & 2 can be atomized (Oracle)

    We can use for scn to ensure the atomicity of the two steps

    • Exactly-once: direct output data & use low watermark instead of high watermark
  3. log high watermark:

    • step 2 case 1 & Exactly-once: get current log offset after reading snapshot data.
    • other: use low watermark instead of high watermark
  4. if high > low watermark, read range log data

Snapshot phase - MySQL Snapshot Read & Exactly-once

mysql-snapshot-read

Because we can't determine where the query statement is executed between the high and low water levels, in order to ensure the exact-once of the data, we need to use the memory table to temporarily save the data.

  1. log low watermark: get current log offset before reading snapshot data.
  2. read SnapshotSplit data: read the range data belonging to the split and write to the memory table.
  3. log high watermark: get current log offset after reading snapshot data.
  4. read range log data: read log data and write to memory table
  5. output the data of the memory table and release memory usage.

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

ashulin avatar Nov 01 '22 12:11 ashulin