seatunnel [feature][CDC] The basic implementation of the CDC source Reader in the snapshot phase

Open ashulin opened this issue 2 years ago • 0 comments

Search before asking

[X] I have searched the existing feature requests and found no similar requirement.

Description

This is a subtask of https://github.com/apache/incubator-seatunnel/issues/3175 to track completion.

Snapshot phase

[Figure: snapshot-phase]

The enumerator generates multiple SnapshotSplits for a table and assigns them to the readers.

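// Pseudo-code: imports, constructor and getters omitted.
public class SnapshotSplit implements SourceSplit {
    // Unique ID of this split
    private final String splitId;
    // Table the split belongs to
    private final TableId tableId;
    // Row type of the split key column(s)
    private final SeaTunnelRowType splitKeyType;
    // Start of the key range covered by this split
    private final Object splitStart;
    // End of the key range covered by this split
    private final Object splitEnd;
}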
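
As a rough illustration of how an enumerator might cut a table into SnapshotSplits, the sketch below chunks a numeric key range into fixed-size ranges. The classes `KeyRangeSplit` and `SplitChunker`, the half-open `[splitStart, splitEnd)` convention, and the `tableId:index` split ID format are assumptions made for this example only; they are not part of the SeaTunnel API, and a real implementation would also have to handle non-numeric and unevenly distributed keys.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for SnapshotSplit (numeric split key only).
final class KeyRangeSplit {
    final String splitId;
    final String tableId;
    final long splitStart; // inclusive (assumed convention)
    final long splitEnd;   // exclusive (assumed convention)

    KeyRangeSplit(String splitId, String tableId, long splitStart, long splitEnd) {
        this.splitId = splitId;
        this.tableId = tableId;
        this.splitStart = splitStart;
        this.splitEnd = splitEnd;
    }
}

final class SplitChunker {
    // Cuts the key range [minKey, maxKey] into consecutive chunks of at most chunkSize keys.
    static List<KeyRangeSplit> split(String tableId, long minKey, long maxKey, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<KeyRangeSplit> splits = new ArrayList<>();
        long start = minKey;
        int index = 0;
        while (start <= maxKey) {
            // Clamp the exclusive end to maxKey + 1 so the last chunk may be shorter.
            long end = Math.min(start + chunkSize, maxKey + 1);
            splits.add(new KeyRangeSplit(tableId + ":" + index, tableId, start, end));
            start = end;
            index++;
        }
        return splits;
    }
}
```

For example, `SplitChunker.split("db.t", 1, 10, 4)` yields three splits covering `[1, 5)`, `[5, 9)` and `[9, 11)`.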

When the reader finishes reading a SnapshotSplit, it reports the split's high watermark to the enumerator. Once every SnapshotSplit has reported its high watermark, the enumerator enters the incremental phase.

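// Pseudo-code: imports, constructor and getters omitted.
public class CompletedSnapshotSplitReportEvent implements SourceEvent {
    // ID of the SnapshotSplit that has been read completely
    private final String splitId;
    // Log offset reported as the high watermark of that split
    private final Offset highWatermark;
}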
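
The enumerator-side bookkeeping implied by this event could look roughly like the sketch below. `SnapshotCompletionTracker` and its methods are hypothetical names, and offsets are modeled as plain `long` values instead of the `Offset` type purely to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical enumerator-side bookkeeping; offsets are modeled as plain longs.
final class SnapshotCompletionTracker {
    private final Set<String> assignedSplitIds;
    private final Map<String, Long> highWatermarks = new HashMap<>();

    SnapshotCompletionTracker(Set<String> assignedSplitIds) {
        this.assignedSplitIds = new HashSet<>(assignedSplitIds);
    }

    // Called when a reader reports a completed split. Returns true once every
    // assigned split has reported its high watermark.
    boolean onSplitCompleted(String splitId, long highWatermark) {
        if (!assignedSplitIds.contains(splitId)) {
            throw new IllegalArgumentException("Unknown split: " + splitId);
        }
        highWatermarks.put(splitId, highWatermark);
        return isSnapshotPhaseFinished();
    }

    // True when the enumerator may switch to the incremental phase.
    boolean isSnapshotPhaseFinished() {
        return highWatermarks.keySet().containsAll(assignedSplitIds);
    }
}
```

Keeping the reported watermark per split, rather than only a counter, makes duplicate reports harmless and leaves the offsets available for whatever the incremental phase needs.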

Snapshot phase - SnapshotSplit read flow

[Figure: snapshot-read]

There are 4 steps:

1. Log low watermark: get the current log offset before reading snapshot data.
2. Read SnapshotSplit data: read the range of data belonging to the split.
   - Case 1: steps 1 & 2 cannot be made atomic (MySQL)
     Because we can't take table locks, and we can't take range (interval) locks based on the low watermark, steps 1 & 2 are not atomic.
     - Exactly-once: use a memory table to hold the historical (snapshot) data & filter the log data from the low to the high watermark
     - At-least-once: output the data directly & use the low watermark as the high watermark
   - Case 2: steps 1 & 2 can be made atomic (Oracle)
     We can use a query pinned to an SCN (an Oracle flashback query, AS OF SCN) to make the two steps atomic.
     - Exactly-once: output the data directly & use the low watermark as the high watermark
3. Log high watermark:
   - Step 2, case 1 & exactly-once: get the current log offset after reading snapshot data.
   - Otherwise: use the low watermark as the high watermark.
4. If the high watermark > the low watermark, read the log data in that range.
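
The choice of high watermark described in the steps above can be condensed into a small decision function. This is only a sketch: `WatermarkPolicy`, `Guarantee`, and the use of `long` offsets are illustrative assumptions, and a real reader would fetch the post-snapshot offset only in the branch that needs it.

```java
// Hypothetical helper illustrating how the high watermark could be chosen.
final class WatermarkPolicy {
    enum Guarantee { EXACTLY_ONCE, AT_LEAST_ONCE }

    // atomicSnapshot: true when steps 1 & 2 are atomic (the Oracle case),
    // false when they are not (the MySQL case).
    static long highWatermark(boolean atomicSnapshot, Guarantee guarantee,
                              long lowWatermark, long offsetAfterSnapshot) {
        if (!atomicSnapshot && guarantee == Guarantee.EXACTLY_ONCE) {
            // Only this combination needs a separate offset taken after the snapshot read.
            return offsetAfterSnapshot;
        }
        // Every other combination reuses the low watermark.
        return lowWatermark;
    }

    // Step 4: range log data is read only if the high watermark is past the low one.
    static boolean needsLogBackfill(long lowWatermark, long highWatermark) {
        return highWatermark > lowWatermark;
    }
}
```

With this policy, the at-least-once and atomic cases never trigger step 4, because their high watermark equals the low watermark.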

Snapshot phase - MySQL Snapshot Read & Exactly-once

[Figure: mysql-snapshot-read]

Because we can't determine at which point between the low and high watermarks the query statement is actually executed, we need a memory table to temporarily hold the data in order to guarantee exactly-once semantics.

1. Log low watermark: get the current log offset before reading snapshot data.
2. Read SnapshotSplit data: read the range of data belonging to the split and write it to the memory table.
3. Log high watermark: get the current log offset after reading snapshot data.
4. Read range log data: read the log data between the low and high watermarks and write it to the memory table.
5. Output the data of the memory table and release the memory.
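
A minimal sketch of the memory-table idea follows. It assumes rows are keyed by a numeric primary key, that a log event with a `null` row represents a delete, that the log window is `(low, high]`, and that the split covers `[splitStart, splitEnd)`. `LogEvent` and `SnapshotBackfill` are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical change event: a null row means the key was deleted.
final class LogEvent {
    final long offset;
    final long key;
    final String row;

    LogEvent(long offset, long key, String row) {
        this.offset = offset;
        this.key = key;
        this.row = row;
    }
}

final class SnapshotBackfill {
    // Loads the snapshot rows of one split into a memory table, applies the log
    // events that fall inside the watermark window and the split's key range,
    // and returns the rows to emit.
    static List<String> merge(Map<Long, String> snapshotRows, List<LogEvent> log,
                              long splitStart, long splitEnd, long low, long high) {
        Map<Long, String> memoryTable = new LinkedHashMap<>(snapshotRows);
        for (LogEvent e : log) {
            boolean inWindow = e.offset > low && e.offset <= high;
            boolean inSplit = e.key >= splitStart && e.key < splitEnd;
            if (!inWindow || !inSplit) {
                continue; // belongs to another split or to the incremental phase
            }
            if (e.row == null) {
                memoryTable.remove(e.key); // delete
            } else {
                memoryTable.put(e.key, e.row); // insert or update
            }
        }
        return new ArrayList<>(memoryTable.values());
    }
}
```

Because every event in the window is applied as an upsert or delete by key, it does not matter at which point inside the window the snapshot query actually ran: the memory table converges to the state at the high watermark before anything is emitted.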

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Nov 01 '22 12:11 ashulin
