seatunnel
[feature][CDC] The basic implementation of the CDC source Reader in the snapshot phase
Search before asking
- [X] I have searched the existing feature requests and found no similar feature requirement.
Description
This is a subtask of https://github.com/apache/incubator-seatunnel/issues/3175 to track completion.
Snapshot phase

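The enumerator generates multiple SnapshotSplits for a table and assigns them to the reader.
// pseudo-code.
public class SnapshotSplit implements SourceSplit {
    // Unique identifier of this split.
    private final String splitId;
    // Table that the split belongs to.
    private final TableId tableId;
    // Type of the key used to split the table.
    private final SeaTunnelRowType splitKeyType;
    // Boundaries of the key range covered by this split.
    private final Object splitStart;
    private final Object splitEnd;
}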
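As an illustration of how an enumerator could cut a table into splits, here is a minimal sketch. It is not the SeaTunnel implementation: `SimpleSnapshotSplit` and `SnapshotSplitPlanner` are hypothetical names, the split key is assumed to be a single numeric column, and the convention of an inclusive start, an exclusive end, and `null` for an unbounded side is an assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for SnapshotSplit: String table id, numeric split key. */
final class SimpleSnapshotSplit {
    final String splitId;
    final String tableId;
    final Long splitStart; // inclusive; null means unbounded (assumed convention)
    final Long splitEnd;   // exclusive; null means unbounded (assumed convention)

    SimpleSnapshotSplit(String splitId, String tableId, Long splitStart, Long splitEnd) {
        this.splitId = splitId;
        this.tableId = tableId;
        this.splitStart = splitStart;
        this.splitEnd = splitEnd;
    }
}

/** Hypothetical planner: cuts the key range of a table into fixed-size chunks. */
final class SnapshotSplitPlanner {
    static List<SimpleSnapshotSplit> plan(String tableId, long minKey, long maxKey, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<SimpleSnapshotSplit> splits = new ArrayList<>();
        Long start = null;                  // first split is open on the left
        long boundary = minKey + chunkSize; // end of the first chunk
        int index = 0;
        while (boundary <= maxKey) {
            splits.add(new SimpleSnapshotSplit(tableId + ":" + index++, tableId, start, boundary));
            start = boundary;
            boundary += chunkSize;
        }
        // Last split is open on the right so rows inserted above maxKey are still covered.
        splits.add(new SimpleSnapshotSplit(tableId + ":" + index, tableId, start, null));
        return splits;
    }
}
```

With keys 1 to 10 and a chunk size of 4, this sketch yields three splits: (unbounded, 5), [5, 9) and [9, unbounded).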
When the reading of a SnapshotSplit is completed, the reader reports the high watermark of the split to the enumerator.
When all SnapshotSplits have reported their high watermarks, the enumerator enters the incremental phase.
// pseudo-code.
public class CompletedSnapshotSplitReportEvent implements SourceEvent {
    private final String splitId;
    private final Offset highWatermark;
}
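The bookkeeping on the enumerator side can be sketched as follows. This is only an illustration under assumptions: `SnapshotCompletionTracker` is a hypothetical helper, and the `Offset` is modeled as a plain `long` to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that tracks which assigned splits have reported a high watermark. */
final class SnapshotCompletionTracker {
    private final Set<String> pending = new HashSet<>();
    private final Map<String, Long> highWatermarks = new HashMap<>();

    SnapshotCompletionTracker(Set<String> assignedSplitIds) {
        pending.addAll(assignedSplitIds);
    }

    /**
     * Records the report for one split (offset modeled as long, an assumption).
     * Returns true once every assigned split has reported.
     */
    boolean onCompleted(String splitId, long highWatermark) {
        boolean wasPending = pending.remove(splitId);
        if (!wasPending && !highWatermarks.containsKey(splitId)) {
            throw new IllegalArgumentException("unknown split: " + splitId);
        }
        // A repeated report for the same split simply overwrites the stored value.
        highWatermarks.put(splitId, highWatermark);
        return pending.isEmpty();
    }

    /** True when the enumerator may switch to the incremental phase. */
    boolean snapshotPhaseFinished() {
        return pending.isEmpty();
    }
}
```

In this sketch the enumerator would call `onCompleted` for each CompletedSnapshotSplitReportEvent it receives and move to the incremental phase when the call returns true.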
Snapshot phase - SnapshotSplit read flow

There are 4 steps:
- log low watermark: get the current log offset before reading snapshot data.
- read SnapshotSplit data: read the range data belonging to the split.
  - case 1: steps 1 & 2 cannot be made atomic (MySQL)
    Because we can't add table locks, and we can't add interval locks based on the low watermark, steps 1 & 2 are not atomic.
    - Exactly-once: use a memory table to hold the history data & filter the log data from the low to the high watermark
    - At-least-once: output the data directly & use the low watermark instead of the high watermark
  - case 2: steps 1 & 2 can be made atomic (Oracle)
    We can use "for scn" to ensure the atomicity of the two steps.
    - Exactly-once: output the data directly & use the low watermark instead of the high watermark
- log high watermark:
  - step 2 case 1 & Exactly-once: get the current log offset after reading snapshot data.
  - other: use the low watermark instead of the high watermark
- if high > low watermark, read the range log data
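The rule for choosing the high watermark in step 3, and the check in step 4, can be written down as a small decision function. This is a sketch under assumptions: `WatermarkPlan` and `WatermarkResolver` are hypothetical names, offsets are modeled as `long`, and `currentLogOffset` stands in for whatever call returns the current log position of the database.

```java
import java.util.function.LongSupplier;

/** Hypothetical value holder for the two watermarks of one split. */
final class WatermarkPlan {
    final long low;
    final long high;

    WatermarkPlan(long low, long high) {
        this.low = low;
        this.high = high;
    }

    /** Step 4: the range log data is read only if high > low. */
    boolean needsLogRead() {
        return high > low;
    }
}

/** Hypothetical resolver, meant to be called after the snapshot data of the split was read. */
final class WatermarkResolver {
    static WatermarkPlan resolve(boolean atomicSnapshot,
                                 boolean exactlyOnce,
                                 long low,
                                 LongSupplier currentLogOffset) {
        // Only "case 1 & exactly-once" fetches a fresh offset; every other
        // combination reuses the low watermark as the high watermark.
        boolean fetchFreshOffset = !atomicSnapshot && exactlyOnce;
        long high = fetchFreshOffset ? currentLogOffset.getAsLong() : low;
        return new WatermarkPlan(low, high);
    }
}
```

Because every combination other than "case 1 & exactly-once" sets high equal to low, `needsLogRead()` is false for those combinations and step 4 is skipped.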
Snapshot phase - MySQL Snapshot Read & Exactly-once

Because we can't determine at which point between the low and high watermarks the query statement is executed, we need to use a memory table to temporarily hold the data in order to ensure exactly-once semantics.
- log low watermark: get current log offset before reading snapshot data.
- read SnapshotSplit data: read the range data belonging to the split and write to the memory table.
- log high watermark: get current log offset after reading snapshot data.
- read range log data: read the log data between the low and high watermarks and write it to the memory table.
- output the data of the memory table and release the memory.
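The steps above can be sketched with an in-memory model. This is an illustration only: `LogEvent` and `SnapshotSplitMerger` are hypothetical names, offsets and keys are modeled as `long`, the log is assumed to carry full-row upserts and deletes ordered by offset, and the window "low < offset <= high" and the split range "start inclusive, end exclusive" are assumed conventions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical change-log entry: a full-row upsert or a delete for one key. */
final class LogEvent {
    enum Kind { UPSERT, DELETE }

    final long offset;
    final Kind kind;
    final long key;
    final String value; // null for DELETE

    LogEvent(long offset, Kind kind, long key, String value) {
        this.offset = offset;
        this.kind = kind;
        this.key = key;
        this.value = value;
    }
}

/** Hypothetical merger implementing the memory-table approach for one split. */
final class SnapshotSplitMerger {
    static Map<Long, String> merge(Map<Long, String> snapshotRows,
                                   List<LogEvent> log,
                                   long low, long high,
                                   long splitStart, long splitEnd) {
        // Step 2: snapshot rows go into the memory table, keyed by split key.
        Map<Long, String> memoryTable = new TreeMap<>(snapshotRows);
        // Step 4: apply the log events of the window that fall inside the split range.
        for (LogEvent e : log) {
            boolean inWindow = e.offset > low && e.offset <= high;
            boolean inSplit = e.key >= splitStart && e.key < splitEnd;
            if (!inWindow || !inSplit) {
                continue;
            }
            if (e.kind == LogEvent.Kind.DELETE) {
                memoryTable.remove(e.key);
            } else {
                memoryTable.put(e.key, e.value);
            }
        }
        // Step 5: the caller outputs these rows and then drops the table.
        return memoryTable;
    }
}
```

Since upserts and deletes keyed by the split key are idempotent, replaying the whole window on top of the snapshot gives the state of the split at the high watermark, regardless of when the query ran between the two watermarks.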
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct