[Feature][Connector-V2 E2E] Data consistency test process design

Open EricJoy2048 opened this issue 2 years ago • 2 comments

Search before asking

  • [X] I had searched in the feature and found no similar feature requirement.

Description

We need to support data consistency tests in the Connector V2 E2E suite. I have some ideas about it, and everyone is welcome to discuss.

Test Sink Connector

Fake Source Connector

To test the data consistency of a sink connector, we can use the Fake Source connector. In the future, the Fake Source connector will support defining the number of rows and the primary key fields. Defining primary key fields is useful for testing sinks that implement exactly-once through idempotent writes. If we can simulate a task failure and then restore the task, we can complete the data consistency test.
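As a rough illustration of why primary keys matter here, the sketch below simulates a sink whose writes are upserts keyed by primary key. Everything in it is hypothetical (`IdempotentSinkSketch`, `fakeRow`, `runWithReplay` are not SeaTunnel APIs); it only assumes that the source emits deterministic rows and replays from the beginning after a failure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an idempotent sink: every write is an upsert by primary key.
class IdempotentSinkSketch {
    private final Map<Long, String> table = new HashMap<>();

    // Writing the same key twice overwrites the row instead of duplicating it.
    void write(long primaryKey, String value) {
        table.put(primaryKey, value);
    }

    int rowCount() {
        return table.size();
    }

    String get(long primaryKey) {
        return table.get(primaryKey);
    }

    // Deterministic row content, so a replayed row is identical to the original.
    static String fakeRow(long primaryKey) {
        return "row-" + primaryKey;
    }

    // Simulates one failed attempt followed by a restored attempt that replays from row 0.
    static IdempotentSinkSketch runWithReplay(int rowNum, int failAfter) {
        IdempotentSinkSketch sink = new IdempotentSinkSketch();
        // First attempt: the task dies after writing `failAfter` rows.
        for (long pk = 0; pk < failAfter; pk++) {
            sink.write(pk, fakeRow(pk));
        }
        // Restored attempt: the source replays every row, including those already written.
        for (long pk = 0; pk < rowNum; pk++) {
            sink.write(pk, fakeRow(pk));
        }
        return sink;
    }

    public static void main(String[] args) {
        IdempotentSinkSketch sink = runWithReplay(100, 40);
        System.out.println("rows in sink: " + sink.rowCount());
    }
}
```

With an append-only sink the same replay would leave 140 rows; with the keyed upsert the sink ends with exactly the configured row count, which is the property the E2E test would assert.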

How to simulate task failure and restore task.

I think we can use the Fake Source connector to simulate task failure as well, by adding a function to Fake Source that actively triggers a failure. To ensure Fake Source can replay reads after the task is restored, it also needs to support snapshots.
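A minimal sketch of that idea, under stated assumptions: the source state is a single row offset, the failure is triggered once at a configured offset, and a small driver loop plays the role of the engine's checkpoint and restore. The class and method names (`ReplayableFakeSourceSketch`, `snapshotState`, `restoreState`, `runJob`) are made up for illustration and are not the actual Fake Source interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical fake source that can fail on demand and resume from a snapshot.
class ReplayableFakeSourceSketch {
    private final int rowNum;
    private final int failAtRow; // offset at which to fail once; a negative value disables it
    private boolean alreadyFailed = false;
    private int offset = 0;

    ReplayableFakeSourceSketch(int rowNum, int failAtRow) {
        this.rowNum = rowNum;
        this.failAtRow = failAtRow;
    }

    int snapshotState() {
        return offset;
    }

    void restoreState(int snapshot) {
        offset = snapshot;
    }

    boolean hasNext() {
        return offset < rowNum;
    }

    // Emits the next row id, or throws once when the configured offset is reached.
    int next() {
        if (!alreadyFailed && offset == failAtRow) {
            alreadyFailed = true;
            throw new IllegalStateException("injected failure at row " + offset);
        }
        return offset++;
    }

    // Driver: checkpoints every `interval` rows; on failure it discards
    // uncommitted rows and restores the source to the last snapshot.
    static List<Integer> runJob(int rowNum, int failAtRow, int interval) {
        ReplayableFakeSourceSketch source = new ReplayableFakeSourceSketch(rowNum, failAtRow);
        List<Integer> committed = new ArrayList<>();
        List<Integer> pending = new ArrayList<>();
        int snapshot = source.snapshotState();
        while (source.hasNext()) {
            try {
                pending.add(source.next());
                if (pending.size() == interval) {
                    committed.addAll(pending);
                    pending.clear();
                    snapshot = source.snapshotState();
                }
            } catch (IllegalStateException e) {
                pending.clear();
                source.restoreState(snapshot);
            }
        }
        committed.addAll(pending); // final commit when the bounded source is exhausted
        return committed;
    }

    public static void main(String[] args) {
        System.out.println(runJob(10, 7, 3));
    }
}
```

The failure fires only once, so the restored run can pass the same offset and finish; a real connector would likely need a similar guard so the job does not fail forever.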

How to check data

We can check the rows that were written to the sink.

Test Source Connector

Test JDBC Sink Connector

To test the data consistency of a source connector, we can add a Test JDBC Sink connector. It needs to support exactly-once.

How to simulate task failure and restore task.

We can add a function to the Test JDBC Sink connector that actively triggers a failure.

How to check data

There are two ways to do it.

First one: after the job xxxSource -> TestJDBCSink finishes, we can automatically create a second job, JDBCSource -> AssertSink, and use AssertSink to check the data.

Shortcoming: this way needs to run two jobs.

Advantage: this way standardizes the tests; people only need to configure the check rules in the AssertSink connector.
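To make "configure the check rules" a little more concrete, here is a small sketch of rule-based checking over the rows read back by the second job. It is not the AssertSink connector's real rule format or API; the `Rule` type and the three example rules (`rowCountEquals`, `fieldNotNull`, `fieldUnique`) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical rule-based checker; each rule inspects the full set of rows.
class AssertRulesSketch {
    static final class Rule {
        final String name;
        final Predicate<List<Map<String, Object>>> check;

        Rule(String name, Predicate<List<Map<String, Object>>> check) {
            this.name = name;
            this.check = check;
        }
    }

    static Rule rowCountEquals(int expected) {
        return new Rule("rowCountEquals(" + expected + ")", rows -> rows.size() == expected);
    }

    static Rule fieldNotNull(String field) {
        return new Rule("fieldNotNull(" + field + ")",
                rows -> rows.stream().allMatch(r -> r.get(field) != null));
    }

    // Fails when two rows share the same value in `field` (e.g. a duplicated primary key).
    static Rule fieldUnique(String field) {
        return new Rule("fieldUnique(" + field + ")", rows -> {
            Set<Object> seen = new HashSet<>();
            for (Map<String, Object> r : rows) {
                if (!seen.add(r.get(field))) {
                    return false;
                }
            }
            return true;
        });
    }

    // Returns the names of all rules that the rows violate.
    static List<String> violations(List<Map<String, Object>> rows, List<Rule> rules) {
        List<String> failed = new ArrayList<>();
        for (Rule rule : rules) {
            if (!rule.check.test(rows)) {
                failed.add(rule.name);
            }
        }
        return failed;
    }

    // Convenience builder for a two-column test row.
    static Map<String, Object> row(long id, String name) {
        Map<String, Object> r = new HashMap<>();
        r.put("id", id);
        r.put("name", name);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row(1L, "a"));
        rows.add(row(1L, "a")); // duplicate, as an at-least-once replay might produce
        List<Rule> rules = new ArrayList<>();
        rules.add(rowCountEquals(1));
        rules.add(fieldUnique("id"));
        System.out.println(violations(rows, rules));
    }
}
```

The point of the design is that each source connector's E2E test would only declare its list of rules, while the checking logic stays shared.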

The second one is to add a Java program that checks the data in MySQL/PG.

Shortcoming: this way cannot standardize the tests; every source connector E2E needs to add its own check program and define its own check rules.

Advantage: only one job needs to run.
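Such a check program could look roughly like the sketch below, which uses only the standard `java.sql` API. The class name, the table and key column arguments, and the choice of "row count and distinct key count both equal the expected count" as the consistency rule are assumptions for illustration; a real test would define its own rules.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical post-job checker for the table written by the Test JDBC Sink.
class JdbcRowCheckSketch {
    // Builds one query returning total rows and distinct keys.
    // Identifiers are concatenated directly, so they must be trusted test constants.
    static String countSql(String table, String keyColumn) {
        return "SELECT COUNT(*), COUNT(DISTINCT " + keyColumn + ") FROM " + table;
    }

    // No lost rows (total matches) and no duplicates (distinct keys match).
    static boolean isConsistent(long expectedRows, long totalRows, long distinctKeys) {
        return totalRows == expectedRows && distinctKeys == expectedRows;
    }

    // Runs the query on an already opened connection and applies the rule above.
    static boolean check(Connection conn, String table, String keyColumn, long expectedRows)
            throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(countSql(table, keyColumn))) {
            rs.next(); // an aggregate query without GROUP BY always returns one row
            return isConsistent(expectedRows, rs.getLong(1), rs.getLong(2));
        }
    }
}
```

In an E2E test the connection would typically come from `DriverManager.getConnection(url, user, password)` pointed at the MySQL or PG test container, and `check` would be called after the job finishes.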

Mailing list: https://lists.apache.org/thread/148v3w2tbz8byxwnwbk46mkgzoj600w5

Usage Scenario

No response

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

EricJoy2048, Sep 20 '22 03:09

@getChan @2013650523 @531651225 @laglangyue @legendtkl @leo65535 @lhyundeadsoul @hailin0 @Hisoka-X @ashulin @ic4y @TyrantLucifer @iture123 and everyone else who may be interested in this question: do you have any suggestions?

EricJoy2048, Sep 20 '22 03:09

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot], Oct 21 '22 00:10

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot], Nov 16 '22 00:11