tiflow icon indicating copy to clipboard operation
tiflow copied to clipboard

Always splitting update events if partition key changes

Open CharlesCheung96 opened this issue 8 months ago • 0 comments

Background

When using the index-value or columns dispatcher to distribute data across different Kafka partitions based on the key, multiple consumer processes in the downstream consumer group consume Kafka topic partitions independently. Due to different consumption progress, data inconsistency might occur. Take the following SQL as an example:

CREATE TABLE t (a INT PRIMARY KEY, b INT);

INSERT INTO t VALUES (1, 2);
UPDATE t SET a = 2 WHERE a = 1;

INSERT INTO t VALUES (1, 3);
UPDATE t SET a="3" WHERE a="1";

If the UPDATE event is not split (ref #11211), and use the index-value or columns dispatcher:

# index 
dispatchers = [
    {matcher = ['test.t'], partition = "index-value"},

]

# columns
dispatchers = [
    {matcher = ['test2.*'], partition = "columns", columns = ["a"]}
]

THEN, the preceding DML will be distributed to the following partitions (change events contain both new and old values):

Event Seq partition-1 partition-2 partition-3
1 INSERT a = 1, b = 2 UPDATE a = 2 WHERE a = 1; UPDATE a = 3 WHERE a = 1;
2 INSERT a = 1, b = 3    

Since partitions are consumed in parallel, different consumption sequences will correspond to different results, in which only one consumption sequence can guarantee the final consistency (p1-1, p2-1, p1-2, p3-1).

Solution

Note that although the correctness of the index dispatcher could be guaranteed by setting output-raw-change-event to false, the column dispatcher still relies on this solution to ensure the correctness of the distribution behavior.

Design

TiCDC should always split the UPDATE event, which changes the partition key, into DELETE and INSERT events. And the preceding DML events will be distributed to the following partitions:

Event Seq partition-1 partition-2 partition-3
1 INSERT a = 1, b = 2 INSERT a = 2, b = 2 INSERT a = 3, b = 3
2 DELETE a = 1    
3 INSERT a = 1, b = 3    
4 DELETE a = 1    

Compatibility

In order to be compatible with the historical version and to ensure that the user's behavior does not change after upgrading, we should add a parameter split-update-partition-key and set it to false by default.

CharlesCheung96 avatar Jun 12 '24 02:06 CharlesCheung96