[core] Add support of sequence field in first row

Open Aitozi opened this issue 1 year ago • 1 comments

Purpose

This PR adds support for setting sequence.field with the first-row merge engine. For example, a user may want to keep the first row under event-time semantics (otherwise, the result may be incorrect). In this mode, a UB (update-before) message is generated to retract the old value.

It also supports generating deletion vectors for first row when a sequence field is set.
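
To make the intended behavior concrete, here is a minimal sketch of "first row by sequence field" with retraction. It is not the code in this PR: the class name, the string-based changelog, and the in-memory map are all hypothetical, and it only assumes that a row with a smaller sequence value should replace the stored row and trigger an update-before / update-after pair.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Paimon.
class FirstRowBySequenceSketch {

    // A stored row: its sequence value (e.g. event time) and its payload.
    static final class Row {
        final long sequence;
        final String value;

        Row(long sequence, String value) {
            this.sequence = sequence;
            this.value = value;
        }
    }

    // Current "first row" per primary key.
    private final Map<String, Row> state = new HashMap<>();

    // Applies one incoming row and returns the changelog messages it produces.
    public List<String> write(String key, long sequence, String value) {
        List<String> changelog = new ArrayList<>();
        Row current = state.get(key);
        if (current == null) {
            // First time this key is seen: plain insert.
            state.put(key, new Row(sequence, value));
            changelog.add("+I " + key + "=" + value);
        } else if (sequence < current.sequence) {
            // An earlier row arrived late: retract the old value, emit the new one.
            changelog.add("-U " + key + "=" + current.value);
            changelog.add("+U " + key + "=" + value);
            state.put(key, new Row(sequence, value));
        }
        // Otherwise the stored row is still the first one; nothing is emitted.
        return changelog;
    }
}
```

In this sketch, a late row with a smaller sequence value is the only case that produces a retraction; rows with an equal or larger sequence are dropped silently, which is an assumption about tie handling rather than something stated in the PR.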

Tests

API and Format

Documentation

Aitozi avatar Apr 26 '24 03:04 Aitozi

CC @JingsongLi

Aitozi avatar May 06 '24 03:05 Aitozi

Hi @Aitozi , thanks for your contribution!

Semantically speaking, supporting a sequence field in FIRST-ROW makes good sense, and it matches what first row is meant to express.

However, there are two issues:

  1. Users will see a significant performance gap: streaming reads, streaming writes, and batch reads all behave quite differently, which can greatly affect tuning. From this perspective, it is better to have them explicitly choose a different approach.
  2. The code gains many additional conditional checks, and the existing assumptions about FIRST-ROW are broken one by one, which increases code complexity.

Instead, I recommend another implementation: introduce an option, sequence.field.reverse, that flips the comparison of sequence values. The advantage is that this is a relatively easy change. We can throw an exception to alert users who configure FIRST-ROW together with a sequence field.

If there are not many use cases, I still recommend this low-cost approach, but if you think there are many, we can tolerate these changes to FIRST-ROW.
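
As a rough illustration of the sequence.field.reverse idea proposed above, the sketch below flips the sequence comparison with a boolean flag. The option is only a proposal in this thread, so the class, the method names, and the tie-breaking rule (ties go to the incoming row) are assumptions made for the example, not Paimon's actual behavior.

```java
import java.util.Comparator;

// Hypothetical illustration only; not Paimon's implementation.
class SequenceReverseSketch {

    // Returns the comparator used to decide which sequence value "wins".
    // With reverse = false, larger sequences win; with reverse = true, smaller ones do.
    static Comparator<Long> sequenceComparator(boolean reverse) {
        Comparator<Long> natural = Comparator.naturalOrder();
        return reverse ? natural.reversed() : natural;
    }

    // Merges two sequence values for the same key and returns the winner.
    // The incoming value wins when it compares greater than or equal to the existing one.
    static long merge(long existing, long incoming, boolean reverse) {
        return sequenceComparator(reverse).compare(incoming, existing) >= 0
                ? incoming
                : existing;
    }
}
```

With the flag off, `merge(10, 5, false)` keeps the existing value 10; with the flag on, `merge(10, 5, true)` keeps 5, so the row with the smallest sequence survives without touching the FIRST-ROW code path.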

JingsongLi avatar May 07 '24 11:05 JingsongLi

Hi @JingsongLi , thanks for your valuable input. sequence.field.reverse looks good to me; I will try this solution.

Aitozi avatar May 08 '24 01:05 Aitozi