[Flink] Update using partial columns and support aggregation merge-engine for updates.

Open hzjhjjyy opened this issue 9 months ago • 5 comments

Purpose

FLINK-32001 has been resolved, so an update can now read only a subset of the columns.

The columns an update needs depend on the merge engine (this PR also adds update support for the aggregation engine):

  1. deduplicate: all columns are required.
  2. partial-update: besides the updated columns, the update must also include the primary keys, partition keys, sequence field, columns aggregated with last_value, and the sequence-group fields of the updated columns.
  3. aggregation: the same as partial-update, except that the sequence-group fields of the updated columns are not needed.
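
The rules above can be sketched as a small column-selection function. This is only an illustration of the list, not the PR's implementation: `TableInfo`, `columns_for_update`, and every field name are hypothetical, and the reading of "sequence-group for update columns" as "the sequence-group field that guards an updated column" is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableInfo:
    """Hypothetical, simplified table description (not a Paimon class)."""
    all_columns: List[str]
    primary_keys: List[str]
    partition_keys: List[str]
    sequence_fields: List[str] = field(default_factory=list)
    # Columns whose aggregate function is last_value.
    last_value_columns: List[str] = field(default_factory=list)
    # Assumed shape: sequence-group field -> columns it guards.
    sequence_groups: Dict[str, List[str]] = field(default_factory=dict)


def columns_for_update(engine: str, table: TableInfo,
                       update_columns: List[str]) -> List[str]:
    """Return the columns an update must read, in table order."""
    if engine == "deduplicate":
        # Rule 1: only full columns can be used.
        return list(table.all_columns)
    if engine not in ("partial-update", "aggregation"):
        raise ValueError(f"unsupported merge engine: {engine}")

    # Rules 2 and 3: updated columns plus keys, sequence field,
    # and last_value columns.
    needed = set(update_columns)
    needed.update(table.primary_keys, table.partition_keys,
                  table.sequence_fields, table.last_value_columns)

    if engine == "partial-update":
        # Rule 2 only: add the sequence-group field of each updated column.
        for group_field, guarded in table.sequence_groups.items():
            if set(update_columns) & set(guarded):
                needed.add(group_field)

    return [c for c in table.all_columns if c in needed]


table = TableInfo(
    all_columns=["id", "dt", "a", "b", "c", "g1", "seq"],
    primary_keys=["id"],
    partition_keys=["dt"],
    sequence_fields=["seq"],
    last_value_columns=["c"],
    sequence_groups={"g1": ["a", "b"]},
)
print(columns_for_update("partial-update", table, ["a"]))
print(columns_for_update("aggregation", table, ["a"]))
```

In this made-up table, updating only `a` under partial-update also pulls in `g1`, while aggregation does not; column `b` is skipped in both cases.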

Tests

BatchUpdateWithPartialColumnsITCase (covers all aggregation functions, sequence field/group, and changelog)

API and Format

Documentation

hzjhjjyy avatar May 06 '24 10:05 hzjhjjyy

Hi @hzjhjjyy, does this come from a user requirement? I understand the optimization, but I'm hesitant to move forward with it.

JingsongLi avatar May 11 '24 05:05 JingsongLi

I think it improves both the efficiency and the scope of updates. Of course, I'm happy to handle this PR however you advise.

hzjhjjyy avatar May 11 '24 06:05 hzjhjjyy

Hi @hzjhjjyy, regarding your inputs:

  1. efficiency of update: The two biggest costs of an update are, first, locating the data in the files and, second, rewriting the files or using MOR technology. So the benefit of this optimization for updates is not significant.
  2. scope of update: I understand this can support FieldLastValueAgg, but the default is FieldLastNonNullValueAgg.

Considering these two points, the changes this PR makes to the current topology are not really worthwhile.

JingsongLi avatar May 14 '24 01:05 JingsongLi

Hi @JingsongLi. Here is my understanding of this PR:

  1. For Paimon, this PR indeed does not optimize how partial-update is computed. It only reduces the fields that are read and transmitted when Flink rewrites an UPDATE statement into a SELECT. This still helps for large tables, since an update typically touches only a few fields at a time.
  2. Currently, updates are only supported for the deduplicate and partial-update engines. Because partial-update and aggregation use similar aggregate functions, this PR adds update support for aggregation as well (otherwise even full-column updates would not be supported for it). In my description above, I called out last_value only because it is a special case that needs separate handling in this PR. last_non_null_value needs no special handling.
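
One plausible reading of why last_value is the special case can be sketched as follows. It assumes that a column left out of the rewritten SELECT reaches the merge as null; that assumption is not stated in this thread, and the `merge` helper and both aggregator functions below are hypothetical stand-ins, not Paimon's FieldLastValueAgg or FieldLastNonNullValueAgg.

```python
from typing import Callable, Dict, Optional

Value = Optional[int]
Aggregator = Callable[[Value, Value], Value]


def last_value(current: Value, incoming: Value) -> Value:
    """Always keep the newest value, even when it is null."""
    return incoming


def last_non_null_value(current: Value, incoming: Value) -> Value:
    """Keep the newest value unless it is null."""
    return current if incoming is None else incoming


def merge(stored: Dict[str, Value], incoming: Dict[str, Value],
          aggregators: Dict[str, Aggregator]) -> Dict[str, Value]:
    """Merge an incoming row into the stored row, column by column."""
    return {col: aggregators[col](stored[col], incoming.get(col))
            for col in stored}


stored = {"a": 10, "b": 20}
# The update sets only `a`; `b` was not read back, so (by assumption)
# it arrives as null.
update_row = {"a": 11, "b": None}

with_last_value = merge(
    stored, update_row, {"a": last_value, "b": last_value})
with_last_non_null = merge(
    stored, update_row, {"a": last_value, "b": last_non_null_value})

print(with_last_value)
print(with_last_non_null)
```

Under that assumption, a last_value column that is not read back would be overwritten with null, which would explain why such columns have to be added to the selected columns, while last_non_null_value columns keep their stored value.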

Overall, this PR aims to support updates with partial columns in Flink. Does this explanation address your concerns?

hzjhjjyy avatar May 14 '24 02:05 hzjhjjyy

Yes, I got your point. My question is simply whether the modification is worth it. We can wait for these requirements to emerge.

JingsongLi avatar May 14 '24 04:05 JingsongLi

Closing this for now to wait for more requirements.

JingsongLi avatar Aug 12 '24 02:08 JingsongLi