[Feature] When the MergeEngine of the dimension table is aggregation / partial-update, there is no need to forcibly enable changelog-producer.

Open liming30 opened this issue 1 year ago • 4 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Motivation

Currently, two built-in compaction strategies are provided, but they are mainly selected automatically based on MergeEngine / ChangelogProducer / DeletionVectors. When a dimension table uses the aggregation or partial-update MergeEngine, we have to set CHANGELOG_PRODUCER to lookup. But when the lookup goes through PrimaryKeyPartialLookupTable, the changelog is not read, so generating it is useless.

Therefore, I hope to add a compact-strategy configuration, so that CHANGELOG_PRODUCER can be enabled only when necessary.

At the same time, for FullCacheLookupTable, even if the user does not enable CHANGELOG_PRODUCER, we can also obtain the changelog through IncrementalDiffSplitRead.

Solution

I would like to do it in two parts:

  1. Add the compact-strategy configuration, so that tables using other types of MergeEngine can be compacted quickly without writing a changelog.

  2. Adjust the refresh strategy of FullCacheLookupTable to support streaming updates for other types of MergeEngine without enabling CHANGELOG_PRODUCER.
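
To make part 1 concrete, here is a rough sketch of the options before and after. It models table options as plain strings rather than using Paimon's own configuration classes. `merge-engine` and `changelog-producer` are the option keys discussed above; `compact-strategy` is only the key proposed in this issue, and the value `lookup` passed to it in `main` is a placeholder assumption, not a settled name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: options are modelled as plain strings, not Paimon's Options API.
final class DimTableOptionsSketch {

    /** Today: an aggregation / partial-update dim table is paired with changelog-producer = lookup. */
    static Map<String, String> currentOptions(String mergeEngine) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("merge-engine", mergeEngine);
        options.put("changelog-producer", "lookup");
        return options;
    }

    /** Proposed: choose the compaction strategy directly and leave the changelog off. */
    static Map<String, String> proposedOptions(String mergeEngine, String compactStrategy) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("merge-engine", mergeEngine);
        options.put("changelog-producer", "none");
        // "compact-strategy" is the key proposed in this issue (hypothetical until implemented).
        options.put("compact-strategy", compactStrategy);
        return options;
    }

    public static void main(String[] args) {
        System.out.println(currentOptions("partial-update"));
        // The strategy value "lookup" is a placeholder assumption.
        System.out.println(proposedOptions("partial-update", "lookup"));
    }
}
```

The point of the second method is that the compaction behaviour and the changelog become two independent choices instead of one implying the other.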

Anything else?

No response

Are you willing to submit a PR?

  • [X] I'm willing to submit a PR!

liming30 avatar Aug 01 '24 09:08 liming30

What case do you want to solve? Lookup Join for partial-update table without changelog-producer? If this is your requirement, can we just modify Flink LookupJoin Function?

JingsongLi avatar Aug 06 '24 09:08 JingsongLi

@JingsongLi the dim table usually has no streaming consumption jobs, so generating a changelog is useless. In most cases, we will ensure that the primary key of the dim table is the same as the key of the lookup-join, so we will use PrimaryKeyPartialLookupTable to perform the lookup.

By adding a prefer-compaction-strategy, I think most of the code can be reused. If we instead perform the merge operation in the LookupJoin Function, every job that looks up the table will pay the merge overhead.

liming30 avatar Aug 06 '24 12:08 liming30

Hi @liming30, so what you want to support is just partial-update and aggregation tables with lookup but without changelog?

Why? Most of the cost is in the lookup; the cost of the changelog is not that high.

JingsongLi avatar Sep 03 '24 07:09 JingsongLi

This issue follows #3905. Dim tables do not require streaming consumption in most cases, so there is no need to generate changelog files; skipping them reduces write IO.

When the key of the lookup join is exactly the same as the primary key of the table, we can use PrimaryKeyPartialLookupTable without reading the changelog file. When the key of the lookup join differs from the primary key of the table, we can use FullCacheLookupTable and refresh it from the diff generated by compaction, so I hope to relax the restrictions on FullCacheLookupTable for dim tables.
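
Roughly, the decision I have in mind looks like the sketch below. This is a simplified model, not the actual Paimon code: the class, enums and method names are made up for illustration, and comparing the keys as sets (ignoring order) is my assumption about what "exactly the same" should mean.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical model of the choice described above; not Paimon's implementation.
final class DimLookupSketch {

    enum LookupTable { PRIMARY_KEY_PARTIAL, FULL_CACHE }

    enum RefreshSource { DATA_FILES_ONLY, CHANGELOG_FILES, COMPACTION_DIFF }

    /** Join keys equal to the primary key allow a point lookup; anything else needs the full cache. */
    static LookupTable choose(List<String> primaryKeys, List<String> joinKeys) {
        // Assumption: key order does not matter, so compare as sets.
        boolean sameKeys = new HashSet<>(primaryKeys).equals(new HashSet<>(joinKeys));
        return sameKeys ? LookupTable.PRIMARY_KEY_PARTIAL : LookupTable.FULL_CACHE;
    }

    /** Where updates would come from once the FullCacheLookupTable restriction is relaxed. */
    static RefreshSource refreshSource(LookupTable table, boolean changelogProducerEnabled) {
        if (table == LookupTable.PRIMARY_KEY_PARTIAL) {
            // The partial lookup never reads changelog files.
            return RefreshSource.DATA_FILES_ONLY;
        }
        // Proposed: fall back to the diff generated by compaction when no changelog is written.
        return changelogProducerEnabled ? RefreshSource.CHANGELOG_FILES : RefreshSource.COMPACTION_DIFF;
    }

    public static void main(String[] args) {
        List<String> pk = Arrays.asList("user_id");
        LookupTable byPk = choose(pk, Arrays.asList("user_id"));
        LookupTable byOther = choose(pk, Arrays.asList("city"));
        System.out.println(byPk + " / " + refreshSource(byPk, false));
        System.out.println(byOther + " / " + refreshSource(byOther, false));
    }
}
```

In both branches the dim table works with changelog-producer disabled, which is the case this issue is about.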

liming30 avatar Sep 03 '24 08:09 liming30