[Feature] Introduce key-value cache for Paimon lookup operator in Flink

Open FangYongs opened this issue 1 year ago • 1 comments

Search before asking

[X] I searched in the issues and found nothing similar.

Motivation

When we use Paimon as the source for foreign key joins, it is usually necessary to look up the source table.

For example, consider two tables:

Table A(a, b, c, c1, c2, c3, c4, c5), where a is the primary key
Table B(c, d, e, e1, e2, e3, e4, e5), where c is the primary key

Now we need to perform A JOIN B ON A.c = B.c and output the result (a, b, c, d, e, c1, c2, c3, c4, c5, e1, e2, e3, e4, e5).

In Flink, we can convert the foreign key join on A.c = B.c into a primary key join. We first join A(a, c) with B(c) to obtain the matching (a, c) pairs, then look up A by a and B by c for each pair, and finally output the joined rows. However, the Paimon dimension table loads incremental data with a delay (10 seconds by default), so a lookup for an (a, c) pair may not yet find the corresponding rows in A and B, which produces incorrect output.
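
The timing problem can be illustrated with a toy model. This is a minimal sketch in plain Python, not Paimon or Flink code: DelayedTable, write, refresh, and lookup are hypothetical names, and refresh() stands in for the periodic incremental load of the dimension table.

```python
class DelayedTable:
    """Toy dimension table: writes become visible to lookups only after refresh()."""

    def __init__(self):
        self._pending = {}   # rows written but not yet loaded by the lookup side
        self._visible = {}   # rows the lookup side has already loaded

    def write(self, key, row):
        self._pending[key] = row

    def refresh(self):
        # Hypothetical stand-in for the periodic incremental load.
        self._visible.update(self._pending)
        self._pending.clear()

    def lookup(self, key):
        return self._visible.get(key)


table_a = DelayedTable()
table_b = DelayedTable()
table_a.write(1, {"a": 1, "b": "b1", "c": 10})
table_b.write(10, {"c": 10, "d": "d1", "e": "e1"})

# Assume the key-only join has already emitted the pair (a=1, c=10).
pair = (1, 10)

# Lookups issued before the next load miss both rows.
before = (table_a.lookup(pair[0]), table_b.lookup(pair[1]))

table_a.refresh()
table_b.refresh()

# The same lookups succeed once the incremental data has been loaded.
after = (table_a.lookup(pair[0]), table_b.lookup(pair[1]))
```

In this model the pair (1, 10) reaches the lookup step before either row is visible, which corresponds to the incorrect output described above.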

To solve this issue, I'd like to introduce a key-value cache in Paimon for the lookup operator. When data is written to Paimon, it can also be written to the key-value cache before the snapshot is created. When a downstream operator reads data from Paimon, it can then always look up the correct data in the key-value cache.
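
A rough sketch of the idea, in plain Python rather than Paimon code. CachedTable, write, commit_snapshot, and lookup are hypothetical names chosen for illustration, and cache eviction and failure recovery are deliberately left out.

```python
class CachedTable:
    """Toy table whose writes reach a key-value cache before the snapshot exists."""

    def __init__(self):
        self._cache = {}        # key-value cache, updated on every write
        self._uncommitted = {}  # rows waiting for the next snapshot
        self._snapshot = {}     # rows visible through committed snapshots

    def write(self, key, row):
        # Write to the cache first, so lookups see the row immediately.
        self._cache[key] = row
        self._uncommitted[key] = row

    def commit_snapshot(self):
        # Hypothetical stand-in for snapshot creation.
        self._snapshot.update(self._uncommitted)
        self._uncommitted.clear()

    def lookup(self, key):
        # Prefer the cache and fall back to committed snapshot data.
        if key in self._cache:
            return self._cache[key]
        return self._snapshot.get(key)


table_b = CachedTable()
table_b.write(10, {"c": 10, "d": "d1", "e": "e1"})

# The row is found although no snapshot has been committed yet.
row_before_commit = table_b.lookup(10)

table_b.commit_snapshot()
row_after_commit = table_b.lookup(10)
```

With this shape, a lookup issued right after the write no longer depends on the snapshot or on the incremental load delay. How long entries should stay in the cache is not specified in this proposal.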

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

[ ] I'm willing to submit a PR!

May 29 '24 12:05 FangYongs

Any progress here? This is a very useful feature, and I'm looking forward to its release.

Aug 17 '24 17:08 ArthurSXL8