[Feature] Introduce key-value cache for Paimon lookup operator in Flink
Search before asking
- [X] I searched in the issues and found nothing similar.
Motivation
When we use Paimon as the source for outer key (foreign key) joins, we usually need to look up the source table.
For example, there are two tables:
- Table A(a, b, c, c1, c2, c3, c4, c5), where a is the primary key
- Table B(c, d, e, e1, e2, e3, e4, e5), where c is the primary key
Now we need to perform A JOIN B on A.c = B.c and output the result (a, b, c, d, e, c1, c2, c3, c4, c5, e1, e2, e3, e4, e5).
In Flink, we can convert the outer key join into a primary key join. We first join A(a, c) with B(c) to obtain the matching (a, c) key pairs, then look up A by a and B by c for each pair, and finally output the full result rows.
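The two-phase approach above can be sketched in plain Java. This is only an illustration under simplifying assumptions: in-memory maps stand in for tables A and B, the class and method names are hypothetical, and nothing here is Flink or Paimon API. Each A row is stored as its primary key a mapped to [b, c, c1], each B row as its primary key c mapped to [d, e], and the output columns are simply concatenated rather than reordered as in the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not Flink or Paimon code.
class KeyJoinThenLookup {

    /**
     * tableA: primary key a -> [b, c, c1] (the join column c is at index 1).
     * tableB: primary key c -> [d, e].
     * Returns rows of the form [a, b, c, c1, d, e].
     */
    static List<List<String>> join(Map<String, List<String>> tableA,
                                   Map<String, List<String>> tableB) {
        // Phase 1: join on key columns only, producing (a, c) pairs.
        List<String[]> keyPairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : tableA.entrySet()) {
            String c = entry.getValue().get(1);
            if (tableB.containsKey(c)) {
                keyPairs.add(new String[] {entry.getKey(), c});
            }
        }

        // Phase 2: look up the full rows of A and B by primary key.
        List<List<String>> result = new ArrayList<>();
        for (String[] pair : keyPairs) {
            List<String> rowA = tableA.get(pair[0]);
            List<String> rowB = tableB.get(pair[1]);
            if (rowA == null || rowB == null) {
                // A lookup that cannot see the row yet silently drops the match.
                continue;
            }
            List<String> out = new ArrayList<>();
            out.add(pair[0]);
            out.addAll(rowA);
            out.addAll(rowB);
            result.add(out);
        }
        return result;
    }
}
```

In this sketch both phases read the same maps, so phase 2 always finds the rows. In the real pipeline the lookups in phase 2 read a separately loaded view of the table, which is where a stale view can cause a miss.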
During this process, because incremental data in the Paimon dimension table is loaded with a delay (10 seconds by default), the lookups on A and B for a given (a, c) pair may not find the corresponding rows in time, resulting in incorrect output.
To solve this issue, I'd like to introduce a key-value cache in Paimon for the lookup operator. When data is written to Paimon, it can also be written to the key-value cache before the snapshot is created. Then, when a downstream operator gets data from Paimon, it can always look up the correct data from the key-value cache.
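A minimal sketch of the idea, assuming a write path that updates the cache before the snapshot exists and a lookup path that consults the cache first. All names here are hypothetical and the three maps are in-memory stand-ins. This is not Paimon's API, and it ignores eviction, concurrency, and failure recovery.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only: not Paimon code.
class WriteThroughLookupSketch {
    // Rows visible to snapshot-based readers (only after a commit).
    private final Map<String, String> snapshot = new HashMap<>();
    // Rows written but not yet included in a snapshot.
    private final Map<String, String> pending = new HashMap<>();
    // The proposed key-value cache, updated at write time.
    private final Map<String, String> cache = new HashMap<>();

    /** Writes a row: the cache sees it immediately, the snapshot does not. */
    void write(String key, String value) {
        cache.put(key, value);
        pending.put(key, value);
    }

    /** Simulates snapshot creation: pending rows become visible to readers. */
    void commitSnapshot() {
        snapshot.putAll(pending);
        pending.clear();
    }

    /** Lookup as it behaves without the cache: only committed rows are found. */
    Optional<String> lookupSnapshotOnly(String key) {
        return Optional.ofNullable(snapshot.get(key));
    }

    /** Lookup with the cache: check the cache first, then fall back to the snapshot. */
    Optional<String> lookup(String key) {
        String cached = cache.get(key);
        return cached != null ? Optional.of(cached) : lookupSnapshotOnly(key);
    }
}
```

After `write("k1", "v1")` and before `commitSnapshot()`, `lookupSnapshotOnly("k1")` is empty while `lookup("k1")` already returns the row, which is the behavior the downstream lookup operator needs.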
Solution
No response
Anything else?
No response
Are you willing to submit a PR?
- [ ] I'm willing to submit a PR!
Any progress here? This is a very useful feature, and I'm looking forward to its release.