[Feature] Added sharded reading

Open scottxing opened this issue 1 year ago • 0 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Motivation

When a large amount of data passes through Paimon CDC, about 100 million records land in a Paimon ODS table whose changelog attribute is set to input. I then run a Flink SQL job (with the consumer-id setting) that reads this table and inserts the data into another Paimon DWD table (whose changelog attribute is lookup). After this job starts, the checkpoint stays stuck at 0% and never completes, so no snapshot can be committed. As a result, my other Flink SQL job cannot read any data from the Paimon DWD table. In effect, a large amount of data in one Paimon table must be completely written to the next Paimon table before it can be passed on to the table after that. Data cannot flow smoothly from job to job like a stream.

Solution

Add sharded reading. For large Paimon tables, the reading job would split the table into shards, similar to Flink CDC. Once one shard has been read, the job moves on to the next, so that checkpoints can complete between shards. This would let data flow smoothly between Paimon tables.
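To make the idea concrete, here is a minimal sketch in plain Python. It does not use any Paimon or Flink API. The names `plan_shards`, `ShardedReader`, `next_shard`, and `checkpoint` are hypothetical and exist only for this illustration. It assumes the table can be addressed as a flat range of records and that the only checkpointed state is the number of shards already completed.

```python
from typing import List, Optional, Tuple


def plan_shards(total_records: int, shard_size: int) -> List[Tuple[int, int]]:
    """Split [0, total_records) into half-open ranges of at most shard_size."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [
        (start, min(start + shard_size, total_records))
        for start in range(0, total_records, shard_size)
    ]


class ShardedReader:
    """Hypothetical reader: hands out one shard at a time (not a Paimon API)."""

    def __init__(self, records: List[int], shard_size: int,
                 completed_shards: int = 0) -> None:
        self.records = records
        self.shards = plan_shards(len(records), shard_size)
        # Assumed checkpoint state: how many shards are fully processed.
        self.completed_shards = completed_shards

    def next_shard(self) -> Optional[List[int]]:
        """Return the next unprocessed shard, or None when all are done."""
        if self.completed_shards >= len(self.shards):
            return None
        start, end = self.shards[self.completed_shards]
        return self.records[start:end]

    def checkpoint(self) -> int:
        """Mark the current shard as done and return the state to persist."""
        self.completed_shards += 1
        return self.completed_shards


def copy_table(reader: ShardedReader, sink: List[int]) -> List[int]:
    """Write shard by shard, checkpointing after each shard is written."""
    saved_states = []
    while (shard := reader.next_shard()) is not None:
        sink.extend(shard)                        # stand-in for the downstream write
        saved_states.append(reader.checkpoint())  # checkpoint between shards
    return saved_states
```

In this sketch a checkpoint can complete after every shard instead of only after the whole table has been read, and a restarted job can pass the last saved value as `completed_shards` to resume from the next shard. How a real implementation would choose shard boundaries for a Paimon table is left open here.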

No response

Anything else?

No response

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

scottxing, Apr 30 '24 01:04