[core] Introduce the BucketFunction interface

Open Aitozi opened this issue 9 months ago • 9 comments

Purpose

Linked issue: close #5444

Tests

API and Format

Documentation

Aitozi avatar Apr 11 '25 01:04 Aitozi

CC @JingsongLi please take a look when you are free.

BTW, I'm confused: the CI compile error does not occur in my local environment.

Aitozi avatar Apr 11 '25 07:04 Aitozi

This seems to be a bug in Maven 3.9.9. I added the paimon-common dependency to work around it.

Aitozi avatar Apr 12 '25 07:04 Aitozi

This PR is also useful for me.

Pandas886 avatar Apr 22 '25 02:04 Pandas886

@Pandas886 May I ask what your use case is?

Aitozi avatar Apr 22 '25 03:04 Aitozi

We are currently integrating Paimon into our internal data pipeline tool. To support multiple parallel writers on fixed bucket tables, the data has to be shuffled by bucket key so that each writer only writes data for its own buckets. However, by the time the data reaches the writers, the transform phase of the pipeline has already converted it to an internal format, which makes it impossible to call the Paimon SDK to retrieve the bucket key.
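
To illustrate the shuffle described above, here is a minimal sketch, assuming the pipeline can be handed a bucket function that works directly on its internal record format. `BucketShuffleSketch`, `writerOf`, and the `bucket % parallelism` routing rule are hypothetical and chosen only for illustration; they are not Paimon APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

final class BucketShuffleSketch {

    // Hypothetical routing rule: every bucket is owned by exactly one writer.
    static int writerOf(int bucket, int parallelism) {
        return bucket % parallelism;
    }

    // Groups records by the writer that owns their bucket.
    // bucketOf is supplied by the caller, so it can operate on the
    // pipeline's internal format without converting back to Paimon rows.
    static <T> Map<Integer, List<T>> shuffle(
            List<T> records, ToIntFunction<T> bucketOf, int parallelism) {
        Map<Integer, List<T>> perWriter = new TreeMap<>();
        for (T record : records) {
            int writer = writerOf(bucketOf.applyAsInt(record), parallelism);
            perWriter.computeIfAbsent(writer, w -> new ArrayList<>()).add(record);
        }
        return perWriter;
    }

    public static void main(String[] args) {
        // Toy bucket function (an assumption): "a" -> 0, "b" -> 1, "c" -> 2.
        ToIntFunction<String> toyBucket = key -> key.charAt(0) - 'a';
        System.out.println(shuffle(Arrays.asList("a", "b", "c"), toyBucket, 2));
    }
}
```

Because the writer is derived from the bucket alone, all records of one bucket land on the same writer, which is the property the fixed bucket write path relies on here.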

Pandas886 avatar Apr 22 '25 06:04 Pandas886

Resolved the conflicts. Please take another look. @JingsongLi

cc @Zouxxyy @YannByron This PR also adds a new parameter (hashType) to the Spark bucket function.

Aitozi avatar Apr 23 '25 02:04 Aitozi

I prefer to provide a BUCKET FUNCTION instead of a HASH FUNCTION. Currently, a bucket is computed as: Math.abs(hashcode % numBuckets). I'm not sure if it's universal enough, but BucketFunction is definitely universal enough.

Thanks, +1 for your suggestion.
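
To make the distinction concrete, here is a minimal sketch of what a pluggable bucket function could look like. The interface shape, the `MOD` variant, and the use of Java's `hashCode` as the hash are assumptions for illustration only; the actual interface in this PR and Paimon's real row hash may differ.

```java
import java.util.Objects;

// Hypothetical shape, named after the PR title; not the actual Paimon signature.
interface BucketFunction {
    // Maps a bucket key to a bucket id in [0, numBuckets).
    int bucket(Object bucketKey, int numBuckets);
}

final class BucketFunctionSketch {

    // The formula quoted above: Math.abs(hashcode % numBuckets).
    // Taking the remainder first keeps the value strictly inside
    // (-numBuckets, numBuckets), so Math.abs cannot overflow.
    // Objects.hashCode is only a stand-in for the real hash.
    static final BucketFunction DEFAULT =
            (key, numBuckets) -> Math.abs(Objects.hashCode(key) % numBuckets);

    // A hypothetical alternative that no hash function can express:
    // plain modulo on a numeric key, with no hashing at all.
    static final BucketFunction MOD =
            (key, numBuckets) ->
                    (int) Math.floorMod(((Number) key).longValue(), (long) numBuckets);

    public static void main(String[] args) {
        System.out.println(DEFAULT.bucket(-7, 4)); // abs(-7 % 4) = abs(-3) = 3
        System.out.println(MOD.bucket(-7L, 4));    // floorMod(-7, 4) = 1
    }
}
```

A hash-only extension point would fix the `Math.abs(... % numBuckets)` step for every implementation, whereas a bucket function lets an implementation replace the whole mapping.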

Aitozi avatar Apr 23 '25 13:04 Aitozi

cc @Zouxxyy to take a look at the Spark part.

JingsongLi avatar Apr 24 '25 06:04 JingsongLi

Maybe you should change PaimonScan too?

@JingsongLi Yes, I think we should distinguish between the different bucket functions, so we have to introduce a new bucket transformer for this. For now, I have disabled the bucket scan for the other bucket function types.

Aitozi avatar Apr 27 '25 14:04 Aitozi

Please take another look. CC @JingsongLi @Zouxxyy

Aitozi avatar Jun 10 '25 00:06 Aitozi

@luoyuxia can you also help take a look?

Aitozi avatar Jun 10 '25 03:06 Aitozi

@Zouxxyy I have addressed your comments, please take a look again.

Aitozi avatar Jun 17 '25 16:06 Aitozi