seatunnel [Feature][Discuss] Uneven data distribution.

Search before asking

[X] I had searched in the feature and found no similar feature requirement.

Description

In practice, data is often skewed. For example, take 100 rows whose partition key falls in the range [1, 10000]: 99 rows have keys in [1, 100], and 1 row has the key 10000.

Row sizes can be skewed as well: out of 100 rows, 99 rows have a total size of 1 MB, while a single row has a size of 10 MB.
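To make the key-range case concrete, here is a minimal sketch of equal-width range splitting applied to the distribution described above. The function names and the choice of 10 splits are assumptions for illustration only; this is not SeaTunnel's actual split logic.

```python
def equal_width_splits(lower, upper, num_splits):
    """Cut the inclusive key range [lower, upper] into equal-width sub-ranges."""
    width = -(-(upper - lower + 1) // num_splits)  # ceiling division
    return [(lo, min(lo + width - 1, upper))
            for lo in range(lower, upper + 1, width)]


def rows_per_split(keys, splits):
    """Count how many keys fall into each (lo, hi) sub-range."""
    return [sum(lo <= k <= hi for k in keys) for lo, hi in splits]


# 99 keys clustered at the low end plus one outlier at 10000 (100 rows total).
keys = list(range(1, 100)) + [10000]
splits = equal_width_splits(1, 10000, 10)
counts = rows_per_split(keys, splits)
print(counts)
```

With these inputs the first split receives 99 rows, the last split receives 1 row, and the eight splits in between are empty, so almost all of the work lands on one reader.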

Can we resolve this problem?

Usage Scenario

No response

Related issues

No response

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Sep 23 '22 08:09 mosence

It seems that we need a more customizable sharding algorithm that takes both the size and the quantity of data into account. On top of that, we can provide a unified interface across sources, to be implemented by each Connector that can support this sharding algorithm.
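One possible shape for such a unified interface, sketched under assumptions: `SplitStrategy`, `SizeBalancedStrategy`, and the `(key, size_in_bytes)` row descriptor are hypothetical names made up for this example and do not exist in SeaTunnel. The size-aware strategy shown is a simple greedy heuristic (largest rows first, each into the currently lightest split), chosen only because it is short.

```python
from typing import List, Protocol, Sequence, Tuple

Row = Tuple[object, int]  # hypothetical descriptor: (key, size in bytes)


class SplitStrategy(Protocol):
    """Hypothetical unified interface a connector could implement."""

    def assign(self, rows: Sequence[Row], num_splits: int) -> List[List[Row]]:
        ...


class SizeBalancedStrategy:
    """Greedy heuristic: place the largest rows first, each into the lightest split."""

    def assign(self, rows: Sequence[Row], num_splits: int) -> List[List[Row]]:
        splits: List[List[Row]] = [[] for _ in range(num_splits)]
        loads = [0] * num_splits  # bytes assigned to each split so far
        for row in sorted(rows, key=lambda r: r[1], reverse=True):
            lightest = loads.index(min(loads))
            splits[lightest].append(row)
            loads[lightest] += row[1]
        return splits


# 99 small rows (about 1 MB in total) plus one 10 MB row, as in the description.
rows: List[Row] = [(i, 10_000) for i in range(99)] + [(99, 10_000_000)]
result = SizeBalancedStrategy().assign(rows, num_splits=4)
print([len(s) for s in result])
print([sum(size for _, size in s) for s in result])
```

The large row ends up alone in one split and the small rows are spread over the remaining three. A single oversized row still cannot be divided, so byte counts stay uneven; the heuristic only avoids stacking more work on top of it.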

Sep 26 '22 06:09 Hisoka-X

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

Oct 27 '22 00:10 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

Nov 07 '22 00:11 github-actions[bot]

This is a good issue. Temporarily assigning it to myself.

Nov 28 '22 13:11 liugddx

https://shardingsphere.apache.org/document/current/en/dev-manual/sharding/

Nov 29 '22 14:11 liugddx

Regarding the JDBC data source, many primary keys may be non-numeric values such as UUIDs, so we can't restrict ourselves to numeric types. Hash mode is a better sharding method; of course, we can provide more sharding methods as well.
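A minimal sketch of hash-mode sharding on UUID keys, computed on the client side. The function name `split_for_key` and the use of CRC32 are assumptions for illustration; any stable hash would do. Python's built-in `hash()` is deliberately avoided because it is salted per process for strings, so different workers would disagree on the assignment.

```python
import uuid
import zlib


def split_for_key(key: str, num_splits: int) -> int:
    """Map a string key to a split index using a stable (unsalted) hash."""
    return zlib.crc32(key.encode("utf-8")) % num_splits


num_splits = 4
keys = [str(uuid.uuid4()) for _ in range(10_000)]

counts = [0] * num_splits
for key in keys:
    counts[split_for_key(key, num_splits)] += 1

print(counts)  # expected to be roughly even; exact numbers vary per run
```

Because the assignment depends only on the key's bytes, it works for any key type that can be serialized to a string, not just numeric ones.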

Nov 29 '22 14:11 liugddx

@ic4y @Hisoka-X Do you have any good suggestions?

Nov 29 '22 14:11 liugddx

Hash may have some problems on a single database: the rows read by each split may be non-contiguous, so performance is not good. How to express it in SQL is also a problem.

Nov 30 '22 02:11 Hisoka-X

The SQL needs to be written in each database's own dialect, and it needs to be tested and optimized for performance. For example:

select * from test where id mod 2=1
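As a rough sketch of what "in the dialect" could mean for numeric keys, the snippet below generates one modulo predicate per split. The template table and the `split_queries` helper are hypothetical, and the sketch assumes non-negative integer keys (on several databases the remainder of a negative number is negative, so such rows would match no split) and trusted table and column names (no identifier escaping is done).

```python
# Hypothetical per-dialect templates for "key modulo n equals k".
MOD_TEMPLATES = {
    "mysql": "MOD({col}, {n}) = {k}",
    "postgresql": "MOD({col}, {n}) = {k}",
    "oracle": "MOD({col}, {n}) = {k}",
    "sqlserver": "{col} % {n} = {k}",  # T-SQL uses the % operator
}


def split_queries(table: str, col: str, num_splits: int, dialect: str) -> list:
    """Build one SELECT per split; together the predicates cover 0..num_splits-1."""
    template = MOD_TEMPLATES[dialect]
    return [
        f"SELECT * FROM {table} WHERE " + template.format(col=col, n=num_splits, k=k)
        for k in range(num_splits)
    ]


for query in split_queries("test", "id", 2, "mysql"):
    print(query)
```

String keys such as UUIDs would additionally need a database-side hash function, and those differ between databases even more than the modulo syntax does.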

Nov 30 '22 05:11 liugddx

It seems like every split will scan all of the data.
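This concern can be checked on a small scale. The sketch below uses SQLite (through Python's standard `sqlite3` module) only because it is easy to run; it is an assumption that the same pattern shows up on the databases a JDBC source would actually target, and their planners and storage layouts may behave differently. Note that SQLite spells the operator `%` rather than `mod`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)", [(i, "x") for i in range(1, 1001)]
)


def plan(sql: str) -> list:
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


mod_plan = plan("SELECT * FROM test WHERE id % 2 = 1")
range_plan = plan("SELECT * FROM test WHERE id BETWEEN 1 AND 500")

print(mod_plan)    # a SCAN step: every row is visited and then filtered
print(range_plan)  # a SEARCH step: the primary key narrows the rows read
```

In SQLite the modulo predicate cannot use the primary key index, so each split walks the whole table, while a range predicate on the same key is answered by a key search.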

Nov 30 '22 05:11 Hisoka-X

Whether every split scans all of the data depends on how the data is stored, so it may or may not happen. But there will often be some performance loss.

Nov 30 '22 06:11 ic4y

Yeah, but not all primary keys are numeric, and that's something to think about.

Nov 30 '22 06:11 liugddx

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

Dec 31 '22 00:12 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

Jan 07 '23 00:01 github-actions[bot]
