[Feature][Discuss] Uneven data distribution.

Open mosence opened this issue 2 years ago • 13 comments

Search before asking

  • [X] I have searched the existing feature requests and found no similar one.

Description

In practice, data is often skewed. For example, take a partition key with a value range of [1, 10000] and 100 rows: 99 rows have keys in [1, 100], and 1 row has the key 10000.

Row sizes can be skewed as well: among 100 rows, 99 rows may total 1 MB while a single row is 10 MB.

Can we resolve this problem?
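
To make the skew concrete, here is a minimal sketch of what equal-width range splitting does with the key distribution described above. The helper names `equal_width_splits` and `rows_per_split` and the choice of 4 splits are assumptions for illustration, not SeaTunnel code.

```python
def equal_width_splits(lo, hi, num_splits):
    """Cut the inclusive key range [lo, hi] into num_splits equal-width ranges."""
    width = (hi - lo + 1) // num_splits
    bounds = []
    for i in range(num_splits):
        start = lo + i * width
        # The last split absorbs any remainder so the whole range is covered.
        end = hi if i == num_splits - 1 else start + width - 1
        bounds.append((start, end))
    return bounds


def rows_per_split(keys, bounds):
    """Count how many keys land in each (start, end) range."""
    return [sum(1 for k in keys if start <= k <= end) for start, end in bounds]


# 99 keys inside [1, 100] plus one outlier at 10000, as in the description.
keys = list(range(1, 100)) + [10000]
bounds = equal_width_splits(1, 10000, 4)
counts = rows_per_split(keys, bounds)
print(bounds)
print(counts)
```

With these inputs, one split receives 99 of the 100 rows, two splits are empty, and the last split holds only the outlier, so the parallelism is mostly wasted.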

Usage Scenario

No response

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

mosence avatar Sep 23 '22 08:09 mosence

It seems that we need a more flexible sharding algorithm that takes both the size and the number of rows into account. For different sources, we can then provide a unified interface, to be implemented by any Connector that can support this sharding algorithm.
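
One possible shape for such an interface, as a rough sketch only: `SplitStrategy` and `BudgetedSplit` are hypothetical names, and the sketch assumes per-row sizes are already known (for example from sampling or table statistics), which a real connector may not have.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple

Row = Tuple[int, int]  # (key, size_in_bytes); hypothetical row summary


class SplitStrategy(ABC):
    """Hypothetical unified interface a connector could implement."""

    @abstractmethod
    def split(self, rows: Sequence[Row]) -> List[List[Row]]:
        ...


class BudgetedSplit(SplitStrategy):
    """Greedy packing in key order, capped by both row count and bytes."""

    def __init__(self, max_rows: int, max_bytes: int):
        self.max_rows = max_rows
        self.max_bytes = max_bytes

    def split(self, rows: Sequence[Row]) -> List[List[Row]]:
        splits: List[List[Row]] = []
        current: List[Row] = []
        current_bytes = 0
        for row in sorted(rows):
            _, size = row
            over_budget = (
                len(current) >= self.max_rows
                or current_bytes + size > self.max_bytes
            )
            # Close the current split before it would exceed either budget.
            if current and over_budget:
                splits.append(current)
                current, current_bytes = [], 0
            current.append(row)
            current_bytes += size
        if current:
            splits.append(current)
        return splits


# 99 small rows (about 1 MB in total) and one 10 MB row, as in the issue.
rows = [(k, 10_000) for k in range(1, 100)] + [(100, 10_000_000)]
splits = BudgetedSplit(max_rows=50, max_bytes=2_000_000).split(rows)
split_sizes = [len(s) for s in splits]
print(split_sizes)
```

Here the oversized row ends up alone in its own split instead of being bundled with the small rows, which is the kind of behavior a size-aware strategy would aim for.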

Hisoka-X avatar Sep 26 '22 06:09 Hisoka-X

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Oct 27 '22 00:10 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar Nov 07 '22 00:11 github-actions[bot]

This is a good issue. Temporarily assigned to me.

liugddx avatar Nov 28 '22 13:11 liugddx

https://shardingsphere.apache.org/document/current/en/dev-manual/sharding/

liugddx avatar Nov 29 '22 14:11 liugddx

Regarding the JDBC data source, most primary keys may be values such as UUIDs, so we can't be restricted to numeric types. Hash mode is a better sharding method here. Of course, we can provide more sharding methods as well.
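
As a minimal sketch of hash mode for non-numeric keys: the function `split_for_key` is a hypothetical name, and CRC32 is only one possible stable hash chosen for illustration. Python's built-in `hash()` is avoided because it is salted per process for strings.

```python
import uuid
import zlib
from collections import Counter


def split_for_key(key: str, num_splits: int) -> int:
    """Map any string key (for example a UUID) to a split index."""
    # CRC32 is stable across processes, so every reader agrees on the mapping.
    return zlib.crc32(key.encode("utf-8")) % num_splits


# Deterministic sample UUIDs so the sketch is reproducible.
keys = [str(uuid.UUID(int=i)) for i in range(1000)]
counts = Counter(split_for_key(k, 4) for k in keys)
print(sorted(counts.items()))
```

The assignment depends only on the key's bytes, not on its numeric range, so outliers like the one in the issue description do not create empty splits. It does nothing about skew in row sizes.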

liugddx avatar Nov 29 '22 14:11 liugddx

@ic4y @Hisoka-X Do you have any good suggestions?

liugddx avatar Nov 29 '22 14:11 liugddx

Hash may cause problems on a single database: the rows each split reads may be discontiguous, so performance would not be good. How to express it in SQL is also a problem.

Hisoka-X avatar Nov 30 '22 02:11 Hisoka-X

The SQL needs to be written in each database's dialect, and it needs to be tested and optimized for performance. For example:

select * from test where id mod 2=1

liugddx avatar Nov 30 '22 05:11 liugddx

It seems like every split will scan all the data.
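
A quick way to check this on one engine is to compare query plans. The sketch below uses SQLite from the Python standard library purely as a stand-in; the table layout is an assumption, `%` is used because SQLite has no infix `mod`, and other databases and storage formats may plan these queries differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO test (id, payload) VALUES (?, ?)",
    [(i, "x") for i in range(1, 1001)],
)


def plan(sql: str) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Modulo predicate: the primary key index cannot narrow the rows to read.
mod_plan = plan("SELECT * FROM test WHERE id % 2 = 1")
# Range predicate: the primary key can be used to seek directly.
range_plan = plan("SELECT * FROM test WHERE id >= 1 AND id <= 500")
mod_rows = len(conn.execute("SELECT * FROM test WHERE id % 2 = 1").fetchall())

print(mod_plan)
print(range_plan)
```

In SQLite the modulo query is reported as a scan of the table, while the range query is reported as a search on the primary key. Both return the same number of rows here, but the modulo split touches every row to find them.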

Hisoka-X avatar Nov 30 '22 05:11 Hisoka-X

That depends on how the data is stored, so a full scan may or may not happen. But there will often be some performance loss.

ic4y avatar Nov 30 '22 06:11 ic4y

Yeah, but not all primary keys are numeric, and that's something to think about.

 

Best Regards

liugddx @.***

 

------------------ Original ------------------ From: @.>; Date: Wednesday, November 30, 2022, 2:26 PM To: @.>; Cc: @.>; @.>; Subject: Re: [apache/incubator-seatunnel] [Feature][Discuss] Uneven data distribution. (Issue #2861)

liugddx avatar Nov 30 '22 06:11 liugddx

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Dec 31 '22 00:12 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar Jan 07 '23 00:01 github-actions[bot]