seatunnel
[Feature][Discuss] Uneven data distribution.
Search before asking
- [X] I had searched in the feature and found no similar feature requirement.
Description
In practice, data is often skewed. For example, the partition key ranges over [1, 10000] and there are 100 rows: 99 of them fall in [1, 100], and 1 has the value 10000.
In addition, row sizes can be skewed: among 100 rows, 99 have a total size of 1 MB, while a single row is 10 MB.
Can we resolve this problem?
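To make the skew concrete, here is a minimal sketch that reproduces the example above. It assumes the keys are 1..99 plus 10000 and compares equal-width range splits with boundaries picked from the sorted keys. The function names are hypothetical and are not SeaTunnel APIs; a real connector would probably sample or query the source rather than sort every key in memory.

```python
from bisect import bisect_left

# Data from the example: 100 rows, 99 keys in [1, 100] and one key at 10000.
keys = list(range(1, 100)) + [10000]


def equal_width_counts(keys, lo, hi, num_splits):
    """Count rows per split when [lo, hi] is cut into equal-width ranges."""
    width = (hi - lo + 1) / num_splits
    counts = [0] * num_splits
    for k in keys:
        idx = min(int((k - lo) / width), num_splits - 1)
        counts[idx] += 1
    return counts


def quantile_boundaries(keys, num_splits):
    """Pick inclusive upper bounds so each split holds about the same row count."""
    ordered = sorted(keys)
    n = len(ordered)
    return [ordered[min(n - 1, (i + 1) * n // num_splits - 1)]
            for i in range(num_splits)]


def counts_for_boundaries(keys, boundaries):
    """Count rows per split, where split i covers (boundaries[i-1], boundaries[i]]."""
    counts = [0] * len(boundaries)
    for k in keys:
        counts[bisect_left(boundaries, k)] += 1
    return counts


print(equal_width_counts(keys, 1, 10000, 10))
print(counts_for_boundaries(keys, quantile_boundaries(keys, 10)))
```

With 10 splits, the equal-width version puts 99 rows in the first split and 1 in the last, while the boundary-based version gives every split 10 rows.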
Usage Scenario
No response
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
It seems that we need a more tailored sharding algorithm that takes both the size and the quantity of data into account. For different sources, we can then provide a unified interface, to be implemented by each Connector that can support this sharding algorithm.
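One possible shape for such an interface, as a rough sketch only. `SplitStrategy`, `CountBalanced`, and `SizeBalanced` are hypothetical names, not existing SeaTunnel classes, and the sizes are an approximation of the example from the description (99 records of 10 KB each, roughly 1 MB in total, plus one record of 10240 KB).

```python
import heapq
from typing import List, Protocol, Sequence


class SplitStrategy(Protocol):
    """Hypothetical unified interface; not an existing SeaTunnel API."""

    def assign(self, sizes: Sequence[int], num_splits: int) -> List[List[int]]:
        """Return, for each split, the indices of the records assigned to it."""
        ...


class CountBalanced:
    """Round-robin: every split gets about the same number of records."""

    def assign(self, sizes, num_splits):
        splits = [[] for _ in range(num_splits)]
        for i in range(len(sizes)):
            splits[i % num_splits].append(i)
        return splits


class SizeBalanced:
    """Greedy: place the largest record first into the currently lightest split."""

    def assign(self, sizes, num_splits):
        splits = [[] for _ in range(num_splits)]
        heap = [(0, s) for s in range(num_splits)]  # (current load, split index)
        heapq.heapify(heap)
        for i in sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True):
            load, s = heapq.heappop(heap)
            splits[s].append(i)
            heapq.heappush(heap, (load + sizes[i], s))
        return splits


def split_loads(sizes, splits):
    """Total size handled by each split."""
    return [sum(sizes[i] for i in part) for part in splits]


# Assumed sizes in KB: 99 small records and one large record.
sizes = [10] * 99 + [10240]
print(split_loads(sizes, CountBalanced().assign(sizes, 2)))
print(split_loads(sizes, SizeBalanced().assign(sizes, 2)))
```

With two splits, the size-aware strategy isolates the large record in its own split, but that single record still bounds how balanced the result can be. That is one reason to consider size and count together.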
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.
This is a good issue. Temporarily assigned to me.
https://shardingsphere.apache.org/document/current/en/dev-manual/sharding/
Regarding the JDBC data source, many primary keys are non-numeric values such as UUIDs, so we can't restrict ourselves to numeric types. Hash mode is a better sharding method, and of course we can provide more sharding methods as well.
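As an illustration of hash mode for non-numeric keys, here is a minimal sketch that maps UUID strings to splits. It assumes a stable digest (SHA-256 here) rather than Python's built-in `hash()`, which is randomized per process for strings. `split_for_key` is a hypothetical helper, not a SeaTunnel function.

```python
import hashlib
import uuid


def split_for_key(key: str, num_splits: int) -> int:
    """Map any string key to a split index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_splits


# Deterministic sample keys so the run is repeatable.
keys = [str(uuid.UUID(int=i)) for i in range(1000)]

counts = [0] * 4
for k in keys:
    counts[split_for_key(k, 4)] += 1

print(counts)
```

Every key lands in exactly one split, and the same key always maps to the same split across runs and machines.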
@ic4y @Hisoka-X Do you have any good suggestions?
Hash may have some problems in a single database: the data each split reads may be discontinuous, so the performance is not good. How to describe it in SQL is also a problem.
The SQL needs to be written in each database's dialect, and it needs to be tested and optimized for performance.
select * from test where id mod 2=1
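A rough sketch of how per-split queries like the one above could be generated per dialect. The `MOD_EXPR` table and `split_queries` helper are hypothetical, and the dialect entries should be checked against each database's documentation. The example runs against an in-memory SQLite table, where the modulo operator is `%` rather than `mod`. Identifiers are interpolated directly, so this assumes trusted table and column names.

```python
import sqlite3

# Hypothetical dialect table: how each database spells "col modulo n".
MOD_EXPR = {
    "mysql": "{col} MOD {n}",
    "sqlite": "{col} % {n}",
    "oracle": "MOD({col}, {n})",
}


def split_queries(table, col, num_splits, dialect):
    """Build one query per split; split r reads rows where col modulo num_splits = r."""
    expr = MOD_EXPR[dialect].format(col=col, n=num_splits)
    return [f"SELECT * FROM {table} WHERE {expr} = {r}" for r in range(num_splits)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO test (id) VALUES (?)", [(i,) for i in range(1, 11)])

queries = split_queries("test", "id", 2, "sqlite")
results = [sorted(row[0] for row in conn.execute(q)) for q in queries]
print(queries)
print(results)
```

The two splits are disjoint and together cover all ten rows, but each split reads interleaved ids rather than one contiguous range.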
It seems like every split will scan all the data.
That depends on how the data is stored, so it may or may not happen. But often there will be some performance loss.
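One way to check this for a given engine is to compare query plans. The sketch below does so for SQLite only, which is an assumption made for illustration. Other databases, storage engines, or expression indexes may behave differently. `plan_details` is a hypothetical helper.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO test VALUES (?, ?)", [(i, "x") for i in range(1, 1001)])

# A modulo predicate cannot use the primary key, so the whole table is scanned.
mod_plan = plan_details(conn, "SELECT * FROM test WHERE id % 2 = 1")

# A range predicate can seek directly on the primary key.
range_plan = plan_details(conn, "SELECT * FROM test WHERE id BETWEEN 1 AND 500")

print(mod_plan)
print(range_plan)
```

In SQLite the modulo query is reported as a scan and the range query as a search on the primary key, which matches the concern that each hash split reads the full table.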
Yeah, but not all primary keys are numeric, and that's something to think about.
Best Regards,
liugddx @.***
------------------ Original ------------------ From: @.>; Date: Wednesday, November 30, 2022, 2:26 PM To: @.>; Cc: @.>; @.>; Subject: Re: [apache/incubator-seatunnel] [Feature][Discuss] Uneven data distribution. (Issue #2861)
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.