seatunnel
[Feature][Connector-V2-Clickhouse] Clickhouse Source support multi-split read.
subtask of #2789
Assigned to @wuxizhi777
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
@wuxizhi777 Hi, Any progress?
Hi, is this PR still going on? I can take over and finish it if needed.
Great! I assigned it to you.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.
- support split read
- add e2e testcase
- update docs
Reference:
- https://github.com/apache/incubator-seatunnel/tree/dev/seatunnel-connectors-v2/connector-iotdb
- https://github.com/apache/incubator-seatunnel/tree/dev/seatunnel-connectors-v2/connector-jdbc
- https://github.com/apache/incubator-seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-jdbc-e2e
@hailin0 @Hisoka-X To implement ClickHouse split reading, the original HTTP-based SQL submission needs to be changed to JDBC submission, consistent with what the sink side already uses. The reasons are as follows:
- HTTP has limited concurrent reading capability.
- HTTP has limited data transfer capacity.
To achieve split reading, JDBC's prepared-statement feature is required. For example, the query "select a, b from test" can be rewritten with a prepared statement as "select a, b from test where a between ? and ?". Then, by applying a predefined splitting strategy, the original query can be expanded into multiple non-overlapping range queries such as:
- "select a, b from test where a between 1 and 50",
- "select a, b from test where a between 51 and 100",
- "select a, b from test where a between 101 and 200".
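The range computation behind those queries can be sketched in Java. The class and method names below are illustrative, not the actual connector code; this version keeps adjacent inclusive BETWEEN ranges non-overlapping by starting each range one past the previous upper bound:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: derive per-split BETWEEN ranges from a numeric
// partition column. Not SeaTunnel's actual implementation.
public class SplitRanges {

    // Splits the inclusive interval [lowerBound, upperBound] into
    // partitionNum contiguous, non-overlapping inclusive ranges,
    // distributing any remainder across the first partitions.
    public static List<long[]> compute(long lowerBound, long upperBound, int partitionNum) {
        List<long[]> ranges = new ArrayList<>();
        long total = upperBound - lowerBound + 1;
        long size = total / partitionNum;
        long remainder = total % partitionNum;
        long start = lowerBound;
        for (int i = 0; i < partitionNum; i++) {
            // First `remainder` partitions get one extra row each.
            long end = start + size - 1 + (i < remainder ? 1 : 0);
            ranges.add(new long[] {start, end});
            start = end + 1;
        }
        return ranges;
    }

    // Renders one split as a concrete range query.
    public static String toQuery(String baseSql, String column, long[] range) {
        return baseSql + " where " + column + " between " + range[0] + " and " + range[1];
    }

    public static void main(String[] args) {
        for (long[] r : compute(1, 200, 4)) {
            System.out.println(toQuery("select a, b from test", "a", r));
        }
    }
}
```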
By using a SplitEnumerator, the SQL queries can be distributed to different readers, thus achieving parallel reading capabilities.
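The distribution step can be sketched as below, assuming a simple round-robin policy; the real SeaTunnel `SourceSplitEnumerator` API has more moving parts, and the names here are only illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of how an enumerator might distribute split
// queries across parallel readers. Round-robin is one possible policy;
// the actual assignment strategy is up to the connector.
public class SplitAssigner {

    // Maps reader index (0..readerCount-1) to the list of split queries
    // that reader will execute.
    public static Map<Integer, List<String>> assign(List<String> splitQueries, int readerCount) {
        Map<Integer, List<String>> assignment = new HashMap<>();
        for (int i = 0; i < readerCount; i++) {
            assignment.put(i, new ArrayList<>());
        }
        for (int i = 0; i < splitQueries.size(); i++) {
            assignment.get(i % readerCount).add(splitQueries.get(i));
        }
        return assignment;
    }
}
```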
Regarding the definition of the partition parameters:
- scan.partition.column: Partition column name. This configuration item specifies the column used for partitioning. Data will be split into multiple partitions based on the values of this column, enabling parallel reading. Typically, you should choose a column with a good data distribution that is related to the query conditions.
- scan.partition.num: Number of partitions. This configuration item specifies how many partitions the data source should be split into. A larger value increases parallelism and can improve read speed, but also increases memory and compute demand; a smaller value reduces resource usage at the cost of slower reads. Adjust it according to actual requirements and available resources.
- scan.partition.lower-bound: Lower bound of the partition column. This configuration item specifies the minimum value of the partition column, defining the start of the partition range so that all rows with partition column values greater than or equal to this bound are read.
- scan.partition.upper-bound: Upper bound of the partition column. This configuration item specifies the maximum value of the partition column, defining the end of the partition range so that all rows with partition column values less than or equal to this bound are read.
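Putting the options together, a source configuration might look like the sketch below. The scan.partition.* keys come from the definitions above; the surrounding Clickhouse options (host, database, sql) follow the existing source docs and are shown only for context, so the concrete values are illustrative:

```hocon
source {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    sql = "select a, b from test"
    # Proposed split options (keys quoted to keep the literal dotted names)
    "scan.partition.column" = "a"
    "scan.partition.num" = 4
    "scan.partition.lower-bound" = 1
    "scan.partition.upper-bound" = 200
  }
}
```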
Compatibility: No compatibility issues