[Feature][Connector-V2-Clickhouse] Clickhouse Source support multi-split read.

Open Hisoka-X opened this issue 2 years ago • 4 comments

subtask of #2789

Hisoka-X avatar Sep 20 '22 08:09 Hisoka-X

Assign to @wuxizhi777

Hisoka-X avatar Sep 20 '22 08:09 Hisoka-X

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Oct 21 '22 00:10 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Nov 23 '22 00:11 github-actions[bot]

@wuxizhi777 Hi, any progress?

Hisoka-X avatar Nov 23 '22 02:11 Hisoka-X

Hi, is this PR still in progress? I can take over and finish it if needed.

MonsterChenzhuo avatar Dec 14 '22 13:12 MonsterChenzhuo

Hi, is this PR still in progress? I can take over and finish it if needed.

Great! I assigned to you.

Hisoka-X avatar Dec 14 '22 13:12 Hisoka-X

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Jan 14 '23 00:01 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar Feb 13 '23 00:02 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Mar 22 '23 00:03 github-actions[bot]

Clickhouse Source

  • support split read
  • add e2e testcase
  • update docs

reference

  • https://github.com/apache/incubator-seatunnel/tree/dev/seatunnel-connectors-v2/connector-iotdb
  • https://github.com/apache/incubator-seatunnel/tree/dev/seatunnel-connectors-v2/connector-jdbc
  • https://github.com/apache/incubator-seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-jdbc-e2e

hailin0 avatar Mar 28 '23 07:03 hailin0

@hailin0 @Hisoka-X To implement ClickHouse sharded reading, the source needs to submit SQL queries over JDBC, as the sink side already does, instead of over HTTP. The reasons are as follows:

  • HTTP has limited concurrent read capability.
  • HTTP has limited data transfer capacity.
  • Sharding requires JDBC's prepared statement feature.

For example, the SQL query "select a, b from test" can be rewritten as the prepared statement "select a, b from test where a between ? and ?". Applying a predefined sharding strategy then splits the original query into multiple queries, such as:

  • "select a, b from test where a between 1 and 50"
  • "select a, b from test where a between 51 and 100"
  • "select a, b from test where a between 101 and 200"

Because BETWEEN is inclusive at both ends, adjacent ranges must not share a boundary value, otherwise the rows at that value would be read twice. A SplitEnumerator then distributes these queries to different readers, which read them in parallel.

[image]
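The range splitting and reader assignment described above could look roughly like the following. This is only a minimal sketch, not the connector's actual code: the class `RangeSplitter`, its methods, the even-width splitting strategy, and the round-robin assignment are all assumptions made for illustration, and it assumes a numeric split column whose value range fits in a `long`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SeaTunnel.
final class RangeSplitter {

    // Splits the inclusive range [lower, upper] into at most numSplits
    // contiguous, non-overlapping inclusive ranges {start, end}.
    // Each pair would be bound to the two '?' placeholders of
    // "... where a between ? and ?".
    static List<long[]> split(long lower, long upper, int numSplits) {
        if (numSplits < 1 || lower > upper) {
            throw new IllegalArgumentException("invalid split arguments");
        }
        long total = upper - lower + 1;            // assumes no long overflow
        int n = (int) Math.min(numSplits, total);  // never create empty ranges
        long base = total / n;
        long remainder = total % n;

        List<long[]> ranges = new ArrayList<>();
        long start = lower;
        for (int i = 0; i < n; i++) {
            // The first 'remainder' ranges take one extra value each.
            long size = base + (i < remainder ? 1 : 0);
            long end = start + size - 1;
            ranges.add(new long[] {start, end});
            start = end + 1;                       // next range starts after this one
        }
        return ranges;
    }

    // Round-robin assignment of ranges to reader indexes, similar in spirit
    // to what an enumerator would do when handing splits to readers.
    static List<List<long[]>> assign(List<long[]> ranges, int readers) {
        if (readers < 1) {
            throw new IllegalArgumentException("readers must be >= 1");
        }
        List<List<long[]>> perReader = new ArrayList<>();
        for (int r = 0; r < readers; r++) {
            perReader.add(new ArrayList<>());
        }
        for (int i = 0; i < ranges.size(); i++) {
            perReader.get(i % readers).add(ranges.get(i));
        }
        return perReader;
    }
}
```

With `split(1, 200, 3)` this yields three ranges that together cover 1 to 200 without sharing a boundary value; each reader would then bind its ranges to the prepared statement in turn.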

MonsterChenzhuo avatar May 09 '23 10:05 MonsterChenzhuo

Regarding the definition of the partitioning parameters:

scan.partition.column: Partition column name. This configuration item specifies the column used for partitioning. Data will be split into multiple partitions based on the values of this column, enabling parallel reading. Typically, you should choose a column whose values are evenly distributed and that is related to the query conditions.

scan.partition.num: Number of partitions. This configuration item specifies the number of partitions the data source should be split into. A larger value can increase parallelism, thereby improving read speed, but may also increase the demand for memory and computational resources. A smaller value may result in slower read speeds but lower resource usage. You can adjust this value according to actual requirements and available resources.

scan.partition.lower-bound: Lower bound of the partition column. This configuration item specifies the minimum value of the partition column. It is used to define the range of data partitions, ensuring that all data in the data source with partition column values greater than or equal to this lower bound will be read.

scan.partition.upper-bound: Upper bound of the partition column. This configuration item specifies the maximum value of the partition column. It is used to define the range of data partitions, ensuring that all data in the data source with partition column values less than or equal to this upper bound will be read.
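As an illustration of how these four options might be consumed, here is a minimal sketch. It is not SeaTunnel code: the class `PartitionOptions`, treating all four options as required, parsing the bounds as `long`, and wrapping the base query in a subquery are all assumptions made for this example.

```java
import java.util.Map;

// Hypothetical options holder, not part of SeaTunnel.
final class PartitionOptions {
    final String column;
    final int num;
    final long lowerBound;
    final long upperBound;

    private PartitionOptions(String column, int num, long lowerBound, long upperBound) {
        this.column = column;
        this.num = num;
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    // Reads the proposed option keys from a plain string map and validates them.
    // Assumption: all four options are required and the bounds are integers.
    static PartitionOptions fromMap(Map<String, String> conf) {
        String column = require(conf, "scan.partition.column");
        int num = Integer.parseInt(require(conf, "scan.partition.num"));
        long lower = Long.parseLong(require(conf, "scan.partition.lower-bound"));
        long upper = Long.parseLong(require(conf, "scan.partition.upper-bound"));
        if (num < 1) {
            throw new IllegalArgumentException("scan.partition.num must be >= 1");
        }
        if (lower > upper) {
            throw new IllegalArgumentException("lower-bound must not exceed upper-bound");
        }
        return new PartitionOptions(column, num, lower, upper);
    }

    private static String require(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing option: " + key);
        }
        return value.trim();
    }

    // Wraps the user query in a subquery so the range filter also works
    // when the base query already has its own WHERE clause.
    String toParameterizedQuery(String baseQuery) {
        return "SELECT * FROM (" + baseQuery + ") AS t WHERE " + column + " BETWEEN ? AND ?";
    }
}
```

Each partition would bind its own inclusive bounds to the two placeholders; rows outside lower-bound and upper-bound would not be read under this sketch.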

MonsterChenzhuo avatar May 09 '23 10:05 MonsterChenzhuo

Compatibility: No compatibility issues

MonsterChenzhuo avatar May 09 '23 12:05 MonsterChenzhuo