Tooling to migrate data from ClickHouse to Iceberg
**Is your feature request related to a problem? Please describe.**
As Iceberg adoption increases, we need tooling to copy data quickly from MergeTree tables into Iceberg for side-by-side testing. ClickHouse already has many of the capabilities needed to do this, and Project Antalya is adding more.
**Describe the solution you'd like**
We need a simple way to identify a ClickHouse table and copy its schema and data into Iceberg. The tool would accept the following arguments (a sketch of the argument surface follows the list).
- ClickHouse source table.
- Target Iceberg installation (REST server and S3 endpoint).
- Optional SQL query to select the data to copy.
- Optionally, specific parts or partitions to copy.
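A possible argument surface, sketched here with argparse; all flag names are hypothetical, not a proposal for the final interface:

```python
import argparse

# Hypothetical argument surface for a ClickHouse-to-Iceberg migration tool.
# All flag names are illustrative; the real tool may differ.
parser = argparse.ArgumentParser(description="Copy a ClickHouse table into Iceberg")
parser.add_argument("--source-table", required=True,
                    help="ClickHouse source table, e.g. db.events")
parser.add_argument("--iceberg-rest-uri", required=True,
                    help="Iceberg REST catalog endpoint")
parser.add_argument("--s3-endpoint", required=True,
                    help="S3 endpoint backing the Iceberg warehouse")
parser.add_argument("--query",
                    help="Optional SQL query selecting the data to copy")
parser.add_argument("--partitions", nargs="*",
                    help="Optional list of parts/partitions to copy")
args = parser.parse_args()
```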
The migration tool would work as follows (a code sketch of these steps appears after the list).
- Create the table schema in Iceberg, including the sort order and partitioning.
- SELECT table data into Parquet files on S3, divided by partition key and observing the table sort order.
- Register the new files with the Iceberg table.
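To make these steps concrete, here is a minimal sketch using pyiceberg and clickhouse-connect. Everything in it is illustrative: the tables (`db.events` / `demo.events`), endpoints, credentials, and columns are placeholders, and a real tool would derive the schema, partition spec, and sort order from the source MergeTree table rather than hard-coding them.

```python
import clickhouse_connect
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, DateType, LongType, StringType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.table.sorting import SortOrder, SortField
from pyiceberg.transforms import IdentityTransform

catalog = load_catalog(
    "rest",
    **{
        "uri": "http://localhost:8181",          # Iceberg REST server (placeholder)
        "s3.endpoint": "http://localhost:9000",  # S3 endpoint (placeholder)
        "s3.access-key-id": "minio",
        "s3.secret-access-key": "minio123",
    },
)

# Step 1: create the Iceberg table with partitioning and sort order
# mirroring the MergeTree PARTITION BY / ORDER BY clauses.
schema = Schema(
    NestedField(1, "event_date", DateType(), required=False),
    NestedField(2, "user_id", LongType(), required=False),
    NestedField(3, "payload", StringType(), required=False),
)
tbl = catalog.create_table(
    "demo.events",
    schema=schema,
    partition_spec=PartitionSpec(
        PartitionField(source_id=1, field_id=1000,
                       transform=IdentityTransform(), name="event_date"),
    ),
    sort_order=SortOrder(SortField(source_id=2, transform=IdentityTransform())),
)

# Step 2: let ClickHouse write sorted Parquet directly to S3.
ch = clickhouse_connect.get_client(host="localhost")
data_file = "http://localhost:9000/warehouse/demo/events/data/part-0.parquet"
ch.command(f"""
    INSERT INTO FUNCTION s3('{data_file}', 'minio', 'minio123', 'Parquet')
    SELECT event_date, user_id, payload FROM db.events ORDER BY user_id
""")

# Step 3: register the file with the Iceberg table without rewriting it.
# ClickHouse-written Parquet lacks Iceberg field IDs, so pyiceberg resolves
# columns by name through the table's name mapping (setup omitted here).
tbl.add_files(file_paths=["s3://warehouse/demo/events/data/part-0.parquet"])
```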
For large tables, it would be convenient to chunk the data and to resume after a failure (a resumable chunking sketch follows). As Project Antalya adds capabilities, we would want to offload Parquet part generation to swarm clusters.
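A resumable, partition-at-a-time loop could enumerate partitions from `system.parts` and checkpoint progress after each chunk. A rough sketch, reusing the hypothetical endpoints above:

```python
# Partition-level chunking with resume. Completed partitions are recorded
# in a local checkpoint file and skipped on restart.
import json
import os
import clickhouse_connect

CHECKPOINT = "migration_checkpoint.json"
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        done = set(json.load(f))
else:
    done = set()

ch = clickhouse_connect.get_client(host="localhost")
partitions = [
    row[0]
    for row in ch.query(
        "SELECT DISTINCT partition_id FROM system.parts "
        "WHERE database = 'db' AND table = 'events' AND active"
    ).result_rows
]

for p in partitions:
    if p in done:
        continue  # this chunk finished before a previous failure
    url = f"http://localhost:9000/warehouse/demo/events/data/part-{p}.parquet"
    ch.command(
        f"INSERT INTO FUNCTION s3('{url}', 'minio', 'minio123', 'Parquet') "
        f"SELECT * FROM db.events WHERE _partition_id = '{p}' ORDER BY user_id"
    )
    done.add(p)
    with open(CHECKPOINT, "w") as f:
        json.dump(sorted(done), f)  # persist progress after each chunk
# Registration with Iceberg (add_files) would run once all chunks exist.
```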
**Describe alternatives you've considered**
It's already possible to do this with Python scripts (e.g., pyiceberg + pyarrow), but this approach is hard to scale, as the sketch below shows. ClickHouse, by contrast, can already distribute Parquet file generation across clusters and scans data very efficiently.
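For reference, the Python-only path looks roughly like this (hypothetical names again). It works, but every byte is funneled through a single client process and rewritten by pyiceberg, which is what makes it hard to scale:

```python
import clickhouse_connect
from pyiceberg.catalog import load_catalog

ch = clickhouse_connect.get_client(host="localhost")
arrow_table = ch.query_arrow("SELECT * FROM db.events")  # whole result in RAM

catalog = load_catalog("rest", uri="http://localhost:8181")
# append() rewrites the data as new Parquet files in the Iceberg warehouse;
# the Arrow schema must be compatible with the Iceberg table schema.
catalog.load_table("demo.events").append(arrow_table)
```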
We could also use Spark, but it is slow and cumbersome. We need simple, fast tools for this purpose.
**Additional context**
- The copy capability is broadly similar to backup. It would be interesting to explore synergies with backup/restore.
- The tooling needs to work on the command line as well as in containers running on Kubernetes.