
INSERT ... SELECT to S3 Parquet output format does not always preserve the sort order.

Open · hodgesrm opened this issue on Aug 29, 2025 · 4 comments

**Describe the bug**
Under circumstances that are not yet fully clear, a command like the following generates Parquet files whose rows are not sorted within row groups.

```sql
INSERT INTO FUNCTION s3('s3://someuser-ice-rest-catalog-demo/database/table/data/partition={_partition_id}.parquet', 'Parquet')
PARTITION BY (c1, c2)
SELECT * FROM <database>.<table>
ORDER BY c3, c4, c5, c6, c7
```

You can confirm the problem by trying to load the Parquet files using ice insert like the following example:

```
ice insert database.table -p \
  's3://someuser-ice-rest-catalog-demo/database/table/data/*.parquet' \
  --no-copy --thread-count=10 \
  --partition='[{"column":"c1", "transform": "day"}, {"column":"c1"}]' \
  --sort='[{"column":"c3","desc":false}, {"column":"c4", "desc": false}, {"column":"c5", "desc": false}, {"column":"c6", "desc": false}, {"column":"c7", "desc": false}]'
```

It fails with an exception:

```
2025-08-29 04:12:08 [main] ERROR c.a.ice.rest.catalog.Main > Fatal
java.io.IOException: Error processing file(s)
        at com.altinity.ice.cli.internal.cmd.Insert.run(Insert.java:230)
        at com.altinity.ice.cli.Main.insert(Main.java:362)
        at com.altinity.ice.rest.catalog.Main.main(Main.java:523) [10 skipped]
Caused by: java.io.IOException: Error processing s3://rhodges-ice-rest-catalog-demo/mergetree/dns_mtree_repl_v2_0/data/partition=('2025-01-02',30).parquet
        at com.altinity.ice.cli.internal.cmd.Insert.lambda$run$3(Insert.java:215)
 [4 skipped]
Caused by: org.apache.iceberg.exceptions.BadRequestException: s3://someuser-ice-rest-catalog-demo/database/table/data/partition=('2025-01-02',30).parquet does not appear to be sorted
        at com.altinity.ice.cli.internal.cmd.Insert.processFile(Insert.java:372)
        at com.altinity.ice.cli.internal.cmd.Insert.lambda$run$3(Insert.java:194)
        ... 4 common frames omitted
```

@shyiko diagnosed this failure using ice diagnostic messages and demonstrated that the file in question is not properly sorted within the row group:

```
Record(0, 0, 0, 10000, 1735789600)
Record(0, 0, 0, 11000, 1735799600)
Record(0, 0, 0, 12000, 1735809600)
Record(0, 0, 0, 13000, 1735819600)
Record(0, 0, 0, 14000, 1735829600)
Record(0, 0, 0, 15000, 1735839600)
Record(0, 0, 0, 16000, 1735849600)
Record(0, 0, 0, 17000, 1735859600)
Record(0, 0, 0, 9000, 1735779600)
expected Record(0, 0, 0, 9000, 1735779600) to be before Record(0, 0, 0, 17000, 1735859600)
```


**To Reproduce**
(Currently it can only be reproduced with a customer schema.) 

**Expected behavior**
Table should load. 

**Key information**
 - Project Antalya Build Version: 25.3.3.20186
 - Cloud provider: AWS
 - Kubernetes provider: EKS
 - Object storage: AWS S3
 - Iceberg catalog: ice-rest-catalog version 0.6.0

**Additional context**
None. 

hodgesrm · Aug 29 '25

What if you try disabling `output_format_parquet_parallel_encoding`?
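
For illustration, a minimal sketch of that experiment, reusing the placeholder bucket and table names from the report; the setting is applied at the session level before rerunning the export:

```sql
-- Sketch: disable multi-threaded Parquet encoding for this session, then rerun
-- the export and re-check the resulting files. Bucket, database, and table
-- names below are placeholders from the original report.
SET output_format_parquet_parallel_encoding = 0;

INSERT INTO FUNCTION s3('s3://someuser-ice-rest-catalog-demo/database/table/data/partition={_partition_id}.parquet', 'Parquet')
PARTITION BY (c1, c2)
SELECT * FROM <database>.<table>
ORDER BY c3, c4, c5, c6, c7;
```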

filimonov · Aug 29 '25

@filimonov, I will try it.

hodgesrm · Aug 29 '25

Disabling `output_format_parquet_parallel_encoding` does not help. The Iceberg Java APIs give the same error. For now there is one workaround:

  - Omit sort orders when loading Apache Iceberg data.

This is less onerous than it appears at first, but the topic needs deeper analysis. Here's the current understanding.

  1. ClickHouse does not currently use Iceberg sort ordering to plan queries, so there is no practical effect on ClickHouse planning or query execution, though that will likely change in the future.
  2. ClickHouse does look at min/max values in Parquet column statistics. It must read the files to get this information, but Antalya swarm queries will cache that metadata, so the lack of sort information in Iceberg metadata should have minimal impact as long as Parquet metadata caching is active (see the metadata query sketch after this list).
  3. Sorting does make row groups more efficient when it partitions them cleanly by min/max value, because it makes row group pruning more selective. It's unclear how much of an impact this has.
  4. Other tools that understand and use Iceberg sort ordering, such as Apache Spark, may be less performant as a result. This might have practical effects on important operations like compaction.
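
As a side note for points 2 and 3, the per-row-group statistics can be inspected directly from ClickHouse with the `ParquetMetadata` input format. A minimal sketch, assuming a placeholder object path and suitable S3 credentials:

```sql
-- Read the Parquet footer of one exported file and show row counts plus
-- per-row-group metadata (including per-column min/max statistics).
-- The object path is a placeholder; substitute a real file.
SELECT
    num_rows,
    num_row_groups,
    row_groups
FROM s3(
    's3://someuser-ice-rest-catalog-demo/database/table/data/partition=<value>.parquet',
    'ParquetMetadata'
)
FORMAT Vertical;
```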

hodgesrm · Aug 29 '25

Looks like the problem may be in the ice sort-order check, not in the ClickHouse data. Fixed in ice 0.6.1: https://github.com/Altinity/ice/releases/tag/v0.6.1.

hodgesrm · Sep 4 '25