
Corrupted Parquet Statistics in Trino SQL

Open tjc-enoch opened this issue 1 year ago • 2 comments

Hello,

We are using parquet-go v1.6.2 to convert files to Parquet. When we query those files through our SQL engine, Trino v380, we get this error:

2024-06-18T20:23:45.343Z ERROR stage-scheduler io.trino.execution.StageStateMachine Stage 20240618_202345_03674_xm9wc.1 failed
io.trino.spi.TrinoException: Corrupted statistics for column "filename" in Parquet file "s3a://<REDACTED>/date_part=2024-06-18/<REDACTED>.parquet". Corrupted column index: [Boudary order: UNORDERED
    null count min max
    page-0 <REDACTED> <REDACTED>
    page-1 <REDACTED> <REDACTED>
    page-2 <REDACTED> <REDACTED>
    page-3 <REDACTED> <REDACTED>
    page-4 <REDACTED> <REDACTED>
    page-5 <REDACTED> <REDACTED>
    page-6 <REDACTED> <REDACTED>
    page-7 <REDACTED> <REDACTED>
]
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:278)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:164)
    at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:290)
    at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:195)
    at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
    at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:68)
    at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:268)
    at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:196)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:338)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
    at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$3(WorkProcessorUtils.java:219)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
    at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$4(WorkProcessorUtils.java:234)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:150)
    at io.trino.operator.Driver.processInternal(Driver.java:388)
    at io.trino.operator.Driver.lambda$processFor$9(Driver.java:292)
    at io.trino.operator.Driver.tryWithLock(Driver.java:693)
    at io.trino.operator.Driver.processFor(Driver.java:285)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1092)
    at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
    at io.trino.$gen.Trino_380____20240612_170007_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

This could be an error on Trino's side, but I'm opening the issue here because, compared with Parquet files converted elsewhere, ours are missing some column statistics. Namely the column order, which seems to be the problem here.
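For anyone comparing files, here is a minimal Go sketch of how a boundary order can be derived from per-page min/max values, so the result can be checked against what a file's column index declares. It assumes plain byte-wise string comparison and the three boundary order values from the Parquet format (UNORDERED, ASCENDING, DESCENDING). `PageStats` and `ClassifyBoundaryOrder` are hypothetical names for illustration, not part of parquet-go or Trino, and this is not necessarily the check Trino performs.

```go
package main

import "fmt"

// PageStats holds the min and max values recorded for one data page.
// Hypothetical type for illustration; not a parquet-go type.
type PageStats struct {
	Min, Max string
}

// BoundaryOrder mirrors the three values defined by the Parquet format.
type BoundaryOrder int

const (
	Unordered BoundaryOrder = iota
	Ascending
	Descending
)

func (b BoundaryOrder) String() string {
	switch b {
	case Ascending:
		return "ASCENDING"
	case Descending:
		return "DESCENDING"
	default:
		return "UNORDERED"
	}
}

// ClassifyBoundaryOrder reports ASCENDING when both the min and the max
// values never decrease from page to page, DESCENDING when both never
// increase, and UNORDERED otherwise. Zero or one page counts as ASCENDING
// in this sketch (an assumption, since there is nothing to compare).
func ClassifyBoundaryOrder(pages []PageStats) BoundaryOrder {
	asc, desc := true, true
	for i := 1; i < len(pages); i++ {
		prev, cur := pages[i-1], pages[i]
		// Go compares strings byte by byte, which is the assumed ordering.
		if cur.Min < prev.Min || cur.Max < prev.Max {
			asc = false
		}
		if cur.Min > prev.Min || cur.Max > prev.Max {
			desc = false
		}
	}
	switch {
	case asc:
		return Ascending
	case desc:
		return Descending
	default:
		return Unordered
	}
}

func main() {
	// Made-up page statistics; real values would come from the file's column index.
	pages := []PageStats{{"a", "c"}, {"x", "z"}, {"b", "d"}}
	fmt.Println(ClassifyBoundaryOrder(pages))
}
```

If the order computed this way disagrees with the one stored in the file, that would point at the writer rather than the reader.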

tjc-enoch, Jun 18 '24 22:06

Hey, I'm not 100% sure, but this could be linked to https://github.com/xitongsys/parquet-go/issues/547

robertino, Jul 11 '24 11:07

As per https://github.com/trinodb/trino/issues/24840 this lib is broken.

chippyash, Jan 31 '25 06:01