
read a parquet file error

l1t1 opened this issue 11 months ago · 4 comments

When running count(*):

> select count(*) from external('c:/t/t.parquet') ;
Error executing SQL: Error while reading parquet file: Error in c:/t/t.parquet
        Error while decoding row group 327 column chunk for column 'id' of type 'INT32' at offset 1419196279 of size 201495
        Decompressing compressed page of type 'DATA_PAGE_V2' at offset 1419196279 with codec 'ZSTD' failed
        (compressed region offset: 1419204506, compressed size: 193268, expected uncompressed size: 270339)
        Actual uncompressed size (262144 bytes) of ZSTD compressed data is less than expected (270339 bytes)
Hint: The file is probably corrupt.
Context: 0xfa6b0e2f

When fetching the first few rows:

> select * from external('c:/t/t.parquet') limit 5;
Error executing SQL: Error while reading parquet file: Error in c:/t/t.parquet
        Error while decoding row group 15 column chunk for column 'id' of type 'INT32' at offset 63856848 of size 180917
        Decompressing compressed page of type 'DATA_PAGE_V2' at offset 63856848 with codec 'ZSTD' failed
        (compressed region offset: 63864694, compressed size: 173071, expected uncompressed size: 257803)
        Actual uncompressed size (249988 bytes) of ZSTD compressed data is less than expected (257803 bytes)
Hint: The file is probably corrupt.
Context: 0xfa6b0e2f

l1t1 avatar Mar 07 '24 03:03 l1t1

Both 'duckdb' and 'polars' can read the same file.

sql.execute('SELECT max(id) a,min(id) b FROM t',eager=True)
shape: (1, 2)
┌──────────┬─────┐
│ a        ┆ b   │
│ ---      ┆ --- │
│ i32      ┆ i32 │
╞══════════╪═════╡
│ 30000000 ┆ 1   │
└──────────┴─────┘

l1t1 avatar Mar 07 '24 03:03 l1t1

Interesting find. Can you share the file with us?

jkammerer avatar Mar 11 '24 09:03 jkammerer

It's too big (2.3 GB), and I am trying to find a smaller example.

l1t1 avatar Mar 12 '24 05:03 l1t1

I got some info about the bad file:

>>> import pyarrow.parquet as pq
>>> parquet_file = 'c:/t/t.parquet'
>>> metadata = pq.ParquetFile(parquet_file).metadata
>>> print(metadata)
<pyarrow._parquet.FileMetaData object at 0x0000000004FE0AE0>
  created_by: Arrow2 - Native Rust implementation of Arrow
  num_columns: 83
  num_rows: 5000000
  num_row_groups: 95
  format_version: 2.6
  serialized_size: 518729

For comparison, metadata from a good file:

>>> metadata = pq.ParquetFile(parquet_file).metadata
>>> print(metadata)
<pyarrow._parquet.FileMetaData object at 0x000000000500AAE0>
  created_by: DuckDB
  num_columns: 6
  num_rows: 50000000
  num_row_groups: 499
  format_version: 1.0
  serialized_size: 275300

l1t1 avatar Mar 13 '24 06:03 l1t1