hyper-api-samples
Error reading a Parquet file
When running count(*):
> select count(*) from external('c:/t/t.parquet');
Error executing SQL: Error while reading parquet file: Error in c:/t/t.parquet
Error while decoding row group 327 column chunk for column 'id' of type 'INT32' at offset 1419196279 of size 201495
Decompressing compressed page of type 'DATA_PAGE_V2' at offset 1419196279 with codec 'ZSTD' failed
(compressed region offset: 1419204506, compressed size: 193268, expected uncompressed size: 270339)
Actual uncompressed size (262144 bytes) of ZSTD compressed data is less than expected (270339 bytes)
Hint: The file is probably corrupt.
Context: 0xfa6b0e2f
When fetching the first few rows:
> select * from external('c:/t/t.parquet') limit 5;
Error executing SQL: Error while reading parquet file: Error in c:/t/t.parquet
Error while decoding row group 15 column chunk for column 'id' of type 'INT32' at offset 63856848 of size 180917
Decompressing compressed page of type 'DATA_PAGE_V2' at offset 63856848 with codec 'ZSTD' failed
(compressed region offset: 63864694, compressed size: 173071, expected uncompressed size: 257803)
Actual uncompressed size (249988 bytes) of ZSTD compressed data is less than expected (257803 bytes)
Hint: The file is probably corrupt.
Context: 0xfa6b0e2f
Both DuckDB and Polars can read the same file without errors. For example, with Polars:
sql.execute('SELECT max(id) a, min(id) b FROM t', eager=True)
shape: (1, 2)
┌──────────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞══════════╪═════╡
│ 30000000 ┆ 1 │
└──────────┴─────┘
Interesting find. Can you share the file with us?
It's too big (2.3 GB), and I am trying to find a smaller example.
Here is some metadata for the bad file:
>>> import pyarrow.parquet as pq
>>> parquet_file = 'c:/t/t.parquet'
>>> metadata = pq.ParquetFile(parquet_file).metadata
>>> print(metadata)
<pyarrow._parquet.FileMetaData object at 0x0000000004FE0AE0>
created_by: Arrow2 - Native Rust implementation of Arrow
num_columns: 83
num_rows: 5000000
num_row_groups: 95
format_version: 2.6
serialized_size: 518729
Metadata for another file that reads fine:
>>> metadata = pq.ParquetFile(parquet_file).metadata
>>> print(metadata)
<pyarrow._parquet.FileMetaData object at 0x000000000500AAE0>
created_by: DuckDB
num_columns: 6
num_rows: 50000000
num_row_groups: 499
format_version: 1.0
serialized_size: 275300