`OverflowError` when reading iceberg table with > 2**32 rows
Checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
```python
import shutil
import tempfile

import numpy as np
import polars as pl
import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog

warehouse_path = str(tempfile.mkdtemp())
catalog = SqlCatalog(
    "test_catalog",
    **{
        "uri": f"sqlite:///{warehouse_path}/catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
catalog.create_namespace_if_not_exists("test_ns")
table = catalog.create_table(
    "test_ns.test_table",
    schema=pa.schema([pa.field("flag", pa.bool_())]),
)

# uint32 max is ~4.29 billion
row_count = 4_400_000_000
flags = np.ones(row_count, dtype=np.bool_)
data = pa.table({"flag": pa.array(flags)})
table.append(data)

iceberg_table = catalog.load_table("test_ns.test_table")
lf = pl.scan_iceberg(iceberg_table)
try:
    print(lf.head(5).collect())
except Exception as e:
    raise e
finally:
    shutil.rmtree(warehouse_path)
```
results in:
OverflowError: out of range integral type conversion attempted
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
...
TypeError: failed to extract field Extract.row_count
Log output
python dataset: convert from arrow schema
expand_datasets(): python[IcebergDataset]: limit: Some(5), project: all
IcebergDataset: to_dataset_scan(): snapshot ID: None, limit: 5, projection: None, filter_columns: None, pyarrow_predicate: None, iceberg_table_filter: None, self._use_metadata_statistics: True
IcebergDataset: to_dataset_scan(): tbl.metadata.current_snapshot_id: 5969186588860846356
IcebergDataset: to_dataset_scan(): begin path expansion
IcebergDataset: to_dataset_scan(): finish path expansion (0.004s)
IcebergDataset: to_dataset_scan(): native scan_parquet(): 2 sources, snapshot ID: None, schema ID: 0, 0 deletion files
_init_credential_provider_builder(): credential_provider_init = None
OverflowError: out of range integral type conversion attempted
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/home/kevin.jiao/repos/code/test_iceberg_polars_overflow.py", line 54, in <module>
main()
File "/mnt/home/kevin.jiao/repos/code/test_iceberg_polars_overflow.py", line 49, in main
raise e
File "/mnt/home/kevin.jiao/repos/code/test_iceberg_polars_overflow.py", line 47, in main
print(lf.head(5).collect())
^^^^^^^^^^^^^^^^^^^^
File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 97, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/lazyframe/opt_flags.py", line 328, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2422, in collect
return wrap_df(ldf.collect(engine, callback))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/io/iceberg/dataset.py", line 95, in to_dataset_scan
return scan_data.to_lazyframe(), scan_data.snapshot_id_key
^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/io/iceberg/dataset.py", line 529, in to_lazyframe
return scan_parquet(
^^^^^^^^^^^^^
File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 128, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 128, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/io/parquet/functions.py", line 672, in scan_parquet
pylf = PyLazyFrame.new_from_parquet(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: failed to extract field Extract.row_count
Issue description
There is an integer overflow bug in the Rust/Python execution interface that triggers when reading from a table with more than 2**32 rows.
Bisecting shows that v1.35.0 is the first version with the bug; v1.34.0 works as expected.
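For context, the failing extraction is consistent with the table's row count being converted to a 32-bit integer somewhere in the scan path. This is an inference from the `TypeError: failed to extract field Extract.row_count` message and the `Index type: UInt32` runtime, not a confirmed root cause; a minimal sketch of the arithmetic:

```python
# The repro appends 4.4 billion rows; a u32 index tops out just below that.
ROW_COUNT = 4_400_000_000
U32_MAX = 2**32 - 1  # 4_294_967_295

# If Extract.row_count is a u32 (assumption), this conversion must fail,
# surfacing as "OverflowError: out of range integral type conversion attempted".
print(ROW_COUNT > U32_MAX)  # True: the row count does not fit in a u32
```

Note that the overflow occurs even though only 5 rows are requested: the total row count of the scan, not the collected result size, is what fails to convert.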
Expected behavior
Reading an Iceberg table with more than 2**32 rows succeeds.
Installed versions
>>> pl.show_versions()
--------Version info---------
Polars: 1.35.0
Index type: UInt32
Platform: Linux-6.14.0-1016-aws-x86_64-with-glibc2.39
Python: 3.12.3 (main, Dec 18 2024, 01:26:45) [GCC 11.4.0]
Runtime: rt32
----Optional dependencies----
Azure CLI <not installed>
adbc_driver_manager <not installed>
altair 5.5.0
azure.identity <not installed>
boto3 1.41.5
cloudpickle 3.1.1
connectorx <not installed>
deltalake <not installed>
fastexcel <not installed>
fsspec 2023.12.2
gevent <not installed>
google.auth 2.43.0
great_tables <not installed>
matplotlib 3.10.7
numpy 1.26.4
openpyxl 3.1.2
pandas 2.3.3
polars_cloud <not installed>
pyarrow 20.0.0
pydantic 2.12.5
pyiceberg 0.10.0
sqlalchemy 2.0.44
torch <not installed>
xlsx2csv <not installed>
xlsxwriter <not installed>
You're using the 32-bit runtime. Install polars[rt64] if you want to process more than 2^32 rows.