
`OverflowError` when reading iceberg table with > 2**32 rows

Open · KevinJiao opened this issue 2 weeks ago · 1 comment

Checks

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

  import shutil
  import tempfile

  import numpy as np
  import polars as pl
  import pyarrow as pa
  from pyiceberg.catalog.sql import SqlCatalog

  warehouse_path = str(tempfile.mkdtemp())

  catalog = SqlCatalog(
      "test_catalog",
      **{
          "uri": f"sqlite:///{warehouse_path}/catalog.db",
          "warehouse": f"file://{warehouse_path}",
      },
  )
  catalog.create_namespace_if_not_exists("test_ns")
  table = catalog.create_table(
      "test_ns.test_table",
      schema=pa.schema([pa.field("flag", pa.bool_())]),
  )

  # uint32 max is ~4.29B (4_294_967_295); this row count exceeds it
  row_count = 4_400_000_000
  flags = np.ones(row_count, dtype=np.bool_)
  data = pa.table({"flag": pa.array(flags)})
  table.append(data)

  iceberg_table = catalog.load_table("test_ns.test_table")

  lf = pl.scan_iceberg(iceberg_table)
  try:
      print(lf.head(5).collect())
  finally:
      shutil.rmtree(warehouse_path)

results in

OverflowError: out of range integral type conversion attempted

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
...
TypeError: failed to extract field Extract.row_count

Log output

python dataset: convert from arrow schema
expand_datasets(): python[IcebergDataset]: limit: Some(5), project: all
IcebergDataset: to_dataset_scan(): snapshot ID: None, limit: 5, projection: None, filter_columns: None, pyarrow_predicate: None, iceberg_table_filter: None, self._use_metadata_statistics: True
IcebergDataset: to_dataset_scan(): tbl.metadata.current_snapshot_id: 5969186588860846356
IcebergDataset: to_dataset_scan(): begin path expansion
IcebergDataset: to_dataset_scan(): finish path expansion (0.004s)
IcebergDataset: to_dataset_scan(): native scan_parquet(): 2 sources, snapshot ID: None, schema ID: 0, 0 deletion files
_init_credential_provider_builder(): credential_provider_init = None
OverflowError: out of range integral type conversion attempted

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/home/kevin.jiao/repos/code/test_iceberg_polars_overflow.py", line 54, in <module>
    main()
  File "/mnt/home/kevin.jiao/repos/code/test_iceberg_polars_overflow.py", line 49, in main
    raise e
  File "/mnt/home/kevin.jiao/repos/code/test_iceberg_polars_overflow.py", line 47, in main
    print(lf.head(5).collect())
          ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 97, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/lazyframe/opt_flags.py", line 328, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2422, in collect
    return wrap_df(ldf.collect(engine, callback))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/io/iceberg/dataset.py", line 95, in to_dataset_scan
    return scan_data.to_lazyframe(), scan_data.snapshot_id_key
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/io/iceberg/dataset.py", line 529, in to_lazyframe
    return scan_parquet(
           ^^^^^^^^^^^^^
  File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 128, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 128, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/home/kevin.jiao/repos/code/.venv/lib/python3.12/site-packages/polars/io/parquet/functions.py", line 672, in scan_parquet
    pylf = PyLazyFrame.new_from_parquet(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: failed to extract field Extract.row_count

Issue description

There is an integer overflow bug at the Rust/Python boundary of the Polars execution interface: it triggers when reading from a table with more than 2**32 rows, and the traceback shows the failure occurring while extracting `row_count` in `PyLazyFrame.new_from_parquet` (`TypeError: failed to extract field Extract.row_count`).

Bisecting releases shows that the bug was introduced in v1.35.0; v1.34.0 works as expected.
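The `OverflowError` message is consistent with a checked narrowing of the 64-bit row count into a 32-bit integer at that boundary. A minimal stdlib illustration of why such a conversion must fail for this table size (this is not Polars code, just the same checked-narrowing behavior):

```python
import struct

row_count = 4_400_000_000  # > 2**32 - 1

# Packing as an unsigned 32-bit integer is a checked conversion and fails,
# analogous to extracting row_count into a 32-bit index on the Rust side:
try:
    struct.pack("<I", row_count)
except struct.error as exc:
    print("u32 conversion failed:", exc)

# The same value fits comfortably in an unsigned 64-bit integer:
struct.pack("<Q", row_count)
```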

Expected behavior

Reading an Iceberg table with more than 2**32 rows succeeds.

Installed versions

>>> pl.show_versions()
--------Version info---------
Polars:              1.35.0
Index type:          UInt32
Platform:            Linux-6.14.0-1016-aws-x86_64-with-glibc2.39
Python:              3.12.3 (main, Dec 18 2024, 01:26:45) [GCC 11.4.0]
Runtime:             rt32

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                1.41.5
cloudpickle          3.1.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2023.12.2
gevent               <not installed>
google.auth          2.43.0
great_tables         <not installed>
matplotlib           3.10.7
numpy                1.26.4
openpyxl             3.1.2
pandas               2.3.3
polars_cloud         <not installed>
pyarrow              20.0.0
pydantic             2.12.5
pyiceberg            0.10.0
sqlalchemy           2.0.44
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

KevinJiao avatar Dec 05 '25 00:12 KevinJiao

You're using the 32-bit runtime. Install polars[rt64] if you want to process more than 2^32 rows.
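For context, the 32-bit runtime addresses rows with an unsigned 32-bit index, so the maximum addressable row count is 2**32 - 1. A quick sketch of that limit (`fits_default_runtime` is a hypothetical helper for illustration, not a Polars API):

```python
U32_MAX = 2**32 - 1  # 4_294_967_295: max row count under a UInt32 index

def fits_default_runtime(row_count: int) -> bool:
    """True if the row count is addressable with a UInt32 index."""
    return 0 <= row_count <= U32_MAX

print(fits_default_runtime(4_400_000_000))  # False: needs the 64-bit runtime
```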

orlp avatar Dec 05 '25 11:12 orlp