polars
Scan parquet fails when shrinking `dtype` as part of the lazy operations
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
data = pl.DataFrame({'a': [0, 1, 1000, 2000]})
data[0:2].write_parquet('part1.parquet')
data[2:4].write_parquet('part2.parquet')
# read_parquet works with shrinking dtypes
data_read = pl.read_parquet('part*.parquet')
print(data_read)
data_read_shrink = pl.read_parquet('part*.parquet').select(pl.all().shrink_dtype())
print(data_read_shrink)
# scan_parquet does not work with shrinking dtypes, no matter where we collect
data_scan = pl.scan_parquet('part*.parquet').collect()
print(data_scan)
data_scan_shrink_greedy = pl.scan_parquet('part*.parquet').collect().select(pl.all().shrink_dtype())
data_scan_shrink_lazy = pl.scan_parquet('part*.parquet').select(pl.all().shrink_dtype()).collect()
The generated errors occur on the two scan_parquet lines:
>>> data = pl.DataFrame({'a': [0, 1, 1000, 2000]})
>>>
>>> data[0:2].write_parquet('part1.parquet')
>>> data[2:4].write_parquet('part2.parquet')
>>>
>>> # read_parquet works with shrinking dtypes
>>>
>>> data_read = pl.read_parquet('part*.parquet')
>>>
>>> print(data_read)
shape: (4, 1)
┌──────┐
│ a │
│ --- │
│ i64 │
╞══════╡
│ 0 │
│ 1 │
│ 1000 │
│ 2000 │
└──────┘
>>>
>>> data_read_shrink = pl.read_parquet('part*.parquet').select(pl.all().shrink_dtype())
>>>
>>> print(data_read_shrink)
shape: (4, 1)
┌──────┐
│ a │
│ --- │
│ i16 │
╞══════╡
│ 0 │
│ 1 │
│ 1000 │
│ 2000 │
└──────┘
>>>
>>> # scan_parquet does not work with shrinking dtypes, no matter where we collect
>>>
>>> data_scan = pl.scan_parquet('part*.parquet').collect()
>>>
>>> print(data_scan)
shape: (4, 1)
┌──────┐
│ a │
│ --- │
│ i64 │
╞══════╡
│ 0 │
│ 1 │
│ 1000 │
│ 2000 │
└──────┘
>>>
>>> data_scan_shrink_greedy = pl.scan_parquet('part*.parquet').collect().select(pl.all().shrink_dtype())
thread 'python' panicked at crates/polars-core/src/frame/mod.rs:957:36:
should not fail: SchemaMismatch(ErrString("cannot extend/append Int8 with Int16"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/workspaces/podium/packages/podium-configs/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 7701, in select
return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/podium/packages/podium-configs/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1708, in collect
return wrap_df(ldf.collect())
^^^^^^^^^^^^^
pyo3_runtime.PanicException: should not fail: SchemaMismatch(ErrString("cannot extend/append Int8 with Int16"))
>>> data_scan_shrink_lazy = pl.scan_parquet('part*.parquet').select(pl.all().shrink_dtype()).collect()
thread 'python' panicked at crates/polars-core/src/frame/mod.rs:957:36:
should not fail: SchemaMismatch(ErrString("cannot extend/append Int8 with Int16"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/workspaces/podium/packages/podium-configs/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1708, in collect
return wrap_df(ldf.collect())
^^^^^^^^^^^^^
pyo3_runtime.PanicException: should not fail: SchemaMismatch(ErrString("cannot extend/append Int8 with Int16"))
Log output
No response
Issue description
When different partitions of a parquet dataset have different "shrunk" dtypes (e.g. differently sized ints), shrink_dtype can't reconcile them when called via scan_parquet.
Expected behavior
I would expect read_parquet and scan_parquet to lead to the same outcome, i.e. the column shrunk to int16.
Installed versions
>>> pl.show_versions()
--------Version info---------
Polars: 0.20.21
Index type: UInt32
Platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python: 3.11.4 (main, Jun 7 2023, 18:32:58) [GCC 10.2.1 20210110]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: 0.16.4
fastexcel: <not installed>
fsspec: 2024.3.1
gevent: <not installed>
hvplot: <not installed>
matplotlib: <not installed>
nest_asyncio: <not installed>
numpy: 1.26.4
openpyxl: <not installed>
pandas: 2.2.2
pyarrow: 15.0.2
pydantic: 2.7.0
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 1.4.52
xlsx2csv: <not installed>
xlsxwriter: <not installed>
I've had the same problem with pl.all().shrink_dtype() without scan_parquet, just with a LazyFrame and .collect(streaming=True): cannot extend/append Int8 with Int16.
Don't have a minimal reproducing example yet.
I cannot reproduce this with Polars 0.20.31 (ran the provided script).