
scan_parquet fails when shrinking dtypes as part of the lazy operations

Open: sambrilleman opened this issue 10 months ago. 1 comment.

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

data = pl.DataFrame({'a': [0, 1, 1000, 2000]})

data[0:2].write_parquet('part1.parquet')
data[2:4].write_parquet('part2.parquet')

# read_parquet works with shrinking dtypes

data_read = pl.read_parquet('part*.parquet')

print(data_read)

data_read_shrink = pl.read_parquet('part*.parquet').select(pl.all().shrink_dtype())

print(data_read_shrink)

# scan_parquet does not work with shrinking dtypes, no matter where we collect

data_scan = pl.scan_parquet('part*.parquet').collect()

print(data_scan)

data_scan_shrink_greedy = pl.scan_parquet('part*.parquet').collect().select(pl.all().shrink_dtype())
data_scan_shrink_lazy = pl.scan_parquet('part*.parquet').select(pl.all().shrink_dtype()).collect()

The errors are raised on the two scan_parquet lines:

>>> data = pl.DataFrame({'a': [0, 1, 1000, 2000]})
>>> 
>>> data[0:2].write_parquet('part1.parquet')
>>> data[2:4].write_parquet('part2.parquet')
>>> 
>>> # read_parquet works with shrinking dtypes
>>> 
>>> data_read = pl.read_parquet('part*.parquet')
>>> 
>>> print(data_read)
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 0    │
│ 1    │
│ 1000 │
│ 2000 │
└──────┘
>>> 
>>> data_read_shrink = pl.read_parquet('part*.parquet').select(pl.all().shrink_dtype())
>>> 
>>> print(data_read_shrink)
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ i16  │
╞══════╡
│ 0    │
│ 1    │
│ 1000 │
│ 2000 │
└──────┘
>>> 
>>> # scan_parquet does not work with shrinking dtypes, no matter where we collect
>>> 
>>> data_scan = pl.scan_parquet('part*.parquet').collect()
>>> 
>>> print(data_scan)
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 0    │
│ 1    │
│ 1000 │
│ 2000 │
└──────┘
>>> 
>>> data_scan_shrink_greedy = pl.scan_parquet('part*.parquet').collect().select(pl.all().shrink_dtype())
thread 'python' panicked at crates/polars-core/src/frame/mod.rs:957:36:
should not fail: SchemaMismatch(ErrString("cannot extend/append Int8 with Int16"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspaces/podium/packages/podium-configs/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 7701, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/podium/packages/podium-configs/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1708, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
pyo3_runtime.PanicException: should not fail: SchemaMismatch(ErrString("cannot extend/append Int8 with Int16"))
>>> data_scan_shrink_lazy = pl.scan_parquet('part*.parquet').select(pl.all().shrink_dtype()).collect()
thread 'python' panicked at crates/polars-core/src/frame/mod.rs:957:36:
should not fail: SchemaMismatch(ErrString("cannot extend/append Int8 with Int16"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspaces/podium/packages/podium-configs/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1708, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
pyo3_runtime.PanicException: should not fail: SchemaMismatch(ErrString("cannot extend/append Int8 with Int16"))

Log output

No response

Issue description

When different partitions of a parquet dataset shrink to different dtypes (e.g. integers of different widths), shrink_dtype cannot reconcile them when used with scan_parquet.

Expected behavior

I would expect read_parquet and scan_parquet to produce the same result: the column shrunk to Int16.

Installed versions

>>> pl.show_versions()
--------Version info---------
Polars:               0.20.21
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:               3.11.4 (main, Jun  7 2023, 18:32:58) [GCC 10.2.1 20210110]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            0.16.4
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              15.0.2
pydantic:             2.7.0
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.52
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


sambrilleman (Apr 17 '24 23:04)

I've had the same problem with pl.all().shrink_dtype() without scan_parquet, just with a LazyFrame and .collect(streaming=True): cannot extend/append Int8 with Int16.
I don't have a minimal reproducible example yet.

daviskirk (Apr 21 '24 15:04)

I cannot reproduce this with Polars 0.20.31 (ran the provided script).

itamarst (Jun 05 '24 15:06)