polars
Big files with dtype=pl.List(pl.UInt32) are written to parquet incorrectly (row_group_size)
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Under very specific conditions, apparently when row_group_size is about a third of the DataFrame height, a pl.List(pl.UInt32) column is not written properly: in the example below, the final [1, 2] value is read back as [null, null].
Reproducible example
import polars as pl
df = pl.Series('a', [*[None]*900_000, [1, 2]], dtype=pl.List(pl.UInt32)).to_frame()
print(df.tail(1)) # all the output should be the same as this one
# shape: (1, 1)
# ┌──────────────┐
# │ a │
# │ --- │
# │ list[u32] │
# ╞══════════════╡
# │ [1, 2] │
# └──────────────┘
df.write_parquet('test.pq')
print(pl.read_parquet('test.pq').tail(1)) # not the same (not working)
# shape: (1, 1)
# ┌──────────────┐
# │ a │
# │ --- │
# │ list[u32] │
# ╞══════════════╡
# │ [null, null] │
# └──────────────┘
df.write_parquet('test.pq', row_group_size=300_000)
print(pl.read_parquet('test.pq').tail(1)) # the same (working)
# shape: (1, 1)
# ┌──────────────┐
# │ a │
# │ --- │
# │ list[u32] │
# ╞══════════════╡
# │ [1, 2] │
# └──────────────┘
df.write_parquet('test.pq', row_group_size=300_001)
print(pl.read_parquet('test.pq').tail(1)) # not the same (not working)
# shape: (1, 1)
# ┌──────────────┐
# │ a │
# │ --- │
# │ list[u32] │
# ╞══════════════╡
# │ [null, null] │
# └──────────────┘
Expected behavior
>> import polars as pl
>> df = pl.Series('a', [*[None]*900_000, [1, 2]], dtype=pl.List(pl.UInt32)).to_frame()
>> df.tail(2)
shape: (2, 1)
┌───────────┐
│ a │
│ --- │
│ list[u32] │
╞═══════════╡
│ null │
│ [1, 2] │
└───────────┘
>> df.write_parquet('test.pq')
>> pl.read_parquet('test.pq').tail(2)
shape: (2, 1)
┌───────────┐
│ a │
│ --- │
│ list[u32] │
╞═══════════╡
│ null │
│ [1, 2] │
└───────────┘
Installed versions
---Version info---
Polars: 0.15.15
Index type: UInt32
Platform: macOS-13.1-arm64-arm-64bit
Python: 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.22.4
fsspec: 2022.11.0
connectorx: 0.3.1
xlsx2csv: <not installed>
matplotlib: <not installed>
Thanks for the issue report. In the meantime you can circumvent the issue by writing with use_pyarrow=True.
This has been fixed.