Big files with dtype=pl.List(pl.UInt32) are written to Parquet incorrectly (row_group_size)

Open Vincenthays opened this issue 2 years ago • 1 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

Under very specific conditions, apparently when row_group_size is around a third of the DataFrame height, a pl.List(pl.UInt32) column is not written to Parquet correctly: after reading the file back, the list values come out as nulls.

Reproducible example

import polars as pl

df = pl.Series('a', [*[None]*900_000, [1, 2]], dtype=pl.List(pl.UInt32)).to_frame()
print(df.tail(1)) # every read below should print this same frame
# shape: (1, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │  list[u32]   │
# ╞══════════════╡
# │ [1, 2]       │
# └──────────────┘

df.write_parquet('test.pq')
print(pl.read_parquet('test.pq').tail(1)) # not the same (not working)
# shape: (1, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │  list[u32]   │
# ╞══════════════╡
# │ [null, null] │
# └──────────────┘

df.write_parquet('test.pq', row_group_size=300_000)
print(pl.read_parquet('test.pq').tail(1)) # the same (working)
# shape: (1, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │  list[u32]   │
# ╞══════════════╡
# │ [1, 2]       │
# └──────────────┘

df.write_parquet('test.pq', row_group_size=300_001)
print(pl.read_parquet('test.pq').tail(1)) # not the same (not working)
# shape: (1, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │  list[u32]   │
# ╞══════════════╡
# │ [null, null] │
# └──────────────┘

Expected behavior

>> import polars as pl
>> df = pl.Series('a', [*[None]*900_000, [1, 2]], dtype=pl.List(pl.UInt32)).to_frame()
>> df.tail(2)
shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[u32] │
╞═══════════╡
│ null      │
│ [1, 2]    │
└───────────┘
>> df.write_parquet('test.pq')
>> pl.read_parquet('test.pq').tail(2)
shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[u32] │
╞═══════════╡
│ null      │
│ [1, 2]    │
└───────────┘

Installed versions

---Version info---
Polars: 0.15.15
Index type: UInt32
Platform: macOS-13.1-arm64-arm-64bit
Python: 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.22.4
fsspec: 2022.11.0
connectorx: 0.3.1
xlsx2csv: <not installed>
matplotlib: <not installed>

Vincenthays, Jan 17 '23 16:01

Thanks for the issue report. In the meantime you can circumvent the issue by writing with use_pyarrow=True.

ritchie46, Jan 18 '23 09:01

This has been fixed.

stinodego avatar Jan 18 '24 21:01 stinodego