polars
`__getstate__` of sliced string Series keeps reference to original Series values.
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import string
import random
import polars as pl
N = 100
s = pl.Series([
f"|{x}|" + "".join(random.sample(string.ascii_letters, 20)) for x in range(N)
])
# Verify that end marker is in full serialized state
end_row_marker = f"|{N - 1}|".encode()
original_state = s.__getstate__()
print(f"{len(original_state)=}")
assert end_row_marker in original_state
# take a 1 length slice
sliced = s.head(1)
# create an equivalent copy of the slice
good = pl.Series(sliced.to_list())
assert sliced.equals(good)
# validate the good case first.
good_state = good.__getstate__()
print(f"{len(good_state)=}")
# output state should only include marker for first row |0|
assert end_row_marker not in good_state
# Sliced should be equivalent of good_state
sliced_state = sliced.__getstate__()
print(f"{len(sliced_state)=}")
# FAIL: still includes data for last row.
assert end_row_marker not in sliced_state
Log output
No response
Issue description
Basically re-opening https://github.com/pola-rs/polars/issues/13972.
The issue is the same as the original: for certain string Series, a sliced version still serializes values from the original Series.
The output from the above script shows that while the sliced state is smaller than the original state, it is still much larger than the sliced values require.
# OUTPUT:
# len(original_state)=4344
# len(good_state)=440
# len(sliced_state)=2808
If you look at `sliced_state`, you can see that the original N values still exist.
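To make that inspection concrete, here is a minimal sketch that scans a serialized state for `|i|` row markers that should no longer be present. The helper name `leaked_row_markers` and the synthetic `fake_state` bytes are illustrative only and not part of Polars. The scan relies on the markers being unique, which holds in the script above because the random payload contains only ASCII letters.

```python
def leaked_row_markers(state: bytes, n_rows: int, kept_rows: int) -> list[int]:
    """Return indices of rows at or beyond `kept_rows` whose |i| marker is still in `state`."""
    return [
        i
        for i in range(kept_rows, n_rows)
        if f"|{i}|".encode() in state
    ]


# Synthetic stand-in for a serialized buffer (hypothetical data, not real Polars output).
fake_state = b"header|0|abc|1|def|2|ghi"
print(leaked_row_markers(fake_state, n_rows=3, kept_rows=1))  # prints [1, 2]
```

With the reproducer, the call would be `leaked_row_markers(sliced_state, N, 1)`. An empty list means no trailing rows were serialized.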
Expected behavior
`sliced_state` should be similar in size to `good_state`.
Installed versions
--------Version info---------
Polars: 0.20.16
Index type: UInt32
Platform: Linux-6.8.1-arch1-1-x86_64-with-glibc2.39
Python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 3.0.0
connectorx: <not installed>
deltalake: <not installed>
fastexcel: <not installed>
fsspec: 2024.2.0
gevent: <not installed>
hvplot: 0.9.2.post8+g4cb29ba
matplotlib: 3.8.3
numpy: 1.26.4
openpyxl: 3.1.2
pandas: 3.0.0.dev0+432.g5bcc7b7077
pyarrow: 16.0.0.dev339+g3a6c55a12.d20240320
pydantic: 1.10.13
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>