`__getstate__` of sliced string Series keeps reference to original Series values.

Open dalejung opened this issue 1 year ago • 0 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import string
import random
import polars as pl


N = 100
s = pl.Series([
    f"|{x}|" + "".join(random.sample(string.ascii_letters, 20)) for x in range(N)
])

# Verify that end marker is in full serialized state
end_row_marker = f"|{N - 1}|".encode()
original_state = s.__getstate__()
print(f"{len(original_state)=}")
assert end_row_marker in original_state

# take a 1 length slice
sliced = s.head(1)

# create an equivalent copy of the slice
good = pl.Series(sliced.to_list())
assert sliced.equals(good)

# validate the good case first.
good_state = good.__getstate__()
print(f"{len(good_state)=}")
# output state should only include marker for first row |0|
assert end_row_marker not in good_state

# The sliced state should be equivalent to good_state
sliced_state = sliced.__getstate__()
print(f"{len(sliced_state)=}")
# FAIL: still includes data for last row.
assert end_row_marker not in sliced_state

Log output

No response

Issue description

Basically re-opening https://github.com/pola-rs/polars/issues/13972.

The issue is the same as the original: for certain string Series, a sliced version still serializes the values of the original Series.

The output from the above script shows that while the serialized state of the sliced Series is smaller than that of the original, it is still much larger than the sliced values require.

# OUTPUT:
# len(original_state)=4344
# len(good_state)=440
# len(sliced_state)=2808

If you look at sliced_state, you can see that the original N values still exist.

sliced_state
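
To put a number on this instead of eyeballing the bytes, here is a minimal sketch that lists which `|i|` row markers are still present in a serialized state. The helper name `surviving_row_markers` is hypothetical (it is not part of Polars), and it assumes the state is a `bytes` object in which the string values appear uncompressed, as they do in the script above. The demo input is a stand-in byte string so the snippet runs without Polars.

```python
def surviving_row_markers(state: bytes, n_rows: int) -> list[int]:
    """Return the row indices whose "|i|" marker appears in the serialized bytes."""
    return [i for i in range(n_rows) if f"|{i}|".encode() in state]


# Stand-in for a serialized buffer (assumption: not real Polars output).
demo_state = b"header|0|abc|1|def|7|xyz"
print(surviving_row_markers(demo_state, 10))  # [0, 1, 7]
```

With the reproducible example, `surviving_row_markers(good_state, N)` should return only `[0]`, while calling it on `sliced_state` shows how many rows of the original Series are still being serialized.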

Expected behavior

The sliced_state should be similar in size to good_state, since both represent the same single value.

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-6.8.1-arch1-1-x86_64-with-glibc2.39
Python:               3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               0.9.2.post8+g4cb29ba
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               3.0.0.dev0+432.g5bcc7b7077
pyarrow:              16.0.0.dev339+g3a6c55a12.d20240320
pydantic:             1.10.13
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

dalejung, Mar 23 '24 01:03