polars icon indicating copy to clipboard operation
polars copied to clipboard

Change `Series.equals` to ignore name by default

Open dctalbot opened this issue 1 year ago • 7 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
df = pl.DataFrame(
            {
                "a": [1, 2, 3],
                "b": [1, 2, 3]
            }
        )

s1 = df['a']
s2 = df['b']

# this fails unexpectedly
assert s1.equals(s2)

Log output

No response

Issue description

It looks like Series.equals requires the header names to be equal in order to return true. This surprised me, but I'm unsure what the intended behavior is. Pandas does not take the headers into account e.g.

import pandas as pd
df = pd.DataFrame(
            {
                "a": [1, 2, 3],
                "b": [1, 2, 3]
            }
        )

s1 = df['a']
s2 = df['b']

# this passes as expected
assert s1.equals(s2)

I would have expected Polars to behave the same way.

If intended, it's probably worth clarifying this in the documentation.

Expected behavior

import polars as pl
df = pl.DataFrame(
            {
                "a": [1, 2, 3],
                "b": [1, 2, 3]
            }
        )

s1 = df['a']
s2 = df['b']

# this should pass
assert s1.equals(s2)

Installed versions

--------Version info---------
Polars:               0.20.1
Index type:           UInt32
Platform:             macOS-14.1.1-arm64-arm-64bit
Python:               3.11.7 (main, Dec 19 2023, 23:11:51) [Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              14.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.23
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

dctalbot avatar Dec 24 '23 02:12 dctalbot

Agreed - if you want to check whether the names are equal as well, you can test for a.name == b.name and a.equals(b).

Wainberg avatar Dec 24 '23 02:12 Wainberg

There is also the polars.testing module which has several parameters for customizing behaviour.

https://pola-rs.github.io/polars/py-polars/html/reference/testing.html#asserts

>>> pl.testing.assert_series_equal(s1, s2)
# AssertionError: Series are different (name mismatch)
# [left]:  a
# [right]: b
>>> pl.testing.assert_series_equal(s1, s2, check_names=False)
>>> print("Equal!")
# Equal!

cmdlineluser avatar Dec 24 '23 02:12 cmdlineluser

Thanks for the link! I wouldn't want to depend on testing utils in my application though.

dctalbot avatar Dec 27 '23 19:12 dctalbot

It's not a bug. But I also think it makes more sense for equals to ignore the name by default. I think this is worth an update. We can include a check_name parameter that defaults to False. That way, we can deprecate before we break..

stinodego avatar Jan 04 '24 10:01 stinodego

@stinodego To preserve existing behavior and avoid a breaking change, shouldn't the default check_name be True? Depends on your versioning strategy I suppose. I can probably put together a PR for this.

dctalbot avatar Jan 16 '24 15:01 dctalbot

To do this properly, you would add a parameter check_name which defaults to None. And in the function body, you check for None inputs, throw a deprecation warning, and then set it to True.

Happy to review a PR for this 👍

stinodego avatar Jan 16 '24 20:01 stinodego

Hello! Long time lurker, (potentially) first time contributor. This type of change sounds like a good first issue to me, but I see it doesn't have that label. @stinodego do you think this is a good first issue? If so, would you be willing to assign me to it (since I can't on my own)? Or should I just make the changes without being assigned.

jjmarchewitz avatar Jan 25 '24 15:01 jjmarchewitz

I did some exploring and I have a question. Why is this operator's implementation located in a test utilities file? crates/polars-core/src/testing.rs equals_missing(). Also, what does _missing mean in the title?

jjmarchewitz avatar Jan 27 '24 02:01 jjmarchewitz

@jjmarchewitz

.eq_missing() behaves differently with regards to "missing values" i.e. None/null

pl.Series([1, None, 3]).eq( pl.Series([1, 2, 3]) )
# shape: (3,)
# Series: '' [bool]
# [
#     true
#     null # <- null is propagated
#     true
# ]
pl.Series([1, None, 3]).eq_missing( pl.Series([1, 2, 3]) )
# shape: (3,)
# Series: '' [bool]
# [
#     true
#     false 
#     true
# ]
pl.Series([1, None, 3]).eq_missing( pl.Series([1, None, 3]) )
# shape: (3,)
# Series: '' [bool]
# [
#     true
#     true 
#     true
# ]

There doesn't seem to be doc entries for the Series API, but: https://docs.pola.rs/docs/python/dev/reference/expressions/api/polars.Expr.eq_missing.html#polars.Expr.eq_missing should help.

cmdlineluser avatar Jan 27 '24 02:01 cmdlineluser

It could be a good first issue, depends on your skillset!

Note that the change must be made in Rust. Deprecation must happen on the Python side.

stinodego avatar Jan 27 '24 05:01 stinodego