[Python] Expose all operators Expr implements as methods

Open laundmo opened this issue 2 years ago • 5 comments

Describe your feature request

Currently, Polars only exposes some operators as methods of Expr. I propose exposing all of them as methods of Expr.

Advantages:

  • consistent docs, all operators have corresponding functions which document the types.
    • Being able to find the pow function in docs but none of the other operators caused me no end of confusion.
  • can be more readable when used in code, especially for longer chains

Additionally, it might make sense to use this as a way to document the operators and what arguments they can take, as I was unable to find anything about that.

Currently implemented operators, and their method version if present:

dunder operator   method
__invert__        self.is_not
__xor__
__rxor__
__and__
__rand__
__or__
__ror__
__add__
__radd__
__sub__
__rsub__
__mul__
__rmul__
__truediv__
__rtruediv__
__floordiv__
__rfloordiv__
__rmod__
__mod__
__pow__           self.pow
__rpow__
__ge__            self.gt_eq
__le__            self.lt_eq
__eq__            self.eq
__ne__            self.neq
__lt__            self.lt
__gt__            self.gt
__neg__

Edit: it would probably make sense to do the same for DataFrame and Series operators, not just Expr.
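To make the request concrete, here is a minimal sketch of the pattern being asked for: named methods carry the logic and the docstrings, and the dunder operators delegate to them. The `Expr` class and `col` helper below are hypothetical toys that only build a string; they are not the Polars implementation and say nothing about how Polars structures this internally.

```python
class Expr:
    """Toy stand-in for an expression (hypothetical, NOT Polars' Expr)."""

    def __init__(self, text: str) -> None:
        self._text = text

    def __repr__(self) -> str:
        return self._text

    # Named methods: these are what would show up in the API docs.
    def add(self, other: object) -> "Expr":
        """Return an expression representing ``self + other``."""
        return Expr(f"({self!r} + {other!r})")

    def mul(self, other: object) -> "Expr":
        """Return an expression representing ``self * other``."""
        return Expr(f"({self!r} * {other!r})")

    # Dunder operators simply delegate to the named methods.
    def __add__(self, other: object) -> "Expr":
        return self.add(other)

    def __mul__(self, other: object) -> "Expr":
        return self.mul(other)

    # Reflected variant, used when the left operand is not an Expr.
    def __radd__(self, other: object) -> "Expr":
        return Expr(f"({other!r} + {self!r})")


def col(name: str) -> Expr:
    """Hypothetical helper mirroring the shape of a column reference."""
    return Expr(f"col({name!r})")


chained = col("a").add(col("b")).mul(10)   # method style
infix = (col("a") + col("b")) * 10         # operator style
print(chained)
print(infix)
```

Both spellings build the same expression in this toy, which is the property the proposal relies on.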

laundmo avatar Aug 02 '22 13:08 laundmo

This would create a redundancy and would create differences in how users would write polars queries, which I want to keep to a minimum.

I think I will even follow up by deprecating the comparison methods, as they go against this principle. (I forgot we have those.)

Maybe we could document the dunders we implement to make them more discoverable?

ritchie46 avatar Aug 02 '22 13:08 ritchie46

I'm really not a fan of the way using operators makes queries look, so while understandable, this isn't great to hear.

I don't really think that writing queries in different ways would be more of an issue than it already is. After all, I can already use the dunders directly to write the same query without operators; they're just ugly.

laundmo avatar Aug 02 '22 13:08 laundmo

> I don't really think that writing queries in different ways would be more of an issue than it already is. After all, I can already use the dunders directly to write the same query without operators; they're just ugly.

Yep, but that would be frowned upon.

I feel that it is a matter of taste. I think col(a) + col(b) is more readable than col(a).add(col(b)), so I want to nudge users in that direction.

Given my earlier point that I want to keep redundancy in operators to a minimum, I have to choose one.

And then I go for my taste, as I have to look at it most. :)

ritchie46 avatar Aug 03 '22 08:08 ritchie46

Given the many requests for this, I am willing to accept a PR that implements those on the expressions.

ritchie46 avatar Nov 19 '22 11:11 ritchie46

I would also be happy about an implementation, but I can also understand your objection, @ritchie46. For very simple cases I think you are right that col(a) + col(b) is cleaner than col(a).add(col(b)). For more complex expressions, which are what mostly come up in practice, I see it like @laundmo. Example:

import polars as pl

df = pl.DataFrame({
    "age": [20, 45, 33, 21, 55],
    "height": [1.80, 1.90, 1.70, 1.65, 1.80],
    "weight": [80, 90, 70, 65, 80],
})

# polars
(
    df
    .filter(
        ((pl.col("age") % 10) != 0) &
        (pl.col("height") > 1.75) &
        (
            ((pl.col("weight") + 10) > 80) |
            (pl.col("weight") < 70)
        )
    )
)

# pandas (assumes df is a pandas DataFrame here, e.g. obtained via df.to_pandas())
(
    df[
        df["age"].mod(10).ne(0) &
        df["height"].gt(1.75) &
        (
            df["weight"].add(10).gt(80) |
            df["weight"].lt(70)
        )
    ]
)

With an implementation, however, we would have to think about the naming. The suggestions above do not correspond to those from pandas/dunder! I think we should stick with the dunder/pandas syntax (ge, gt, eq, ne) instead of gt_eq, gt, eq, neq.
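As a small illustration of the naming point, Python's standard `operator` module already uses the short names (ge, gt, le, lt, eq, ne), which are the dunder names with the underscores stripped. The `method_name` helper below is hypothetical and only demonstrates that mapping; it does not reflect any naming decision made by Polars.

```python
import operator

# Comparison dunders from the table in the original request.
DUNDERS = ["__ge__", "__gt__", "__le__", "__lt__", "__eq__", "__ne__"]


def method_name(dunder: str) -> str:
    """Derive a method name from a dunder by stripping the underscores."""
    return dunder.strip("_")


for dunder in DUNDERS:
    name = method_name(dunder)
    # The stdlib exposes a function under exactly this short name.
    func = getattr(operator, name)
    print(f"{dunder:8} -> {name:3} -> operator.{func.__name__}(3, 2) = {func(3, 2)}")
```

One wrinkle with this scheme: `and` and `or` are Python keywords, so the standard library spells those functions `operator.and_` and `operator.or_`, and a method-based API would need a similar workaround.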

Julian-J-S avatar Nov 19 '22 15:11 Julian-J-S

@ritchie46 I think it doesn't matter if you have only one expression, but operators do break the pattern of the code in cases like this example from the docs:

df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        (pl.mean("nrs") * 10).alias("10xnrs"),
    ]
)

I find the code would be clearer if we could keep the pattern and write this:

df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        pl.mean("nrs").mul(10).alias("10xnrs"),
    ]
)

Or maybe have a method .do() that takes any math operation. For example, .do(+5), .do(*8), so:

df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        pl.mean("nrs").do(*10).alias("10xnrs"),
    ]
)
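A note on the `.do(*10)` spelling: in Python, `f(*10)` is parsed as argument unpacking and fails at runtime because an integer is not iterable, while `f(+5)` just passes the number 5. A variant of the idea that does work is to pass the operation as a callable. The sketch below uses a hypothetical toy `Value` class and the standard `operator` module; the `do` method is an assumption made for illustration and is not an existing Polars method.

```python
import operator
from typing import Any, Callable


class Value:
    """Hypothetical toy wrapper, used only to illustrate the call shape."""

    def __init__(self, data: float) -> None:
        self.data = data

    def do(self, func: Callable[..., Any], *args: Any) -> "Value":
        # Apply func(self.data, *args) and wrap the result so calls can chain.
        return Value(func(self.data, *args))


# Reads left to right: multiply by 10, then add 5.
result = Value(3.0).do(operator.mul, 10).do(operator.add, 5)
print(result.data)
```

Compared with dedicated `mul`/`add` methods, this keeps the API surface to a single method, at the cost of a noisier call site.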

I would also argue that, under the current implementation, the code can get a bit messy for a complex calculation, since the only way to define the order of operations is with parentheses. For example:

df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        (((pl.mean("nrs") - pl.col("nrs"))*10 + 10) / (pl.col('nrs')*100)).alias("10xnrs"),
    ]
)

If we had a div() method, we could at least break it into two parts, I guess. I really hope you will consider adding some math methods.

stevenlis avatar Mar 13 '23 02:03 stevenlis

Could you make a PR for this?

ritchie46 avatar Mar 13 '23 08:03 ritchie46

@ritchie46 would love to, but it's really beyond my technical skill. 😅 BTW, shouldn't this be tracked by a separate issue? I think the OP was asking for something different.

stevenlis avatar Mar 13 '23 13:03 stevenlis

I'll take care of this one.

alexander-beedie avatar Mar 20 '23 06:03 alexander-beedie