polars
[Python] Expose all operators Expr implements as methods
Describe your feature request
Currently, Polars only exposes some operators as methods of Expr. I propose exposing all of them as methods of Expr.
Advantages:
- consistent docs, all operators have corresponding functions which document the types.
- Being able to find the pow function in the docs but none of the other operators caused me no end of confusion.
- can be more readable when used in code, especially for longer chains
Additionally, it might make sense to use this as a way to document the operators and what arguments they can take, as I was unable to find anything about that.
Currently implemented operators, and their method version if present:
dunder operator | method |
---|---|
__invert__ | self.is_not |
__xor__ | |
__rxor__ | |
__and__ | |
__rand__ | |
__or__ | |
__ror__ | |
__add__ | |
__radd__ | |
__sub__ | |
__rsub__ | |
__mul__ | |
__rmul__ | |
__truediv__ | |
__rtruediv__ | |
__floordiv__ | |
__rfloordiv__ | |
__rmod__ | |
__mod__ | |
__pow__ | self.pow |
__rpow__ | |
__ge__ | self.gt_eq |
__le__ | self.lt_eq |
__eq__ | self.eq |
__ne__ | self.neq |
__lt__ | self.lt |
__gt__ | self.gt |
__neg__ | |
Edit: it would probably make sense to do the same for DataFrame and Series operators, not just Expr
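To make the request a bit more concrete, here is a minimal sketch of how named methods could simply delegate to the dunder implementations that already exist. ToyExpr is a hypothetical stand-in written for illustration, not Polars' Expr, and the method names add, sub, mul and mod are assumptions rather than an agreed API.

```python
class ToyExpr:
    """Hypothetical stand-in for an expression; it only records a string."""

    def __init__(self, text: str) -> None:
        self.text = text

    @staticmethod
    def _show(other: object) -> str:
        # Render another ToyExpr by its text, anything else by its repr.
        return other.text if isinstance(other, ToyExpr) else repr(other)

    def _binary(self, symbol: str, other: object) -> "ToyExpr":
        return ToyExpr(f"({self.text} {symbol} {self._show(other)})")

    # Operators, as they would exist today.
    def __add__(self, other: object) -> "ToyExpr":
        return self._binary("+", other)

    def __sub__(self, other: object) -> "ToyExpr":
        return self._binary("-", other)

    def __mul__(self, other: object) -> "ToyExpr":
        return self._binary("*", other)

    def __mod__(self, other: object) -> "ToyExpr":
        return self._binary("%", other)

    # Named methods (assumed names): thin, documentable aliases.
    def add(self, other: object) -> "ToyExpr":
        """Method form of ``self + other``."""
        return self.__add__(other)

    def sub(self, other: object) -> "ToyExpr":
        """Method form of ``self - other``."""
        return self.__sub__(other)

    def mul(self, other: object) -> "ToyExpr":
        """Method form of ``self * other``."""
        return self.__mul__(other)

    def mod(self, other: object) -> "ToyExpr":
        """Method form of ``self % other``."""
        return self.__mod__(other)


a, b = ToyExpr("a"), ToyExpr("b")
print((a + b).text)            # operator form
print(a.add(b).text)           # method form, same result
print(a.sub(b).mul(10).text)   # chaining reads left to right
```

Because each method is only an alias, the operator and method forms cannot drift apart, and each method gets a docstring of its own that the docs can pick up.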
This would create a redundancy and would create differences in how users would write polars queries, which I want to keep to a minimum.
I think I will even follow up with deprecating the comparison methods, as they go against this principle. (I forgot we have those.)
Maybe we could document the dunders we implement to make them more discoverable?
I'm really not a fan of the way using operators makes queries look, so while understandable, this isn't great to hear.
I don't really think that writing queries in different ways would be more of an issue than it already is. After all, I can already use the dunders directly to write the same query without operators; they're just ugly.
> I don't really think that writing queries in different ways would be more of an issue than it already is. After all, I can already use the dunders directly to write the same query without operators; they're just ugly.
Yep, but that would be frowned upon.
I feel that it is a matter of taste. I think col(a) + col(b) is more readable than col(a).add(col(b)), so I want to nudge in that direction.
Given my earlier point that I want to keep redundancy in operators to a minimum, I have to choose one.
And then I go for my taste, as I have to look at it most. :)
Given the many requests for this, I am willing to accept a PR that implements those on the expressions.
I would also be happy about an implementation.
But I can also understand your objection @ritchie46
I think for very simple cases you are right that col(a) + col(b) is cleaner than col(a).add(col(b)).
For more complex applications, which are the more common case in practice, I see it like @laundmo.
Example:
import polars as pl

df = pl.DataFrame({
    "age": [20, 45, 33, 21, 55],
    "height": [1.80, 1.90, 1.70, 1.65, 1.80],
    "weight": [80, 90, 70, 65, 80],
})

# polars
(
    df
    .filter(
        ((pl.col("age") % 10) != 0) &
        (pl.col("height") > 1.75) &
        (
            ((pl.col("weight") + 10) > 80) |
            (pl.col("weight") < 70)
        )
    )
)

# pandas (df is assumed here to be the equivalent pandas DataFrame)
(
    df[
        df["age"].mod(10).ne(0) &
        df["height"].gt(1.75) &
        (
            df["weight"].add(10).gt(80) |
            df["weight"].lt(70)
        )
    ]
)
With an implementation, however, we would have to think about the naming.
The suggestions above do not correspond to those from pandas/dunder!
I think we should stick with the dunder/pandas syntax (ge, gt, eq, ne) instead of gt_eq, gt, eq, neq.
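For what it's worth, the short names also match Python's standard operator module, which exposes ge, gt, le, lt, eq and ne as functions. A small sketch that derives a method name from each comparison dunder in the table above by stripping the underscores; the helper method_name is hypothetical and only there for illustration.

```python
import operator

# Comparison dunders taken from the table in the original post.
COMPARISON_DUNDERS = ["__ge__", "__le__", "__eq__", "__ne__", "__lt__", "__gt__"]


def method_name(dunder: str) -> str:
    """Hypothetical helper: drop the leading and trailing underscores."""
    return dunder.strip("_")


names = [method_name(d) for d in COMPARISON_DUNDERS]
print(names)

# Each derived name is also a function in the standard operator module.
for name in names:
    assert hasattr(operator, name), name
```

With this convention nobody has to remember a second naming scheme: the method name is the dunder name without the underscores.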
@ritchie46 I think if you have only one expression it doesn't matter, but it does break the pattern of your code in cases like this example from the docs:
df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        (pl.mean("nrs") * 10).alias("10xnrs"),
    ]
)
I find the code would be clearer if we could keep the pattern and do this:
df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        pl.mean("nrs").mul(10).alias("10xnrs"),
    ]
)
Or maybe have a method .do() to take any math operations. For example, .do(+5), .do(*8), so:
df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        pl.mean("nrs").do(*10).alias("10xnrs"),
    ]
)
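One caveat: Python would parse .do(*10) as argument unpacking rather than as "multiply by 10", so the operator would have to be passed separately. A rough sketch of that idea, assuming a hypothetical do(op, other) method and using the standard library operator module. Value is a toy wrapper made up for this example, not a Polars type.

```python
import operator
from typing import Any, Callable


class Value:
    """Toy wrapper around a plain Python value (hypothetical, for illustration)."""

    def __init__(self, v: Any) -> None:
        self.v = v

    def do(self, op: Callable[[Any, Any], Any], other: Any) -> "Value":
        # Apply a binary operator function and wrap the result so calls can chain.
        return Value(op(self.v, other))


# Reads left to right, without extra parentheses: (4.5 * 10) + 5
result = Value(4.5).do(operator.mul, 10).do(operator.add, 5)
print(result.v)  # 50.0
```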
I would also argue that the code can get a bit messy under the current implementation for a complex calculation, since the only way to define the order of operations is via (), for example:
df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        (((pl.mean("nrs") - pl.col("nrs")) * 10 + 10) / (pl.col("nrs") * 100)).alias("10xnrs"),
    ]
)
If we had a div() method, we could at least break it into two parts, I guess. I really hope you will consider adding some math methods.
Could you make a PR for this?
@ritchie46 I would love to, but it's really beyond my technical skill 😅. By the way, shouldn't this be tracked in a separate issue? I think the OP was asking for something different.
I'll take care of this one.