polars icon indicating copy to clipboard operation
polars copied to clipboard

feat(python): Allow `pl.element()` in `Series.filter`

Open mcrumiller opened this issue 1 year ago • 4 comments

Closes #13772.

This allows a Series to self-reference via pl.element() in filter:

>>> pl.Series([1, 2, 3]).filter(pl.element() < 3)
shape: (2,)
Series: '' [i64]
[
        1
        2
]

Note that we now go entirely through the expression system for Series.filter, so I have removed the associated rust binding to the underlying Series.

mcrumiller avatar Aug 19 '24 16:08 mcrumiller

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 80.42%. Comparing base (41d3048) to head (663d3f8). Report is 148 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18270      +/-   ##
==========================================
- Coverage   80.42%   80.42%   -0.01%     
==========================================
  Files        1492     1492              
  Lines      198675   198666       -9     
  Branches     2841     2840       -1     
==========================================
- Hits       159785   159772      -13     
- Misses      38365    38369       +4     
  Partials      525      525              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Aug 19 '24 16:08 codecov[bot]

This looks nicer! Thanks for accepting the suggestions!

liufeimath avatar Aug 23 '24 15:08 liufeimath

@orlp can you help to review this PR? Thanks!

liufeimath avatar Aug 26 '24 15:08 liufeimath

@MarcoGorelli Hi Marco, sorry to interrupt but can you help to review this PR? I think it is a valuable change, and it's a concise one which won't take much review time. I see this PR not getting reviewed for quite a few days, hence asking.

liufeimath avatar Aug 28 '24 15:08 liufeimath

Is this PR actively reviewed? I see it hang for quite a while. Can the owners take a look? I think this is a valuable and concise change which won't take much review time.

liufeimath avatar Sep 09 '24 21:09 liufeimath

This is not something we can support as is. This PR diverges the Series behavior from that of Expr.

I am also not a fan of how element is implemented on the Polars side and want to fix that first.

ritchie46 avatar Sep 10 '24 07:09 ritchie46

This is not something we can support as is. This PR diverges the Series behavior from that of Series.

I am also not a fan of how element is implemented on the Polars side and want to fix that first.

Can you elaborate on the meaning of "diverges the Series behavior from that of Series" ?

liufeimath avatar Sep 10 '24 23:09 liufeimath

I meant Expr. Edited.

ritchie46 avatar Sep 11 '24 06:09 ritchie46

I meant Expr. Edited.

I'm still confused. Aren't they already "diverged" because their APIs are already quite different anyways? E.g. Expr.filter is for DataFrame and it works across multiple columns, which behaves more like DataFrame.filter:

import polars as pl

df = pl.DataFrame(
   {
       "a": [1, 2, 3],
       "b": [4, 5, 6],
   }
)

print(df)
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 4   │
# │ 2   ┆ 5   │
# │ 3   ┆ 6   │
# └─────┴─────┘

print(df.select(pl.col("a").filter(pl.col("b") == 5)))
# shape: (1, 1)
# ┌─────┐
# │ a   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 2   │
# └─────┘

while Series only has one column and the filter only works for that column. In my opinion, the change proposed by this PR indeed brings the two signatures closer, rather than "more diverged". The use of pl.element in Series.filter is also more aligned and consistent with how it is used in Series.list.eval. It is also more consistent with the high level API and structural design of polars: Series, in terms of elementwise operations, should be considered as a special case of an unnamed single-column DataFrame, with a flattened shape (i.e. (n, 1) => (n,) ). Isn't that the right way to unify most of the DataFrame <=> Series APIs? Based on these reasons, I think this PR promotes unification rather than divergence.

liufeimath avatar Sep 11 '24 06:09 liufeimath

@ritchie46 Can you give an example to illustrate how it (this change) diverges Series.filter from Expr.filter, and how they are "not diverged" as of the current status?

liufeimath avatar Sep 11 '24 07:09 liufeimath

@ritchie46 In terms of the current implementation of pl.element (a syntax sugar for an unnamed column): I understand that it's not ideal and it goes slightly beyond the scope of "an element of some name-agnostic 1D-array-like object (list, Series, etc)". Do you have any idea in mind to make it better and more strict?

liufeimath avatar Sep 11 '24 07:09 liufeimath

@ritchie46 Can you give an example to illustrate how it (this change) diverges Series.filter from Expr.filter, and how they are "not diverged" as of the current status?

It doesn't work in expressions, but only in Series. I will close this as it isn't the way forward and the architecture of pl.element isn't correct yet.

Do you have any idea in mind to make it better and more strict?

I don't know yet, I am not working on this atm.

ritchie46 avatar Sep 11 '24 08:09 ritchie46