
Across APIs

Open · josevalim opened this issue 2 years ago • 6 comments

@cigrainger and I were discussing ideas to improve the dataframe API. One of the ideas is to bring dplyr's recent across functionality, so we can do:

|> DF.filter(across(sepal_width < 10))

Or even to filter across multiple columns:

|> DF.filter(across(col <- [sepal_width, petal_width], col < 10))

However, @cigrainger mentioned that we may want to just make DF.filter/2 a macro, so we can write this:

|> DF.filter(col <- [sepal_width, petal_width], col < 10)

The following operations need to support across (or will need to be macros): arrange, filter, summarize, and mutate. The goal is to discuss whether we should go with an explicit across or use macros.

across or macros

The benefit of across is that Explorer will have a single macro, across, which we get from import Explorer. However, we will need to type across almost every time we use one of the operations above, which would make them more verbose.

Macros solve the verbosity problem, at the cost of adding more magic. Also, for every macro API, we should have a non-macro version. However, I think we can easily tackle this with a _with suffix convention: filter/2 is a macro that supports the across functionality, and filter_with/2 is the version that accepts anonymous functions. Then we can do the same with mutate and mutate_with, and so on.
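
A rough sketch of how the two flavors could read, assuming filter/2 becomes the macro and filter_with/2 takes an anonymous function (the names just follow the convention proposed above, not a settled API):

# Macro flavor: bare column names and infix comparisons
df |> DF.filter(sepal_width < 10)

# Function flavor: an anonymous function receiving the dataframe
df |> DF.filter_with(fn df -> Series.less(df["sepal_width"], 10) end)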

@cigrainger, do you have any further thoughts on the preferred approach?

josevalim · May 07 '22 15:05

Oh, I really like the _with suffix idea! I think having the macro version of filter/mutate/summarise means that the eager Polars backend can use the lazy API behind the scenes, permitting a more flexible interface. This is particularly important for summarise, which right now is quite Polars-specific and lacking in functionality. I also think this will be important for a SQL backend, to avoid the need to copy from local to source.

What do I mean by this?

|> mutate(new_column: fn df -> Series.greater(df["column_a"], df["column_b"]) end)

This currently calculates the value of the callback eagerly. What I'd like to be able to do is always capture the column values and turn them into lazy series (expressions). This would mean the backend is getting a uniform value (always receiving a lazy series) that it can do with what it pleases.
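
As a rough sketch of that idea (using the mutate_with name from the convention above; the exact signature is illustrative, not settled):

df
|> DF.mutate_with(fn ldf ->
  # ldf["column_a"] and ldf["column_b"] would be lazy series (expressions),
  # so the backend decides when and how to execute the comparison.
  [new_column: Series.greater(ldf["column_a"], ldf["column_b"])]
end)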

I'm particularly excited about summarise because it would mean we can use the lazy groupby functionality. This could yield an API like this, even for eager:

|> DF.group_by("class")
|> DF.summarise(
  mean_sepal_width: Series.mean(sepal_width), 
  max_combined: Series.max(sepal_width + sepal_length), 
  sum_over_n: Series.sum(sepal_width > 10)
)

It means we can use the Series API arbitrarily, as long as we return a single value, instead of being locked to Polars's aggregate functions (which may not be the same for all backends).

So for me it's not just the reduction in verbosity (which is nice), but also the ability to always use contextual instructions instead of realising a series immediately.

I would think we'd still want across as that's a really tidy API for working with multiple columns at a time.

For example, I'd like to be able to write:

|> filter(sepal_width > 10)

as well as

|> DF.filter(across(col <- [sepal_width, petal_width], col < 10))

cigrainger · May 07 '22 20:05

but also the ability to always use contextual instructions instead of realising a series immediately.

To clarify, this will always be true for both APIs! We can change DF.filter(..., fn df -> ... end) (or DF.filter_with) so the dataframe received by the function always yields a "LazySeries" when columns are accessed. This is why I think we should split filter into two functions (filter and mask, but you may have better ideas for the second name, see #224).
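
A minimal sketch of that split, assuming filter_with builds a lazy expression while mask (name still open, see #224) takes an already realized boolean series:

# Lazy: the callback sees lazy series, so nothing is computed up front
df |> DF.filter_with(fn df -> Series.greater(df["sepal_width"], 10) end)

# Eager: the boolean series is realized first, then applied as a mask
bool = Series.greater(df["sepal_width"], 10)
df |> DF.mask(bool)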

The same goes for mutate and summarize:

DF.summarize(df, fn df -> [width_avg: Series.average(df["width"])] end)

The df passed to the callback will emit lazy series, so we can compute all values properly.

So the two approaches discussed here come down to a style choice. That said, I also like your idea of supporting the macro syntax and using across only when folks want to traverse multiple columns. If we go this route, my only question is where the across macro should live.

josevalim · May 07 '22 21:05

I didn't even consider it, but I see your point that we can just convert to lazy in the backend before applying the callback, and that will provide lazy series. I also like requiring the second argument to be a single callback that returns a keyword list, instead of a keyword list of callbacks (or raw values).

The raw value thing is a challenge that you've identified in #224. I think _with suffixes make plenty of sense here and can be uniform across the verbs (instead of mask or other verb-specific functions). Like arrange_with. It makes sense to me that we're doing this verb "with" this realised Series/list.
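
For instance, an arrange_with under that convention could look roughly like this (illustrative only):

df |> DF.arrange_with(fn df -> [asc: df["sepal_width"]] end)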

So what are the other drawbacks of using a macro instead of a callback? On the one hand, I think Elixir syntax is "tidy enough"... on the other, I think vectorised infix operators are more natural as are unwrapped column names.

Compare:

DF.summarize(df, fn df -> [width_avg: Series.average(df["width"])] end)

to

DF.summarize(df, width_avg: Series.average(width))

Honestly, I'm wondering if we need to settle on this right away. I think there would be plenty of immediate benefit in locking down the non-macro API:

  1. Replace the keyword list of callbacks/values with a single callback returning a keyword list (see the sketch after this list)
  2. Discourage raw/realised Series/list use in filter, arrange, et al by only taking a callback and providing filter_with and arrange_with instead (could also be _by).
  3. Change the backend to use lazy under the hood with the callbacks
  4. Change the group_by/summarise API to more closely resemble mutate
  5. Assess whether we should use the macro syntax
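
To make step 1 concrete, the change would roughly be the following (a sketch only, not a settled signature):

# Before: a keyword list where each value is a callback (or a raw value)
DF.mutate(df, new: fn df -> Series.add(df["a"], df["b"]) end)

# After: one callback returning the whole keyword list
DF.mutate(df, fn df -> [new: Series.add(df["a"], df["b"])] end)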

One thing I'll note is that providing across without using the macro syntax elsewhere may be confusing: folks might think they can use the macro syntax when they can't, and it muddies the point of across, which is to traverse multiple columns. I think we should basically do both or neither. I lean towards both, but I also think it may be a moving target until we do the steps above.

cigrainger · May 09 '22 00:05

Honestly, I'm wondering if we need to settle on this right away. I think there would be plenty of immediate benefit in locking down the non-macro API:

I agree we should do this asap because this change is orthogonal to across. This discussion is mostly about choosing the API, not the functionality, and I think the functionality could have immediate use!

I also agree with your list of steps. Until we have the macro versions, I think we should only have filter_with, arrange_with, etc. For functions that receive realized series, maybe they should have a different name, especially because functions that work with realized series most likely won't work on lazy backends. Are there any other functions besides filter for which you believe we should expose the realized series API? If so, we can continue the discussion in #224.

josevalim · May 09 '22 06:05

I think we should only have filter_with, arrange_with, etc

Actually, let's keep the existing APIs around, introduce the new ones, and then replace the old ones with macros when ready.

josevalim · May 09 '22 06:05

For completeness, the operations that will be affected by this are: mutate, summarize, filter, arrange, distinct.

josevalim · Jun 14 '22 16:06

Only one last step is missing. We are discussing it in #507.

josevalim · Feb 16 '23 22:02