polars icon indicating copy to clipboard operation
polars copied to clipboard

Rationalize with_column & with_columns

Open zundertj opened this issue 2 years ago • 25 comments

Problem description

Initial discussion was on Discord, moving here so everyone can join.

The DataFrame methods with_column and with_columns tend to be used quite often, as these are, together with select, the primary way to run expressions on dataframes. I have a couple of issues with these though:

  1. with_columns allows everything with_column does, except checking that only one expression is passed in. Thus it seems to me very marginal benefit to have two methods. Marginal because I dont think there is ever confusion on whether you pass in one or multiple expressions.

  2. with_column also does not support the (experimental) assignment syntax currently

pl.Config.with_columns_kwargs = True
df.with_columns(b=pl.col("a") * 2)  # works
df.with_column(b=pl.col("a") * 2)  # fails
  1. (per Ritchie's comment on Discord that they return a new dataframe): the docstrings are not clear what is copied and what not:
  • with_column: Return a new DataFrame with the column added or replaced.
  • with_columns: Add or overwrite multiple columns in a DataFrame. The latter suggests there is no new dataframe created, but there is.
  1. I find with_columns quite verbose given how often it is used. We have also opted to use pl.col and not pl.column, and pl.lit rather than pl.literal. My initial proposal was assign, but @ritchie46 's point is that this may give the impression no new dataframe is being returned, while there is. So my new proposal would be to shorten to with, except that with is a reserved keyword In Python. So maybe with_col, but that seems quite a marginal gain?

Putting it altogether:

  1. deprecate with_column
  2. rename with_columns to with_col?
  3. fix the docstring of with_columns/with_col to reflect that a new dataframe is being created

zundertj avatar Jan 08 '23 11:01 zundertj

I don't think we can. with is a python keyword.

ritchie46 avatar Jan 08 '23 11:01 ritchie46

Yep, you are right, that wont fly. Updated the post to reflect that.

zundertj avatar Jan 08 '23 11:01 zundertj

Dropping with_column in favour of with_columns is definitely a good move; in the same vein as the other abbreviations how about shortening to with_cols? Everyone's muscle memory (and autocomplete) will be halfway there already... ;)

Note: passing more than one column to with_column would look really odd, whereas passing one or more to with_columns would look pretty normal, hence favouring deprecation of the singular form.

Also, let's make the currently-experimental calling pattern allowed by default? It has proven really popular over here, and I shudder to think what would happen if it went away...😅

alexander-beedie avatar Jan 08 '23 11:01 alexander-beedie

Note that with_columns already supports the singular form, so it is not just normal, it is the current behaviour.

So I think we have:

  1. deprecate with_column
  2. fix docstring with_columns => #6122
  3. make experimental calling pattern the default; remove flag.

My hesitation with with_cols or with_col is that it does not seem to be enough of an improvement, first to break with for example Snowflake, and second to do the deprecation. I guess the latter point, given we are 0.x and can easily support this change, is not too big of a problem, but I guess I'm hoping for a better name to come up over time, which would put users through another cycle of updating code.

zundertj avatar Jan 08 '23 16:01 zundertj

My hesitation with with_cols or with_col is that it does not seem to be enough of an improvement ... would put users through another cycle of updating code.

True; it's likely to be one of the most widespread method calls too; that's a lot of search/replace without an obvious gain; probably not what you want right as polars feels like it's getting some real critical mass.

alexander-beedie avatar Jan 08 '23 16:01 alexander-beedie

If one would ever rename with_columns, then perhaps a reasonable renaming would be mutate, which would bring it in line with tidyverse in R.

pjj avatar Jan 08 '23 16:01 pjj

If one would ever rename with_columns, then perhaps a reasonable renaming would be mutate, which would bring it in line with tidyverse in R.

It's a very vague/ambiguous name though... like: what does it mutate and how? The only people who might stand a chance at guessing will be the R illuminati ;)

alexander-beedie avatar Jan 08 '23 17:01 alexander-beedie

mutate would have the same issue as assign, that it suggests the dataframe is modified rather than that a new dataframe is being returned. In that case, I would prefer assign over mutate.

zundertj avatar Jan 08 '23 18:01 zundertj

True; it's likely to be one of the most widespread method calls too; that's a lot of search/replace without an obvious gain

The most used functions should get the shortest names because you type it so often imo.

Marmeladenbrot avatar Jan 08 '23 19:01 Marmeladenbrot

I am in favor of keeping one with_columns. I don't mind a few keystrokes to keep readable english names.

I don't think it's worth a refactor. Once you are at with_c a tab does the rest.

ritchie46 avatar Jan 08 '23 19:01 ritchie46

And in truth it does actually do a good job of telling you what's going to happen, in a way that assign or mutate just don't. You're going to get the DataFrame... with columns, and those columns are going to be the ones you give it; it's very clear.

alexander-beedie avatar Jan 08 '23 19:01 alexander-beedie

Glad to see some backings today. Tough day😅

ritchie46 avatar Jan 08 '23 19:01 ritchie46

I agree that in absence of an equally clear, but shorter name, we should not change. I thought with would be that one, but that won't work in combination with Python. To be sure, for me shorter names are not about keystrokes (with_c<tab> is easy enough), but about readability. Longer names, if not more informative over their shorter counterpart, make imo code harder to follow, as more reading is required, more line breaks are required, etc.

I have raised a draft PR for deprecating with_column, see https://github.com/pola-rs/polars/pull/6128.

zundertj avatar Jan 08 '23 22:01 zundertj

Why not:

  • create an alias .wc -> .with_columns (or .wcols or whatever short)
  • expose both in the api
  • make clear in the doc that .with_columns is the primary method?

I agree with both keystrokes and readability arguments.

Or another route is to give the option to extend the DataFrame methods.

(btw, my first comment on that project => huge thanks, polars is amazing)

cbzittoun avatar Jan 11 '23 14:01 cbzittoun

Why not/ create an alias .wc -> .with_columns (or .wcols or whatever short)

@cbzittoun: Aside from "wc" meaning toilet (seriously ;) it doesn't really offer a meaningful advantage, and it adds clutter to the API surface.

Or another route is to give the option to extend the DataFrame methods.

Now that is actually supported - as long as you do it in a custom namespace:
https://pola-rs.github.io/polars/py-polars/html/reference/api.html

alexander-beedie avatar Jan 11 '23 17:01 alexander-beedie

ahah yes but there is no reason the bathroom namespace should apply here

thanks for the the other solution, that's useful

cbzittoun avatar Jan 11 '23 17:01 cbzittoun

I've a small suggestion (that could break things, though) about with_columns. Currently, this method accepts a single parameter that is either an expression or a list of expressions (i.e., df.with_columns([..., ...]) to create two new columns). I propose to use *args instead so that df.with_columns([..., ...]) can be written df.with_columns(..., ...). This has several advantages:

(1) It does not require to explicitly create a list of expressions (and prevents writing extra [ and ]); (2) It supports both the creation of a single new column or many at once (so the API for a single column won't change, and adding more columns just requires providing more parameters instead of converting the parameter to a list and extend it); (3) It's probably more Pythonic ;-) (4) Support for named expressions is really interesting (and I hope it will be enabled by default soon) since it makes the names of the new columns visible before their definition, as in pandas' .assign (using .alias means that the name can only be found at the end of the expression). By using *args instead of a list in with_columns, we increase the consistency since one can easily go from a **kwargs-based call to an *args-based call and vice-versa:

df.with_columns(
  expr1, 
  expr2, 
  expr3,
)

can be easily converted to/from:

df.with_columns(
   a=expr1, 
   b=expr2, 
   c=expr3,
)

AlexandreDecan avatar Jan 17 '23 09:01 AlexandreDecan

I also like the idea of pl.Config.with_columns_kwargs = True being enabled by default. When using black code format the code "expands". I'm not familiar with the reasoning for using lists or if changing it to **kwargs-based would be worth it, though.

arturdaraujo avatar Jan 17 '23 16:01 arturdaraujo

I'm not suggesting to go for a full (and only) kwargs-based api but for a combination of *args and **kwargs. In particular I think it would make more sense to use *args instead of a list. But for consistency, that means that select, filter, col, etc should probably be "converted" to receive *args instead of a list, which may lead to backward incompatibilities...

AlexandreDecan avatar Jan 19 '23 19:01 AlexandreDecan

@AlexandreDecan: the ergonomic downside with *args style is passing in an existing (or programmatically-generated) list of expressions into the method; you would be amazed at how easy it is for people to forget to splat them in (eg: with_columns(*expressions)) and then wander all over the code wondering where the error is.

I've designed significant APIs both ways in the past, but that tends to take users a while to really ingrain (which isn't to say I don't like it, as I currently have an API using this style at work - however, it was designed to be that way from the start, and is consistent across all interfaces). Without a compelling advantage (vs some minor niceties), I doubt we'd want to make a breaking change like that across multiple commonly-used methods without something/someone really driving it...

alexander-beedie avatar Jan 23 '23 15:01 alexander-beedie

Thanks for your answer! I understand the rationale. But in that case, I would be in favour, for consistency, not to allow the use of **kwargs and rather to allow to pass a dictionary instead of a list to support named expressions, i.e., df.with_columns({'a': expr1, 'b': expr2, 'c': expr3}) instead of df.with_coumns(a=expr1, b=expr2, c=expr3).

That way, the number of parameters is always exactly 1, and there's a parallel between passing a list of unnamed expressions and passing a dict of named expressions (this parallel could have been achieved by using *args and **kwargs, but I understand you prefer to avoid them).

AlexandreDecan avatar Jan 23 '23 15:01 AlexandreDecan

Thanks for your answer! I understand the rationale. But in that case, I would be in favour, for consistency, not to allow the use of **kwargs and rather to allow to pass a dictionary instead of a list to support named expressions, i.e.,

That would be very unergonomic - there is no reason for **kwargs style calls (an extremely common calling convention in python) to squeeze into a single parameter as a dict; I'd argue that would lose the reason for supporting this mode in the first place, which is primarily about how simple/pythonic it is ;)

alexander-beedie avatar Jan 23 '23 15:01 alexander-beedie

Perhaps less ergonomic, but consistent with your choice for not supporting *args. Anyway, I prefer to have **kwargs rather than nothing :-p

Apart from this, and going back to your previous message in which you said "the ergonomic downside with *args style is passing in an existing (or programmatically-generated) list of expressions into the method;", what about supporting all these uses cases? For example:

def with_columns(self, *args, **kwargs):
  if len(args) == 1:
    if isinstance(args[0], list):
      args = args[0]
   ...

AlexandreDecan avatar Jan 23 '23 16:01 AlexandreDecan

consistent with your choice for not supporting *args

@AlexandreDecan: I admit I still don't see the parallel, but it seems we may be going for it after all! Note that I was pointing out that it isn't a free lunch, not that it isn't worthwhile - see here for details of the proposal :)

alexander-beedie avatar Jan 26 '23 02:01 alexander-beedie

Good news :-)

AlexandreDecan avatar Jan 26 '23 07:01 AlexandreDecan

Coming back to the naming of with_columns, on Discord there was the suggestion select_with, which emphasizes that with_columns(new_expr) is equivalent to .select([pl.all(), new_expr])). I also like that it no longer has the columns in the name, we also do not have filter_rows but filter, select rather than select_columns, etc. Downside is that with_columns is somewhat established, but I think overall this is an improvement. Any thoughts?

zundertj avatar Feb 04 '23 09:02 zundertj

It doesn't really highlight that the data frame is returned as-is with new columns (ie there is no projection).

AlexandreDecan avatar Feb 04 '23 09:02 AlexandreDecan

I really don't think we should remove with_columns. The expressions are added to the DataFrame as new columns. So I think the name is perfect.

ritchie46 avatar Feb 04 '23 12:02 ritchie46

This consolidation was completed for the 0.16 release, with a deprecation period via transparent @redirect. Prepare to autocomplete that extra "s" , there is now just one method ;) 👍

alexander-beedie avatar Feb 11 '23 07:02 alexander-beedie

Thanks @zundertj for bringing up this topic here. I was the one who recommended select_with on Discord, so I'm adding my comment here.

It's obvious to me that with_columns adds new columns but I think its equivalence to select(pl.all()) doesn't contribute to a simple and consistent terminology.

The latest move from with_column to with_columns shows a desire to add new keywords sparingly rather than for convenience. I think this goal can be applied further to select with an arg to select all (perhaps v messy,), or to use select_with (which becomes a polars-specific keyword but which appears beside the select in the api).

Since there's the extensible api, perhaps another approach that doesn't remove the conveniences is to move with_columns into a utility 'crate' to mimic the myriad functions that motivated the introduction of with_column and with_columns in the first place. It would only be used by those coming from pyspark-ish environments, and I imagine they'd want more than just with_columns.

Seeing the pyspark api for select, selectExpr, withColumn, withColumns, withColumnRenamed, I'm thankful that polars with_column is being obviated by with_columns

Looking forward to any feedback

shahankhatch avatar Feb 14 '23 02:02 shahankhatch