polars feat(python): Add remap method.

Dec 23 '22 17:12 ghuls

WIP.

Not sure about the name remap Maybe map_dictor something like that, might be better. Suggestions for better method argument names are welcome too.

TODO:

Add tests
Dispatch to Series
Add methods to documentation
Add fast path if no default argument is set (or is None).

https://github.com/pola-rs/polars/issues/2377 would have been useful to have. Now I had to use:

.with_column(pli.lit(True).alias(is_remapped_column)),
...
pli.when(pli.col(is_remapped_column).is_not_null())
```

Dec 23 '22 17:12 ghuls

I think map_dict is a better name, as it tells you that you are doing a map, with a dict as input, in contrast to the current map which takes in any callable. remap could also be about updating column names or something.

The nicest feature, but that is probably much more work, is if we could get this as an expression as well. Currently, you have four parameters, only two would be needed for that:

        column: str,
        remapping: dict[Any, Any],
        default: Any = None,
        new_column: str | None = None,

pl.col("input_col").map_dict({"hello": 1, "world", 2}, default="default text here").alias("output_col")

As for the naming:

column and new_column seem fine. Alternatively, output_column?
remapping: maybe just dict?
default: I'm trying to find other methods which have this sort of placeholder value (for consistency), I guess we sometimes use fill_value? For example https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.unstack.html.

Dec 23 '22 18:12 zundertj

I also think map_dict might be better. I think, dict is a reserved keyword, so overriding it is not great. mapping might be already better.

fill_value might indeed be better than default. Altought with an pl.col(X) value as a default value, fill_value might look weird.

Instead of new_column (recycled from read_csv) , alias would be another option. Although the use of alias in polars syntax was always slightly weird for me, but would be more consistent.

I like the following syntax more, but as we need join, I don't know how to get the whole dataframe from the column expression. I assume over needs access to the whole dataframe too, so it might be possible.

pl.col("input_col").map_dict({"hello": 1, "world", 2}, default="default text here").alias("output_col")

Dec 23 '22 21:12 ghuls

Is it reasonable to say that map could take either a Callable or a dictionary? Then the default value could be handle via fill_null and new column via alias

Dec 23 '22 22:12 gab23r

First of all thanks for working on this <3

Naming: I think map_dict + mapping + default is fine (map would be great but that is already taken and I guess overloading is not possible?)

Implementation: I don't know the implementation details but I wonder two things:

why is this implemented on the DataFrame? Not sure if/why this is needed but as an end-user I see use cases for expressions (pl.col("a").map_dict(d)) and for series (s.map_dict(d)). Pandas also only has this on Series and Index and not on the DataFrame
If this is implemented on DataFrame I think we should remove column and new_column parameters and apply the mapping to all columns. This would be consistent with other methods on DataFrame like map or max that apply to all columns. (On DataFrame: all columns, On Series/Expression: only specified columns)

What do you think?

Dec 23 '22 22:12 Julian-J-S

So we have various map methods & functions currently:

LazyFrame.map => take in callable, return lazy frame
Expr.map => takes in callable, return expression
pl.map => like, Expr.map, but with the expression as first argument (rather than object), and no agg_list parameter

DataFrame.map has not even been taken yet (to my surprise). If we are going with map_dict, actually overriding map by passing allowing one to pass in a dict rather instead of a callable seems not such a large step. Operating on a dict is basically operating on the callable dict.get, but ideally in an optimized way.

Dec 24 '22 10:12 zundertj

@ghuls could we do this with a map_dict expression? I don't feel this should be a method an the DataFrame/LazyFrame.

The function that gets the Series could turn this into a DataFrame and do the join on that internally. Or am I missing a complicating factor?

Dec 24 '22 12:12 ritchie46

@ghuls could we do this with a map_dict expression? I don't feel this should be a method an the DataFrame/LazyFrame.

The function that gets the Series could turn this into a DataFrame and do the join on that internally. Or am I missing a complicating factor?

I forgot about the possibility of making a DataFrame from a Series. It should indeed work.

Dec 27 '22 23:12 ghuls

Definitely feels like an expression; also, rather than continuing to overload the name map, how about a clean break and call the expression switch, to declare what is done (switching values), rather map_dict which (at least to me 😅) feels more like it declares how it is done? (with a dict).

This could be my C++ past coming back to haunt me, but (structurally) it follows exactly same pattern, so there is a history/precedent for the name...

#include <iostream>
using namespace std;

// trivial example 
int main () {
   char country_code = 'FR';

   switch( country_code ) {
      case 'FR' :
         cout << "France" << endl; 
         break;
      case 'ES' :
         cout << "Spain" << endl;
         break;
      default :
         cout << "Unknown" << endl;
   }
   return 0;
}

...and as a polars expression:

pl.col('country_code').switch( 
  {'FR':'France', 'ES':'Spain'}, default='Unknown'
)

If not, I think pl.col('xyz').remap probably does looks cleaner than pl.col('xyz').map_dict, but that's not a hill I'd die on... ;)

Jan 02 '23 05:01 alexander-beedie

Yes, it should be an expression. At the time I couldn't figure out how to do it due the use of join (forgot that you can transform a series to a dataframe).

I just looked for some synonyms: https://www.wordhippo.com/what-is/another-word-for/switch.html

swap might be a nice and clear one. I also likesubstitute, but it is a bit long and substitution feels to closely to substring subsitution replacement .

Jan 02 '23 14:01 ghuls

Yes, it should be an expression. At the time I couldn't figure out how to do it due the use of join (forgot that you can transform a series to a dataframe).

I just looked for some synonyms: https://www.wordhippo.com/what-is/another-word-for/switch.html

swap might be a nice and clear one. I also likesubstitute, but it is a bit long and substitution feels to closely to substring subsitution replacement .

I agree that swap is very clear for the once were there is a key but for for the ones with no match you would expect swap to keep the original. So remap might be better.

Jan 03 '23 10:01 thomasfrederikhoeck

@ghuls any updates on this one? I can take it from here if you like.

Jan 31 '23 07:01 ritchie46

@ghuls any updates on this one? I can take it from here if you like.

Looking at it again.

Feb 09 '23 20:02 ghuls

@ritchie46 I was looking back in Discord to check how to implement remap as an expression instead of on the dataframe, but it looks like currently it is not possible according to your comment at the time:

Getting the df in an expression is a bit harder. It would need a dedicated expression. In the rust physical plan

Or is there another way? (I tried pl.select(self), but that does not seem to work).

Feb 09 '23 22:02 ghuls

Create a map function that accepts a Series. The Series can convert to a DataFrame when the mapped function is executed.

Feb 10 '23 05:02 ritchie46

Thanks @ghuls. I took your code and converted it to an expression. I don't think it should be on the frames. This also removes the need for an output column name.

Feb 10 '23 10:02 ritchie46

Ah, thanks. I was just working on the same. I also managed to get it to work if you select another column as default.

Feb 10 '23 10:02 ghuls

Thanks @ghuls. I took your code and converted it to an expression. I don't think it should be on the frames. This also removes the need for an output column name.

@ritchie46: Can we change the name before release? I think remap was looking like the favourite (though I'd still vote for switch, heh)

Feb 10 '23 11:02 alexander-beedie

I think map_dict is most explicit. I would need to lookup remap and be wondering how it relates to map itself.

Feb 10 '23 11:02 ritchie46

Fair enough; I do not feel strongly about this, I'm just happy the feature exists :)

(Thx again @ghuls!)

Feb 10 '23 12:02 alexander-beedie

Ah, thanks. I was just working on the same. I also managed to get it to work if you select another column as default.

Using other column expressions than the column that is map_dict-ed for the default value, is possible with: https://github.com/pola-rs/polars/pull/6781

Feb 10 '23 13:02 ghuls

Regarding the name, I had suggested lookup (possibly dict_lookup) in #3789, but IMO map_dict says everything it needs to say. It really is just a map operation with a fixed dict instead of an arbitrary function, after all. I like it.

Being able to throw on missing value instead of defaulting could be practical in some cases, but this is already merged, so I guess it would have to be for a separate feature PR.

Feb 10 '23 14:02 sm-Fifteen

polars polars copied to clipboard

feat(python): Add remap method.

polars
polars copied to clipboard