polars
polars copied to clipboard
feat(python): Add remap method.
WIP.
Not sure about the name remap
Maybe map_dictor something like that, might be better.
Suggestions for better method argument names are welcome too.
TODO:
- Add tests
- Dispatch to Series
- Add methods to documentation
- Add fast path if no default argument is set (or is None).
https://github.com/pola-rs/polars/issues/2377 would have been useful to have. Now I had to use:
.with_column(pli.lit(True).alias(is_remapped_column)),
...
pli.when(pli.col(is_remapped_column).is_not_null())
```
I think map_dict is a better name, as it tells you that you are doing a map, with a dict as input, in contrast to the current map which takes in any callable. remap could also be about updating column names or something.
The nicest feature, but that is probably much more work, is if we could get this as an expression as well. Currently, you have four parameters, only two would be needed for that:
column: str,
remapping: dict[Any, Any],
default: Any = None,
new_column: str | None = None,
pl.col("input_col").map_dict({"hello": 1, "world", 2}, default="default text here").alias("output_col")
As for the naming:
columnandnew_columnseem fine. Alternatively,output_column?remapping: maybe justdict?default: I'm trying to find other methods which have this sort of placeholder value (for consistency), I guess we sometimes usefill_value? For example https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.unstack.html.
I also think map_dict might be better.
I think, dict is a reserved keyword, so overriding it is not great. mapping might be already better.
fill_value might indeed be better than default. Altought with an pl.col(X) value as a default value, fill_value might look weird.
Instead of new_column (recycled from read_csv) , alias would be another option. Although the use of alias in polars syntax was always slightly weird for me, but would be more consistent.
I like the following syntax more, but as we need join, I don't know how to get the whole dataframe from the column expression. I assume over needs access to the whole dataframe too, so it might be possible.
pl.col("input_col").map_dict({"hello": 1, "world", 2}, default="default text here").alias("output_col")
Is it reasonable to say that map could take either a Callable or a dictionary? Then the default value could be handle via fill_null and new column via alias
First of all thanks for working on this <3
Naming:
I think map_dict + mapping + default is fine (map would be great but that is already taken and I guess overloading is not possible?)
Implementation: I don't know the implementation details but I wonder two things:
- why is this implemented on the DataFrame? Not sure if/why this is needed but as an end-user I see use cases for expressions (
pl.col("a").map_dict(d)) and for series (s.map_dict(d)). Pandas also only has this on Series and Index and not on the DataFrame - If this is implemented on DataFrame I think we should remove
columnandnew_columnparameters and apply the mapping to all columns. This would be consistent with other methods on DataFrame likemapormaxthat apply to all columns. (On DataFrame: all columns, On Series/Expression: only specified columns)
What do you think?
So we have various map methods & functions currently:
LazyFrame.map=> take in callable, return lazy frameExpr.map=> takes in callable, return expressionpl.map=> like,Expr.map, but with the expression as first argument (rather than object), and noagg_listparameter
DataFrame.map has not even been taken yet (to my surprise). If we are going with map_dict, actually overriding map by passing allowing one to pass in a dict rather instead of a callable seems not such a large step. Operating on a dict is basically operating on the callable dict.get, but ideally in an optimized way.
@ghuls could we do this with a map_dict expression? I don't feel this should be a method an the DataFrame/LazyFrame.
The function that gets the Series could turn this into a DataFrame and do the join on that internally. Or am I missing a complicating factor?
@ghuls could we do this with a
map_dictexpression? I don't feel this should be a method an theDataFrame/LazyFrame.The function that gets the
Seriescould turn this into aDataFrameand do the join on that internally. Or am I missing a complicating factor?
I forgot about the possibility of making a DataFrame from a Series. It should indeed work.
Definitely feels like an expression; also, rather than continuing to overload the name map, how about a clean break and call the expression switch, to declare what is done (switching values), rather map_dict which (at least to me 😅) feels more like it declares how it is done? (with a dict).
This could be my C++ past coming back to haunt me, but (structurally) it follows exactly same pattern, so there is a history/precedent for the name...
#include <iostream>
using namespace std;
// trivial example
int main () {
char country_code = 'FR';
switch( country_code ) {
case 'FR' :
cout << "France" << endl;
break;
case 'ES' :
cout << "Spain" << endl;
break;
default :
cout << "Unknown" << endl;
}
return 0;
}
...and as a polars expression:
pl.col('country_code').switch(
{'FR':'France', 'ES':'Spain'}, default='Unknown'
)
If not, I think pl.col('xyz').remap probably does looks cleaner than pl.col('xyz').map_dict, but that's not a hill I'd die on... ;)
Yes, it should be an expression. At the time I couldn't figure out how to do it due the use of join (forgot that you can transform a series to a dataframe).
I just looked for some synonyms: https://www.wordhippo.com/what-is/another-word-for/switch.html
swap might be a nice and clear one.
I also likesubstitute, but it is a bit long and substitution feels to closely to substring subsitution replacement .
Yes, it should be an expression. At the time I couldn't figure out how to do it due the use of join (forgot that you can transform a series to a dataframe).
I just looked for some synonyms: https://www.wordhippo.com/what-is/another-word-for/switch.html
swapmight be a nice and clear one. I also likesubstitute, but it is a bit long and substitution feels to closely to substring subsitution replacement .
I agree that swap is very clear for the once were there is a key but for for the ones with no match you would expect swap to keep the original. So remap might be better.
@ghuls any updates on this one? I can take it from here if you like.
@ghuls any updates on this one? I can take it from here if you like.
Looking at it again.
@ritchie46 I was looking back in Discord to check how to implement remap as an expression instead of on the dataframe, but it looks like currently it is not possible according to your comment at the time:
Getting the df in an expression is a bit harder. It would need a dedicated expression. In the rust physical plan
Or is there another way? (I tried pl.select(self), but that does not seem to work).
Create a map function that accepts a Series. The Series can convert to a DataFrame when the mapped function is executed.
Thanks @ghuls. I took your code and converted it to an expression. I don't think it should be on the frames. This also removes the need for an output column name.
Ah, thanks. I was just working on the same. I also managed to get it to work if you select another column as default.
Thanks @ghuls. I took your code and converted it to an expression. I don't think it should be on the frames. This also removes the need for an output column name.
@ritchie46: Can we change the name before release? I think remap was looking like the favourite (though I'd still vote for switch, heh)
I think map_dict is most explicit. I would need to lookup remap and be wondering how it relates to map itself.
Fair enough; I do not feel strongly about this, I'm just happy the feature exists :)
(Thx again @ghuls!)
Ah, thanks. I was just working on the same. I also managed to get it to work if you select another column as default.
Using other column expressions than the column that is map_dict-ed for the default value, is possible with: https://github.com/pola-rs/polars/pull/6781
Regarding the name, I had suggested lookup (possibly dict_lookup) in #3789, but IMO map_dict says everything it needs to say. It really is just a map operation with a fixed dict instead of an arbitrary function, after all. I like it.
Being able to throw on missing value instead of defaulting could be practical in some cases, but this is already merged, so I guess it would have to be for a separate feature PR.